The promise of AI, computer vision, and machine learning is to learn insights from data, and the quality and quantity of that data usually define the results. Empirically, up to 90% of an AI-based solution's success depends on the quality of its training data. An AI model only learns the patterns present in the data it has seen.
That's why it's important to talk about the challenges in collecting data.
First of all, the data should be as close as possible to the real-world distribution the model will encounter in production.
Let's imagine we're building a computer vision system for self-driving cars. To train a computer vision model, we collect terabytes of data from cameras and sensors in California, where we have an operations site.
The AI has learned how to drive based on that data. But what happens if a car with such an autopilot ends up, for example, in Alaska? There's a lot of snow! But the AI knows nothing about snow: the original dataset contained only sunny streets and palm trees. And what if the car then drives in, say, Vietnam?
When collecting data, it is important to consider all major target use cases — they must be sufficiently covered by the dataset. Otherwise, the model will fail even in the most basic scenarios.
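To make the coverage requirement concrete, here is a minimal sketch (the scenario tags and sample counts are hypothetical) that tags each training sample with the conditions it was captured in and checks whether every target scenario is represented:

```python
from collections import Counter

# Hypothetical condition tags attached to each training sample
training_tags = ["sunny"] * 9000 + ["rain"] * 800 + ["night"] * 200

# The scenarios the deployed system is expected to handle
target_scenarios = {"sunny", "rain", "night", "snow"}

counts = Counter(training_tags)
for scenario in sorted(target_scenarios):
    share = counts[scenario] / len(training_tags)
    print(f"{scenario}: {counts[scenario]} samples ({share:.1%})")

# Scenarios with zero training examples are guaranteed blind spots
missing = target_scenarios - counts.keys()
print("Not covered at all:", missing or "none")  # here: {'snow'}
```

Running such a check before training makes gaps like the missing "snow" scenario visible early, while they are still cheap to fix.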
Tip. When evaluating usage scenarios, it is important to consider the "human factor". In other words, it is useful to think about how people might try to "hack" the system.
For example, when creating a system to check a person's face against their passport photo, it is useful to consider the desire of attackers to defraud the system. What happens if a person points the smartphone camera not at his or her face, but at a large photo of another person?
The quality of the data is also a crucial matter. First of all, the higher the camera's resolution, the more information we can potentially extract and the better the model will work. A VHS-quality camera will not be enough to recognize a person's face with 99% accuracy from 10-15 meters away; from that distance and at that quality, we will be lucky to detect a person at all.
Tip. You can estimate the quality of the data on your own. The rule of thumb: if you can recognize the person, face, or object of interest in the video yourself, then a properly trained AI most likely can too.
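As a rough illustration of the resolution question, here is a back-of-envelope sketch based on a simple pinhole-camera model (the 16 cm average face width and the example camera parameters are assumptions, not recommendations):

```python
import math

def face_pixels(horizontal_res: int, fov_deg: float, distance_m: float,
                face_width_m: float = 0.16) -> int:
    """Estimate how many horizontal pixels a face occupies in the frame."""
    # Width of the scene visible at the given distance (pinhole camera model)
    scene_width_m = 2 * distance_m * math.tan(math.radians(fov_deg) / 2)
    return round(horizontal_res * face_width_m / scene_width_m)

# A 1080p camera with a 90-degree field of view, subject 10 m away:
print(face_pixels(1920, 90, 10))  # 15 pixels -- far too few for face recognition
# The same camera with the subject 2 m away:
print(face_pixels(1920, 90, 2))   # 77 pixels -- workable
```

Estimates like this help decide, before any data is collected, whether a planned camera placement can deliver enough detail for the task.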
The cameras should be of the same specification (or very close in characteristics) and installed similarly to how they will be installed after the start of operation. Otherwise, we risk facing unexpected problems in production.
It's also worth taking environment or setup changes into account. For example, you use one camera for computer vision now, and in a year, after the room is renovated, you switch to another (e.g. a smart computer vision camera). If the new camera has a different resolution, color rendering, FoV, or mounting location, the model will have to be adapted and retrained.
Of course, we can help you organize the right data collection to train the model for you. However, no one understands the key scenarios in which the model will be used better than you do, and no one knows better the technical constraints that affect camera operation.
Tip. If you need help with data collection, there are different ways to obtain training data without gathering it yourself. For example, in some cases you can buy a ready-made dataset; in others, you can collect one through crowdsourcing.
Sometimes a dataset is difficult to assemble or annotate. Suppose we are building a system that evaluates MRI results using computer vision. To train the model, we need to obtain a set of MRI images, have experienced medical specialists annotate the data, and often secure a number of permissions as well. If something goes wrong, recollecting such a dataset is costly.
Tip. In addition to data quantity and data quality management, it is advised to think about the organization of data collection and storage. Automation of data collection and processing, backups, and storage are very important. Our data experts will help you to set up the data collection process and organize the data handling pipeline.
What about photos and videos from social networks and the open internet? First of all, such data is protected by licenses. Using it for training is often simply illegal, even if the content appears to be distributed freely.
Tip. Do not use people's social media data for training. It can lead to millions of dollars in fines. Big companies have both lawyers and ways to prove data misuse.
In addition, such a dataset is often irrelevant. For example, if you create a model that estimates a person's age from a photo and train it on photos of actors, the model will be consistently wrong, because actors obviously look different from the general population.
Working on AI models and computer vision systems is a process of continuous improvement. First, we train the model for a basic scenario, and then we add more complex cases iteratively.
Imagine we teach the car to drive on bright, clean streets at low speed, like the quiet suburbs of Los Angeles. But most real streets, of course, aren't quiet suburbs. So the model is trained further: for example, on data from nighttime driving, or from streets with heavy bicycle and pedestrian traffic. Then we teach the model how to behave around large animals on the road.
In short, AI training is a complex but sufficiently well-defined process. And with each stage, the quality of the model improves. It is important to note that it is impossible to create a model with 100% accuracy. New complex, unknown cases can always appear. We cannot achieve the impossible, but we strive for it!
For that, we need data once again. Proper data helps us not only to train the model but also to verify that it has been trained correctly.
To do this, we divide the original dataset into three parts. The model developers use two of them:
The Training set is used to train the model. Drawing an analogy to human learning, this is the textbook material.
The Validation set is used to monitor and adjust the training process. These are the practice problems in the textbook, with the correct answers printed at the back of the book.
But there is also a third one, the Test set. It is needed to test the final model. Unlike the Training and Validation sets, neither the model nor the developers have access to the Test set before the final step. We run the model through it and measure the quality, but we cannot use it to make training decisions. These are the exam problems: the student has practiced on similar ones, but never on these exact tasks.
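The three-way split described above can be sketched in a few lines (the 70/15/15 proportions here are a common convention, not a hard rule):

```python
import random

def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=42):
    """Split a dataset into training, validation, and test subsets."""
    shuffled = samples[:]                  # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # the held-out "exam" set
    return train, val, test

train, val, test = split_dataset(list(range(1000)))
print(len(train), len(val), len(test))  # 700 150 150
```

Shuffling before splitting matters: if the data is ordered (say, by date or location), a naive slice would give the three sets different distributions.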
Let's get to know each other and chat about your task. Together we'll discuss what data might be useful in your case, and how we can get it — so that computer vision and AI can be of maximum use to you.