Bounding boxes are useful for counting, tracking, localization, and other problems where the shape of the object is not important. Object detectors also benefit from fast inference and better convergence (requiring less data and time to train a model).
However, many tasks that have traditionally relied on detectors would now be better served by segmentation models:
• Robotics/Autonomous Driving: We need to know exactly where an obstacle or a manipulated object is. Detectors were previously used mainly because of limited computing resources.
• Tracking: The IoU of boxes between frames (one of the simplest and most effective cues for tracking) is often low for sparse, non-square objects; a small shift can drop it to zero (see the sketch after this list).
• Many tasks simply require masks rather than boxes: image editing (inpainting, image matting), 3D reconstruction, and others.
• Modern architectures are fast enough on modern hardware.
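To make the tracking point concrete, here is a tiny, generic box-IoU function (not tied to any tracker or library; the numbers are made up for illustration):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A thin 100x10 object that shifts by 12 px between frames: the object moved
# only ~12% of its length, yet the boxes no longer overlap at all.
print(box_iou((0, 0, 100, 10), (0, 12, 100, 22)))  # -> 0.0
```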
Segmentation provides richer information than bounding boxes. With new approaches, we can train segmentation models faster and with less data. However, classical task-specific models have two big disadvantages:
• Low Reusability: For example, a COCO-trained model cannot segment eyes because it lacks that class in the training set.
• Limited Training Data: If we want to segment roofs on buildings, we need a dataset with labeled roofs.
Pretraining techniques exist, but LLMs have demonstrated the generalization power that comes from training on diverse data. Such general, or foundation, models can solve new, unseen tasks without additional training, using only our prompts.
Facebook AI Research described how they built such a foundation segmentation model, SAM, in their Segment Anything paper.
SAM's architecture is a classical encoder-decoder: two encoders, one for images and one for prompts, and a decoder that predicts masks. The image encoder is a pretrained Vision Transformer that converts the image into an embedding; the decoder then combines the prompt embeddings with this image embedding.
The prompt encoder has two parts:
1. Dense prompts (masks) are embedded using convolutions and summed element-wise with the image embedding.
2. Sparse prompts are embedded as:
2.1. text, processed by a pretrained CLIP text encoder,
2.2. points (foreground or background) and boxes (given by their top-left and bottom-right corners), represented by positional encodings.
In addition, there is a learned embedding for each prompt type, added on top of the positional encoding, so the type of a point or box corner is carried by the embedding itself. A simplified sketch of this encoding is shown below.
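To make the prompt encoder easier to picture, here is a minimal PyTorch sketch of the idea, not the actual SAM code: sparse prompts become positional encodings plus learned per-type embeddings, and dense mask prompts are downscaled with convolutions so they can later be summed with the image embedding. The class, the dimensions, and the linear layer standing in for the real positional encoding are all illustrative.

```python
import torch
import torch.nn as nn

EMBED_DIM = 256  # illustrative embedding size

class ToyPromptEncoder(nn.Module):
    """Simplified sketch of SAM-style prompt encoding (not the original code)."""

    def __init__(self, embed_dim: int = EMBED_DIM):
        super().__init__()
        # One learned embedding per sparse prompt type (illustrative convention):
        # 0 = background point, 1 = foreground point, 2 = box top-left, 3 = box bottom-right.
        self.type_embed = nn.Embedding(4, embed_dim)
        # Project 2D coordinates into the embedding space
        # (a stand-in for the positional encoding used in the paper).
        self.pos_encode = nn.Linear(2, embed_dim)
        # Dense (mask) prompts are downscaled by convolutions and later
        # summed element-wise with the image embedding.
        self.mask_downscale = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=2, stride=2), nn.GELU(),
            nn.Conv2d(16, embed_dim, kernel_size=2, stride=2),
        )

    def forward(self, coords: torch.Tensor, types: torch.Tensor, mask: torch.Tensor):
        # coords: (N, 2) normalized xy points; types: (N,) ints; mask: (1, 1, H, W)
        sparse = self.pos_encode(coords) + self.type_embed(types)   # (N, embed_dim)
        dense = self.mask_downscale(mask)                           # (1, embed_dim, H/4, W/4)
        return sparse, dense

encoder = ToyPromptEncoder()
coords = torch.tensor([[0.25, 0.40], [0.70, 0.55]])   # one foreground point + one box corner
types = torch.tensor([1, 2])
mask = torch.zeros(1, 1, 256, 256)                     # a low-resolution mask prompt
sparse, dense = encoder(coords, types, mask)
print(sparse.shape, dense.shape)  # torch.Size([2, 256]) torch.Size([1, 256, 64, 64])
```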
The mask decoder consists of two modified transformer decoder blocks that combine the image and prompt embeddings into the final mask. Keeping the decoder lightweight is essential because many prompts are processed for every image during training.
The chosen prompt formats are simple and understandable but ambiguous, especially points and boxes.
Pointing to a backpack could mean segmenting the pocket, the backpack, or the backpack's owner. Adding more foreground and background points can resolve this uncertainty but makes usage slower and more cumbersome. The authors instead suggest predicting three masks (whole, part, and subpart) for a single prompt and, during training, backpropagating only through the mask with the lowest loss.
Since we do not know which of the three masks to use during inference, a special head in the network predicts a score for each mask: the estimated IoU between that mask and the object it covers. If multiple prompts are used, the loss is calculated between the ground truth and a fourth output mask (always predicted, but not used in the single-prompt scenario).
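In practice, this is exactly what the released segment_anything package exposes. Below is a usage sketch, assuming a downloaded ViT-H checkpoint; the file and image paths and the point coordinates are placeholders. Note that set_image runs the heavy image encoder once, and each predict call only runs the cheap prompt encoder and decoder, returning three mask hypotheses together with their predicted IoU scores.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Checkpoint and image paths are placeholders.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("backpack.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # the heavy ViT image encoder runs once here

# A single (ambiguous) foreground point on the backpack.
point = np.array([[450, 320]])
label = np.array([1])  # 1 = foreground, 0 = background

# The lightweight decoder is cheap, so many prompts can reuse the same image embedding.
masks, scores, logits = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,  # whole / part / subpart hypotheses
)
best = int(np.argmax(scores))  # scores are the predicted IoU for each mask
print(masks.shape, scores, "picked mask", best)
```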
The released SAM is trained on image-mask pairs. To improve quality and reduce ambiguity, the authors use an iterative prompt-sampling technique during training. Initially, the training input is a random point, or a bounding box derived from the ground-truth mask with small added noise. In subsequent iterations, points are sampled from the incorrectly predicted regions (foreground points for false negatives, background points for false positives) and used as prompts, together with the logit mask with the highest predicted IoU from the previous iteration. The training process generates many prompts (8 iterations) for a single image, which is why the authors designed a fast, small decoder and prompt encoder.
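Here is a simplified NumPy sketch of the error-region sampling step (not the authors' training code; in the real setup the unthresholded logits of the best mask are also fed back as a dense prompt):

```python
import numpy as np

def sample_correction_point(gt_mask: np.ndarray, pred_mask: np.ndarray, rng=np.random):
    """Pick the next point prompt from the error region between GT and prediction.

    Returns ((x, y), label) with label 1 (foreground) for false negatives
    and 0 (background) for false positives, or None if the prediction is perfect.
    """
    false_neg = gt_mask & ~pred_mask   # object pixels the model missed
    false_pos = ~gt_mask & pred_mask   # background pixels the model claimed
    error = false_neg | false_pos
    ys, xs = np.nonzero(error)
    if len(xs) == 0:
        return None
    i = rng.randint(len(xs))
    y, x = ys[i], xs[i]
    label = 1 if false_neg[y, x] else 0
    return (x, y), label

gt = np.zeros((8, 8), dtype=bool); gt[2:6, 2:6] = True
pred = np.zeros((8, 8), dtype=bool); pred[2:6, 2:4] = True  # only half the object found
print(sample_correction_point(gt, pred))  # e.g. ((5, 3), 1) -> add a foreground point
```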
It is important to note that text prompt input is not included in the released model (according to the authors, because results were weak), and the paper does not describe text prompting in enough detail, so we do not cover it in this post.
Another important component of SAM is its approach to data collection and usage.
In LLMs, pretraining objectives such as next-token prediction and techniques such as Chain-of-Thought improve chat and question-answering abilities. In Segment Anything, the authors use a pretrained image encoder and a three-stage "data annotation <-> model" loop to collect data and train the segmentation ability:
1. Assisted-manual stage: annotators label foreground/background objects using points. A weak first version of SAM, pretrained on several publicly available datasets, makes a prediction; annotators add more points to improve the masks and can also correct them manually. The cycle is repeated: SAM was retrained six times during this stage, and the model size was increased along the way. In the end, 4.3M masks were collected from 120k images.
2. Semi-automatic stage: the goal is to increase the diversity of masks. Confident masks are pre-filled automatically, and annotators label the remaining, less prominent objects. This stage added 5.9M masks across 180k images; the model was retrained on the newly collected data five times.
3. Fully automatic stage: this stage is used only to generate the SA-1B dataset. The main idea is to sample many points on the image (and on image crops), filter out very small and low-scoring masks, and apply NMS (Non-Maximum Suppression) using the predicted IoU as the score (see the sketch below).
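In the released codebase this pipeline corresponds to SamAutomaticMaskGenerator. A usage sketch, with placeholder paths and parameter values close to the library defaults (double-check them against your installed version):

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder path
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,           # 32x32 grid of point prompts over the image
    pred_iou_thresh=0.88,         # drop masks with a low predicted IoU score
    stability_score_thresh=0.95,  # drop unstable masks
    box_nms_thresh=0.7,           # NMS between masks, scored by predicted IoU
    min_mask_region_area=100,     # remove tiny disconnected regions and holes
)

image = cv2.cvtColor(cv2.imread("street.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts: 'segmentation', 'predicted_iou', ...
print(len(masks), masks[0]["predicted_iou"])
```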
It is hard to compare SAM fairly with other approaches, since they use different types of input and are trained for specific datasets. Let's look at SAM's performance on a few segmentation tasks.
Unsurprisingly, ViTDet-H performs better, but SAM does remarkably well on several metrics considering that it is applied zero-shot, without training on these datasets.
SAM is reasonably close, though certainly behind ViTDet.
Visualizing the outputs shows that SAM's masks are often qualitatively better than ViTDet's, and SAM consistently outperforms ViTDet in the human study.
In the real world, we usually have video as input. Engineers analyzing video often run their pipelines on separate frames every nth millisecond, second, or minute, because it is simpler and speeds up inference. But in doing so we ignore the fact that consecutive frames are connected, and this connection can be exploited. Let's think about what video processing, rather than processing individual images, can offer us:
1. Processing separate frames leads to mask inconsistency between frames. New masks should agree with those from previous frames.
2. Using more than just the current frame as input (for example, the masks from the previous frame) increases memory consumption. But the model itself can be made smaller, since masks usually do not change dramatically between frames and the richer temporal information does part of the work.
3. If we want to reuse masks from previous frames, it is a good idea to track objects. Tracking can be implemented inside the model, which then learns how to map old masks to new predictions and combine them, or as a separate algorithm (see the naive sketch after this list).
4. We have a lot of unlabeled videos available that we can use for training.
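As a toy illustration of points 2 and 3, here is a naive sketch that reuses the previous frame's low-resolution logits as a mask prompt for the next frame through SamPredictor's mask_input argument. The fixed point prompt, the paths, and the whole re-prompting scheme are placeholders for illustration, not something proposed in the paper:

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # placeholder path
predictor = SamPredictor(sam)

cap = cv2.VideoCapture("clip.mp4")     # placeholder video
point = np.array([[640, 360]])         # a (naively fixed) point on the tracked object
label = np.array([1])
prev_logits = None

while True:
    ok, frame = cap.read()
    if not ok:
        break
    predictor.set_image(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    masks, scores, logits = predictor.predict(
        point_coords=point,
        point_labels=label,
        # Feed back the low-res logits from the previous frame so the new
        # mask stays consistent with the old one.
        mask_input=prev_logits,
        multimask_output=False,
    )
    prev_logits = logits  # shape (1, 256, 256) when multimask_output=False
cap.release()
```

Even this naive scheme shows the trade-offs from the list above: mask information is carried across frames cheaply, but the heavy image encoder still runs on every frame, which is exactly the part a video-native model would want to shrink or amortize.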
This is the first article in the series. Next time we will go deeper into technical aspects of implementation!