Fresh off the press: the OpenCV.ai team is excited to announce a collaboration with Snap Inc., the popular camera and social media company. If you’ve used popular Snapchat filters, you may know you can author new filters with Snap’s Lens Studio. In July 2023, the OpenCV.ai team partnered with the Lens Studio team to bring real-time Optical Character Recognition (OCR) capabilities to the software, and the result just launched in Lens Studio version 4.53! Read on.
A fast and accurate Optical Character Recognition Template has been a user request for years. For Snapchat users, this Lens offers a seamless way to bring the text around them into their communication and interaction experiences. We hope you are curious enough to check it out yourself, but we are happy to explain how it came about.
The popularity of mobile phone cameras, combined with high-performance chipsets, has made real-time Optical Character Recognition (OCR) technology popular in language translation, automatic captioning, and content capture. With just a few taps on your screen, it enables seamless text extraction from images, screenshots, and even live camera feeds, opening up a world of possibilities.
Before we get started on creating the OCR Lens for Snapchat, the popular multimedia messaging app, let's take a moment to explore how a Snapchat Lens works.
Snapchat Lens is an interactive feature within the Snapchat app that transforms photos and videos with animations and special effects in real time.
For example, Snap, with its wide community, offers an assortment of face modifiers that allow you to morph your appearance in creative ways. Whether you want to add virtual makeup, change your hairstyle, or even distort your facial features, these Lenses are sure to provide endless entertainment and selfies.
Snapchat Lens has become increasingly popular among users of all ages, and brands and advertisers use it to create interactive and engaging campaigns.
Deep Learning models power Snapchat Lenses, enabling real-time recognition of facial features, landmarks, and facial expressions. These models work seamlessly to bring AR elements to life and enhance the user experience, captivating millions of Snapchat users worldwide.
But what about Text Recognition? The answer is a stack of efficient algorithms and fast deep learning models. Let's dive into the OCR pipeline and walk through each of its stages!
First, the raw input image undergoes a series of transformations to make it more suitable for text extraction: normalization, resizing, and padding. By standardizing the input images, the OCR pipeline is primed for more accurate character recognition.
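For illustration, here is a minimal Python sketch of such a preprocessing step. The specific values (a 960-pixel size limit, ImageNet-style mean and standard deviation, sides rounded to multiples of 32) follow common PP-OCRv3 defaults and are assumptions here, not the Template's internal settings.

```python
import cv2
import numpy as np

def preprocess(image_bgr, max_side=960):
    """Resize, normalize, and batch an image for a text detector.

    The 960-pixel limit, multiple-of-32 constraint, and ImageNet-style
    mean/std below are common PP-OCRv3 defaults, assumed for illustration.
    """
    h, w = image_bgr.shape[:2]
    # Keep the aspect ratio; detector inputs are typically multiples of 32.
    scale = min(max_side / max(h, w), 1.0)
    new_h = max(int(round(h * scale / 32)) * 32, 32)
    new_w = max(int(round(w * scale / 32)) * 32, 32)
    resized = cv2.resize(image_bgr, (new_w, new_h))

    # Normalize each channel; the channel order must match how the model
    # was trained (another assumption in this sketch).
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    chw = ((resized.astype(np.float32) / 255.0 - mean) / std).transpose(2, 0, 1)
    return chw[np.newaxis, ...]  # NCHW batch of one
```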
At this stage, a Detector model analyzes the preprocessed input image and classifies each pixel as belonging to a text block or not. The output is a segmentation mask that serves as the foundation for the next stages of the pipeline.
We are using the English ultra-lightweight PP-OCRv3 model from the PaddleOCR repository. The Detector comes from the third generation of this architecture, enabling efficient and accurate identification of potential text regions in real time.
The segmentation mask generated by the Detector model acts as a blueprint for text segmentation. Bounding boxes are estimated to delineate lines of text within the image. These boxes play a pivotal role in isolating individual text lines for subsequent recognition.
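Sketched below is a simplified stand-in for this box-extraction step (the real pipeline uses the DB postprocessing that ships with PP-OCRv3); the thresholds are illustrative assumptions.

```python
import cv2
import numpy as np

def mask_to_boxes(prob_map, bin_thresh=0.3, min_area=16):
    """Turn the detector's per-pixel probability map into rotated boxes.

    A simplified stand-in for PP-OCRv3's DB postprocessing; the binarization
    threshold and minimum-area filter are illustrative assumptions.
    """
    binary = (prob_map > bin_thresh).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for contour in contours:
        rect = cv2.minAreaRect(contour)          # ((cx, cy), (w, h), angle)
        if rect[1][0] * rect[1][1] < min_area:   # drop tiny spurious blobs
            continue
        boxes.append(cv2.boxPoints(rect))        # four corner points
    return boxes
```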
Next, the input texture is cropped around each detected rectangle, and the cropped sections are passed to the Text Recognition model. At this stage, it is crucial to determine the orientation of each cropped section: horizontal or vertical. If the text is vertically oriented, we align it horizontally so that the Text Recognition model can read it correctly.
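Here is a minimal sketch of such a crop-and-align step; the aspect-ratio heuristic for deciding that a crop is vertical is a common convention, assumed here rather than taken from the Template.

```python
import cv2
import numpy as np

def crop_and_align(image, box):
    """Crop a detected quadrilateral and rotate it upright if needed.

    `box` holds the four corner points from the detection stage, assumed
    ordered clockwise starting at the top-left corner.
    """
    box = np.asarray(box, dtype=np.float32)
    w = max(int(max(np.linalg.norm(box[0] - box[1]), np.linalg.norm(box[2] - box[3]))), 1)
    h = max(int(max(np.linalg.norm(box[0] - box[3]), np.linalg.norm(box[1] - box[2]))), 1)
    dst = np.array([[0, 0], [w, 0], [w, h], [0, h]], dtype=np.float32)
    matrix = cv2.getPerspectiveTransform(box, dst)
    crop = cv2.warpPerspective(image, matrix, (w, h))
    # Heuristic: a crop much taller than it is wide is treated as vertical
    # text and rotated so the recognizer sees a horizontal line.
    if h > 1.5 * w:
        crop = cv2.rotate(crop, cv2.ROTATE_90_CLOCKWISE)
    return crop
```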
We are using the en_number_mobile_v2.0_rec model from the PaddleOCR repository, which recognizes both English text and digits. The Text Recognition model is the first generation of the Convolutional Recurrent Neural Network (CRNN) architecture, a combination of convolutional and recurrent neural networks.
The Text Recognition model extracts characters that are compiled into a list of strings. Our model operates on whole sentences instead of individual words, which offers several benefits: it reduces the number of runs required to complete the recognition process, and it allows the model to recognize gaps between words, an important capability for many applications. However, this stage is more resource-intensive than Text Detection, as the model runs once for each cropped rectangle.
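A CRNN of this kind emits one prediction per horizontal timestep, which a CTC decoder collapses into a string. Here is a minimal greedy-decoding sketch; the `charset` argument stands in for the model's character dictionary and is an assumption of this example.

```python
import numpy as np

def ctc_greedy_decode(logits, charset, blank_id=0):
    """Collapse per-timestep CRNN predictions into a string.

    `logits` has shape (timesteps, num_classes); `charset` maps class
    indices to characters, with index 0 reserved for the CTC blank
    (an assumed convention in this sketch).
    """
    best = logits.argmax(axis=-1)  # most likely class per timestep
    chars = []
    prev = blank_id
    for idx in best:
        # CTC rule: skip blanks and immediate repeats of the same symbol.
        if idx != blank_id and idx != prev:
            chars.append(charset[idx])
        prev = idx
    return "".join(chars)
```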
Great! Now that we've explored all stages of the OCR pipeline, let's take a closer look at how we integrate the OCR pipeline with Lens Studio, a robust and user-friendly desktop application that empowers creators to construct their unique Lenses from the ground up. Read more about Lens Studio.
Lens Studio comes with a variety of pre-made deep learning models, but you're not restricted to just those. You can easily incorporate your own models using SnapML, a powerful tool for bringing custom neural networks and advanced image processing into a Lens.
Before creating an OCR Template in Lens Studio, we need to convert our models, the Text Detector and the Text Recognizer, to ONNX, the format SnapML consumes. ONNX is a cross-platform format that enables models to be trained in one framework and then easily used in another, promoting flexibility.
First, we extracted the pretrained weights for both models from the PaddleOCR library, which specializes in Optical Character Recognition tasks. These weights are stored in the PaddlePaddle format.
Next, we break down the integration into two key steps (a conversion sketch follows the list):
1. Weights conversion from PaddlePaddle to ONNX format
2. Incorporating models in ONNX format into SnapML by modifying the models’ architecture to improve compatibility
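For the first step, here is a minimal sketch using the paddle2onnx converter, driven from Python for convenience. The directory and file names are illustrative placeholders for the exported PaddleOCR inference files; repeat the call with the recognition model's directory for the second network.

```python
import subprocess

# Convert PaddlePaddle inference weights to ONNX via the paddle2onnx CLI.
# The paths below are placeholders; point --model_dir at the exported
# PaddleOCR inference model you downloaded.
subprocess.run(
    [
        "paddle2onnx",
        "--model_dir", "en_PP-OCRv3_det_infer",
        "--model_filename", "inference.pdmodel",
        "--params_filename", "inference.pdiparams",
        "--save_file", "text_detector.onnx",
        "--opset_version", "11",
    ],
    check=True,
)
```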
To see this integration in action, refer to the accompanying Google Colab Notebook. Within the notebook, you will find step-by-step instructions on how to launch the OCR inference pipeline in Python.
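Once converted, the ONNX files can be sanity-checked in Python before they ever reach Lens Studio. A minimal sketch with onnxruntime, assuming the file name from the conversion step above and a detector exported with dynamic spatial sizes:

```python
import numpy as np
import onnxruntime as ort

# Load the converted detector and push a dummy frame through it as a quick
# sanity check. The 640x640 input assumes the exported model accepts
# dynamic spatial sizes (multiples of 32).
session = ort.InferenceSession("text_detector.onnx",
                               providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)  # NCHW batch of one
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)  # per-pixel text-probability map
```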
Now that we have integrated these Deep Learning models into Lens Studio, let's get ready for some magic!
The OCR Template is based on the Object Detection Template, which provides a way to place UI elements on the screen based on the bounding boxes of the objects.
In the OCR Template, the Text Detector works in real time, providing users with instant feedback on the text present in the image, just as the Object Detection Template does.
However, the Text Recognition algorithm requires additional processing time, which may cause delays in the recognition process. To address this issue, we use a post-capture approach. Here is how the user experience works for the OCR Lens:
1. The user sees the highlighted detected text on the screen
2. The user can press the "Recognize" button to get the text as a string of characters to copy to the clipboard
Furthermore, the OCR Template provides numerous parameters that can be adjusted. For example, you can modify the model's prediction confidence threshold or set a maximum number of detections per captured frame.
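As a generic illustration of what these two parameters control (the names and defaults below are hypothetical, not the actual Lens Studio API):

```python
def filter_detections(detections, conf_threshold=0.5, max_detections=10):
    """Keep only confident detections, capped at a fixed count per frame.

    A generic sketch of what a confidence threshold and a max-detections
    parameter do; the field names and default values are hypothetical.
    """
    confident = [d for d in detections if d["score"] >= conf_threshold]
    confident.sort(key=lambda d: d["score"], reverse=True)
    return confident[:max_detections]
```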
If you're interested in customizing the OCR Template, you might want to take a look at the OCR Guide for more information about the API and the Template UI. Enjoy exploring!
For instance, you can improve the OCR template by incorporating several features that facilitate text interaction. Integrating a text magnification function can assist people with visual impairments in effortlessly reading the surrounding text.
The OCR Template on Snapchat offers a convenient way to use text for communication and interaction. By effectively implementing the OCR Template API, Snapchat has taken a step towards enhancing the augmented reality experience for its diverse user base. It enables users to explore a new way of expressing themselves and engaging with content.
We are thrilled to be working with Snap!
We can't wait to see all the awesome ways people use this template to create beautiful and fun solutions for imaginative self-expression.