Last week, researchers from Cornell University, Google Research, and UC Berkeley published a paper with the intriguing title "Tracking Everything Everywhere All at Once". It proposes a solution to the track-any-point problem. As the name hints, it tracks everything everywhere, meaning pixels across all frames, even when they are occluded. See the beautiful visualizations on the paper's OmniMotion website:
So, if you select a specific pixel, you can find its coordinates in all previous frames as well as in all subsequent ones. That is fantastic! However, since no code is available yet, we believe the selected pixel should be a good feature to track, e.g. a corner, rather than a point from a single-color or low-texture area.
The first thing to note is that the algorithm is designed to work on the whole video at once. It runs an optimization over all video frames, so it is not meant for real-time tracking. It might, however, be useful for analyzing surveillance footage or for sports analytics.
Second, it needs an external algorithm to supervise the track optimization: the authors compute RAFT optical flow for all frame pairs before the optimization starts.
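As a rough illustration of that preprocessing step, here is a minimal sketch that precomputes pairwise flow with the RAFT model shipped in torchvision. The exact preprocessing, resolution, and frame-pair selection used by the authors are assumptions here, not taken from the paper.

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
raft = raft_large(weights=weights).eval()
preprocess = weights.transforms()

@torch.no_grad()
def pairwise_raft_flows(frames):
    """frames: list of uint8 [3, H, W] tensors (H and W divisible by 8).
    Returns a dense optical flow field for every ordered frame pair."""
    flows = {}
    for i, src in enumerate(frames):
        for j, dst in enumerate(frames):
            if i == j:
                continue
            img1, img2 = preprocess(src.unsqueeze(0), dst.unsqueeze(0))
            # RAFT returns one flow estimate per refinement iteration; keep the last
            flows[(i, j)] = raft(img1, img2)[-1][0]  # [2, H, W] flow from frame i to j
    return flows
```

Note that computing flow for all pairs is quadratic in the number of frames, which is another reason the method targets offline analysis rather than real-time use.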
The authors propose mapping every object pixel from every frame into a single shared 3D space (the canonical 3D volume G in the paper), so that one point in this space corresponds to a 3D trajectory across time. They use two components to build this 3D space:
A NeRF to model volume density and predict color.
Invertible neural networks to capture camera parameters and scene motion across frames (see the toy sketch after this list).
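Since the code is not released, the following is only a toy sketch of that idea, not the authors' implementation. A Real-NVP-style affine coupling layer gives a closed-form invertible map; conditioning it on a per-frame latent code lets a 3D point lifted from frame i be pushed into the canonical volume and pulled back out into frame j. All module names, dimensions, and the single-layer design are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Coupling(nn.Module):
    """Affine coupling layer (Real-NVP style): invertible in closed form.
    Stands in for the paper's invertible mapping between frame-local
    coordinates and the canonical volume G; dimensions are assumptions."""
    def __init__(self, code_dim=16, hidden=64):
        super().__init__()
        # scale/shift for the last two coords, conditioned on the first coord + frame code
        self.net = nn.Sequential(
            nn.Linear(1 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, x, code):            # x: [N, 3], code: [N, code_dim]
        x1, x2 = x[:, :1], x[:, 1:]
        s, t = self.net(torch.cat([x1, code], -1)).chunk(2, -1)
        return torch.cat([x1, x2 * torch.exp(s) + t], -1)

    def inverse(self, u, code):            # exact inverse of forward()
        u1, u2 = u[:, :1], u[:, 1:]
        s, t = self.net(torch.cat([u1, code], -1)).chunk(2, -1)
        return torch.cat([u1, (u2 - t) * torch.exp(-s)], -1)


def correspond(x_i, code_i, code_j, coupling):
    """Map a frame-i local 3D point into the canonical volume and back out into frame j."""
    u = coupling(x_i, code_i)              # T_i: frame i -> canonical space
    return coupling.inverse(u, code_j)     # T_j^{-1}: canonical space -> frame j
```

In the paper, a stack of such invertible layers conditioned on per-frame latent codes plays the role of these bijections, while the NeRF-style network supplies the density and color used for rendering and occlusion reasoning.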
The algorithm overview is shown in the picture below:
On the TAP-Vid benchmark, OmniMotion is compared against prior trackers and shows an impressive improvement. The loss function combines a flow loss, a photometric loss, and a penalty on large displacements between consecutive 3D point locations that enforces temporal smoothness of the 3D motion.
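To make these terms concrete, here is a rough sketch of how they could be combined. The norms, weights, and tensor shapes below are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def tracking_loss(pred_flow, raft_flow,        # [N, 2] predicted vs. precomputed RAFT flow
                  rendered_rgb, observed_rgb,  # [N, 3] rendered vs. observed colors
                  traj_3d,                     # [T, N, 3] 3D point locations over time
                  w_flow=1.0, w_photo=1.0, w_smooth=0.1):
    flow_loss = (pred_flow - raft_flow).abs().mean()      # match the RAFT supervision
    photo_loss = F.mse_loss(rendered_rgb, observed_rgb)   # photometric consistency
    disp = traj_3d[1:] - traj_3d[:-1]                     # consecutive 3D displacements
    smooth_loss = disp.norm(dim=-1).mean()                # penalize large jumps over time
    return w_flow * flow_loss + w_photo * photo_loss + w_smooth * smooth_loss
```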
Object tracking remains one of the hard problems in computer vision. This paper significantly advances the state of the art with a complex but elegant algorithm. It is a great research paper, and we are looking forward to trying it live!
Paper: https://arxiv.org/pdf/2306.05422.pdf