6DoF refers to an object's six degrees of freedom in 3D space: translation along three axes and rotation about them. It is a fundamental concept for object pose prediction. In this guide, we walk through object pose prediction in 3D space, cover the mechanics of rotation, translation, and scale, and focus on the metrics used to evaluate these predictions.
The 6DoF estimation task encompasses predicting an object's position in 3D space (X, Y, Z coordinates), along with its rotation around these axes, called yaw, pitch, and roll.
Though various approaches exist to predict these, measuring their effectiveness is no simple task. Quality metrics in 2D space tasks have reached a certain level of consensus, but in 3D, things are a bit more complicated.
Before we start reviewing metrics, let’s look closer at this task.
6D pose estimation begins with an RGB (and sometimes RGB-D) image that features the target object. The aim is to predict the object's 6D pose, which represents the rigid transformation from the object's coordinate system to the camera's coordinate system.
A complete 6D pose consists of two elements: the 3D rotation of the object (a 3x3 matrix R) and the 3D translation (a 3x1 vector t). For convenience, both can be padded to 4x4 homogeneous matrices, so that the full pose is a single matrix product.
This task is also helpful when we deal with multi-object tracking.
Rotation is described by a rotation matrix (R), which can be decomposed into three elementary rotations, one about each axis. The rotation matrix is defined by the yaw, pitch, and roll angles:
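A common way to write this, assuming yaw is taken about the z axis, pitch about the y axis, and roll about the x axis (other conventions exist, so this is one possible reading), is the product of three elementary rotations:

```latex
R = R_z(\alpha)\,R_y(\beta)\,R_x(\gamma) =
\begin{bmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\gamma & -\sin\gamma \\ 0 & \sin\gamma & \cos\gamma \end{bmatrix}
```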
where α, β, and γ are yaw, pitch, and roll angles, respectively.
In the translation matrix, we take into account the distances along the x, y, and z axes (𝒗x, 𝒗y, 𝒗z).
The translation transformation matrix T in 3D space is a 4x4 matrix with the following structure:
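In homogeneous coordinates, a pure translation is usually written as the identity matrix with the offsets in the last column. The standard form is:

```latex
T = \begin{bmatrix}
1 & 0 & 0 & v_x \\
0 & 1 & 0 & v_y \\
0 & 0 & 1 & v_z \\
0 & 0 & 0 & 1
\end{bmatrix}
```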
where 𝒗x, 𝒗y, 𝒗z are the translation distances in x, y, and z.
In other words, the translation is a vector (t) that, when added to the original position, shifts the entire model in 3D space.
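To make rotation and translation concrete, here is a minimal sketch in plain Python (standard library only). It assumes the yaw-about-z, pitch-about-y, roll-about-x convention, which is one common choice rather than the only one. The helper names `rotation_from_ypr`, `pose_matrix`, and `transform_point` are illustrative and do not come from any library.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def rotation_from_ypr(yaw, pitch, roll):
    """Build R = Rz(yaw) * Ry(pitch) * Rx(roll); angles in radians (assumed convention)."""
    ca, sa = math.cos(yaw), math.sin(yaw)
    cb, sb = math.cos(pitch), math.sin(pitch)
    cg, sg = math.cos(roll), math.sin(roll)
    Rz = [[ca, -sa, 0.0], [sa, ca, 0.0], [0.0, 0.0, 1.0]]
    Ry = [[cb, 0.0, sb], [0.0, 1.0, 0.0], [-sb, 0.0, cb]]
    Rx = [[1.0, 0.0, 0.0], [0.0, cg, -sg], [0.0, sg, cg]]
    return matmul(Rz, matmul(Ry, Rx))

def pose_matrix(R, t):
    """Stack a 3x3 rotation R and a length-3 translation t into a 4x4 transform."""
    return [list(R[i]) + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def transform_point(T, p):
    """Apply a 4x4 homogeneous transform T to a 3D point p."""
    h = [p[0], p[1], p[2], 1.0]
    return [sum(T[i][j] * h[j] for j in range(4)) for i in range(3)]

# Usage: rotate by 90 degrees of yaw, then shift by (1, 2, 3).
T = pose_matrix(rotation_from_ypr(math.pi / 2, 0.0, 0.0), [1.0, 2.0, 3.0])
moved = transform_point(T, [1.0, 0.0, 0.0])
```

In this usage example the point (1, 0, 0) is first rotated onto the y axis and then shifted, so `moved` is approximately (1, 3, 3). In practice a numerical library would replace the hand-written matrix product.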
Evaluation of these transformations is usually done via two groups of metrics: those measuring the whole transformation matrix and those measuring rotation (R), translation (T), and scale (S) separately. We will review the most common metrics from each group.
The most common overall metric is the average distance (ADD) or ADD(s) for symmetric objects. Here, the goal is to measure the distance between the Ground Truth (GT) 3D point cloud and the predicted 3D point cloud resulting from the transformation.
As a first step, the predicted and GT 3D point clouds are computed from a base model using the predicted and GT transformation matrices. Then the distance between each pair of corresponding points is measured, and the mean distance is calculated for each object.
As a second step, a threshold on the mean distance is picked; a common choice is 10% of the object's diameter. Then the percentage of objects with a mean distance below this threshold is calculated. This number is called ADD accuracy. ADD(S) follows the same procedure for symmetric objects, except that each point is matched to its closest point in the other cloud rather than to its corresponding point.
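The two steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the model is a short list of 3D points, and the symmetric variant uses a brute-force nearest-point search, which is an assumption about how ADD(S) is usually computed.

```python
import math

def transform(R, t, points):
    """Apply x -> R x + t to every 3D point in the list."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]

def add_distance(model, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding GT and predicted points."""
    gt = transform(R_gt, t_gt, model)
    pred = transform(R_pred, t_pred, model)
    return sum(math.dist(a, b) for a, b in zip(gt, pred)) / len(model)

def add_s_distance(model, R_gt, t_gt, R_pred, t_pred):
    """Symmetric variant: each GT point is matched to its closest predicted point."""
    gt = transform(R_gt, t_gt, model)
    pred = transform(R_pred, t_pred, model)
    return sum(min(math.dist(a, b) for b in pred) for a in gt) / len(model)

def accuracy(distances, threshold):
    """Fraction of objects whose mean distance is below the threshold."""
    return sum(d < threshold for d in distances) / len(distances)

# Usage: a two-point model that looks the same after a 180-degree turn about z.
model = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
half_turn_z = [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]]
zero = [0.0, 0.0, 0.0]
add_value = add_distance(model, identity, zero, half_turn_z, zero)
add_s_value = add_s_distance(model, identity, zero, half_turn_z, zero)
```

For this toy model the half turn swaps the two points, so the corresponding-point distance penalizes a pose that is visually indistinguishable from the GT, while the closest-point variant does not. That is the motivation for using a separate metric on symmetric objects.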
However, in certain cases, evaluating rotation, translation, and scale separately can provide deeper insights into the error sources.
The translation error is typically measured as the Euclidean distance between the predicted and GT translation vectors.
The scale error is calculated by dividing the GT scale by the predicted scale, so a value of 1 indicates a perfect match.
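As a small sketch of these two per-component errors, assuming the translation is a 3-vector and the scale is a single positive number (both function names are hypothetical):

```python
import math

def translation_error(t_gt, t_pred):
    """Euclidean distance between GT and predicted translation vectors."""
    return math.dist(t_gt, t_pred)

def scale_error(s_gt, s_pred):
    """Ratio of GT scale to predicted scale; 1.0 means the scales match."""
    return s_gt / s_pred

# Usage: a prediction that is off by (3, 4, 0) and half the true size.
t_err = translation_error([0.0, 0.0, 0.0], [3.0, 4.0, 0.0])
s_err = scale_error(2.0, 1.0)
```

The translation error is expressed in the same units as the translation vectors, so it should be read relative to the object size and the working distance.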
Calculating the rotation error, on the other hand, is more challenging. Rotation matrices belong to the 3D rotation group, often denoted SO(3). Therefore, the difference between two rotation matrices, Rgt and Rpred, can be measured with a distance metric on SO(3).
There are several approaches to defining a distance function or metric on the 3D rotation group. You can take a closer look at them in this paper in section 3. Some of them are based on quaternions, and some use a direct comparison of matrices or, for example, the deviation from the identity matrix. We will review the most representative and intuitive method here.
This method calculates the angle of the relative rotation between the Rgt and Rpred matrices:
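One standard way to express this, given here as an assumption about the intended formula, uses the fact that the trace of any 3D rotation matrix equals 1 + 2cos θ, where θ is its rotation angle. Applied to the relative rotation between GT and prediction:

```latex
\operatorname{tr}\left(R_{gt}^{\top} R_{pred}\right) = 1 + 2\cos\theta
```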
From this equation, the angle of rotation can be easily calculated:
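Solving the trace identity tr(R) = 1 + 2cos θ for θ, with R taken as the relative rotation between GT and prediction (again a reconstruction of the usual form, not a quoted formula), gives:

```latex
\theta = \arccos\left(\frac{\operatorname{tr}\left(R_{gt}^{\top} R_{pred}\right) - 1}{2}\right)
```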
As a result, we get a single angle that represents the overall rotation error in 3D space. The advantage of this metric is that it is easy to interpret geometrically: it is the smallest rotation needed to align the predicted orientation with the GT one.
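A minimal sketch of this angular error in plain Python is shown below. It assumes both inputs are valid 3x3 rotation matrices given as lists of rows. The names `rotation_error_deg` and `rot_z` are illustrative. The clamp guards against floating-point values slightly outside [-1, 1] before taking the arccosine.

```python
import math

def rotation_error_deg(R_gt, R_pred):
    """Angle (degrees) of the relative rotation between two rotation matrices."""
    # trace(R_gt^T @ R_pred) equals the sum of element-wise products.
    trace = sum(R_gt[i][j] * R_pred[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

def rot_z(angle_deg):
    """Rotation about the z axis, used here only to build test inputs."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Usage: GT is a 30-degree turn about z, the prediction is a 75-degree turn.
error = rotation_error_deg(rot_z(30.0), rot_z(75.0))
```

Because both rotations in the usage example share the same axis, the error is simply the difference of the two angles, about 45 degrees. For rotations about different axes the result is still a single angle, but it no longer corresponds to any one of yaw, pitch, or roll.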
Evaluation is one of the critical processes in Deep Learning, and the right choice of evaluation metrics is crucial. Only with reliable and interpretable metrics can we both make the right decisions and explain them to our colleagues or customers.