Z-Axis Offset in Pose Estimation for XR
Hi, I truly appreciate the quality of this work—it's one of the most comprehensive and well-documented resources I've encountered in the field of computer vision.
I was developing a tutorial on model-based tracking using a standard RGB camera (link). When testing it with my own dataset, I was pleased to see that the tracking worked successfully, as shown in the attached image.
However, I'm currently facing an issue related to exporting the estimated pose for use in Unity, as I’m working within an extended reality (XR) environment. Specifically, I noticed that the Z component of the translation vector is unexpectedly large—around one meter—which seems too high given the setup.
My question is: where does this Z value come from? Is it directly derived from the solvePnP function using the point correspondences I provided as an initial guess? Or could it be influenced by some misconfiguration in the .wrl model I'm using?
I'm relatively new to these topics, so I apologize if some of these questions seem naive. What confuses me the most is that the tracking results appear visually correct, yet the pose values seem off. It's likely that I’m misinterpreting how the pose data should be handled in this context.
These are my pose estimation values (a 4x4 homogeneous transformation matrix, row by row):

```
-0.9898437464 -0.0659613885  0.1259303493 -0.03436481746
-0.0497446236  0.9905470153  0.1278361566  0.01510272792
-0.1331721821  0.1202734623 -0.9837679931  0.9193867587
 0             0             0              1
```
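Reading the matrix row by row, the fourth column is the translation, so here $t_z \approx 0.919$ m. Below is a minimal sketch of one common way to map such a pose into Unity's left-handed frame; the diag(1, -1, 1, 1) conjugation and the frame conventions (OpenCV/ViSP: x right, y down, z forward; Unity: x right, y up, z forward) are assumptions about the setup, not something taken from the post.

```cpp
// Minimal sketch: map a right-handed OpenCV/ViSP pose (x right, y down,
// z forward) into Unity's left-handed frame (x right, y up, z forward) by
// conjugating with S = diag(1, -1, 1, 1). The matrix values are the ones
// posted above, read as a row-major 4x4 homogeneous transformation.
#include <array>
#include <cstdio>

using Mat4 = std::array<std::array<double, 4>, 4>;

// M_unity = S * M_cv * S: an element changes sign iff exactly one of its
// row/column indices is the y index (1).
Mat4 cvPoseToUnity(const Mat4 &m)
{
    Mat4 out = m;
    for (int r = 0; r < 3; ++r) {
        for (int c = 0; c < 4; ++c) {
            if ((r == 1) != (c == 1)) // exactly one y index -> sign flip
                out[r][c] = -m[r][c];
        }
    }
    return out; // bottom row (0 0 0 1) is unchanged
}

int main()
{
    Mat4 cMo = {{{-0.9898437464, -0.0659613885,  0.1259303493, -0.03436481746},
                 {-0.0497446236,  0.9905470153,  0.1278361566,  0.01510272792},
                 {-0.1331721821,  0.1202734623, -0.9837679931,  0.9193867587},
                 { 0.0,           0.0,           0.0,           1.0}}};

    Mat4 u = cvPoseToUnity(cMo);
    // The fourth column is the translation; u[2][3] is still ~0.919 m, the
    // Z value in question. The conversion changes axes, not magnitudes.
    std::printf("Unity translation: (%.4f, %.4f, %.4f)\n",
                u[0][3], u[1][3], u[2][3]);
    return 0;
}
```

Note that this mapping changes which axes the pose is expressed along, not distances: a Z of about 0.92 m remains about 0.92 m after export.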
> Specifically, I noticed that the Z component of the translation vector is unexpectedly large—around one meter—which seems too high given the setup.
Usually when you have a discrepancy in the estimated pose, it comes from bad camera intrinsic parameters (or bad 3D object points). In that case, however, the whole estimated translation vector would be off by a scale factor, not only its Z-component.
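To make the scale effect concrete, consider the pinhole projection of a point at depth $Z$ (a standard model, not specific to this setup):

$$u - c_x = f \, \frac{X}{Z} \quad\Longrightarrow\quad Z = f \, \frac{X}{u - c_x}.$$

If the focal length used in the calibration is wrong by a factor $k$ (i.e. $f' = k f$), the pose that reproduces the same image points has its depth, and in fact its whole translation, scaled by roughly $k$.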
> My question is: where does this Z value come from? Is it directly derived from the solvePnP function using the point correspondences I provided as an initial guess? Or could it be influenced by some misconfiguration in the .wrl model I'm using?
The tracker tracks moving edges from frame to frame (or KLT points for the KLT tracker). The objective function minimizes errors measured in the image, in pixels, by computing the corresponding 3D camera displacement, starting from an initial camera pose.
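Schematically, the minimization has the usual virtual-visual-servoing form (the notation below is a sketch of that formulation, not copied from the ViSP documentation):

$$\hat{\mathbf{r}} = \arg\min_{\mathbf{r}} \sum_i \rho\!\left( d_\perp\!\big( \mathbf{p}_i,\; \pi(l_i, \mathbf{r}) \big) \right)$$

where $\mathbf{p}_i$ are the moving-edge points tracked in the image, $\pi(l_i, \mathbf{r})$ is the model contour $l_i$ projected at camera pose $\mathbf{r}$, $d_\perp$ is the point-to-contour distance in pixels, and $\rho$ is a robust estimator. The optimization starts from the previous frame's pose, which is why the tracker needs an initial camera pose.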
solvePnP computes the camera pose from corresponding pairs of 3D object points and 2D projected image points, and it does not need an initial camera pose (although a refinement method can be applied afterward).
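For comparison, here is a minimal, self-contained sketch of such a call; all point values and intrinsics are made-up placeholders, and cv::solvePnPRefineLM is one example of an optional refinement step:

```cpp
// Minimal sketch of an OpenCV solvePnP call with placeholder data.
#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <iostream>
#include <vector>

int main()
{
    // 3D object points in the model frame and their observed 2D projections.
    // The translation returned below is expressed in the same unit as these
    // 3D points (here: meters).
    std::vector<cv::Point3f> objectPoints = {
        {0.0f, 0.0f, 0.0f}, {0.1f, 0.0f, 0.0f},
        {0.1f, 0.1f, 0.0f}, {0.0f, 0.1f, 0.0f}};
    std::vector<cv::Point2f> imagePoints = {
        {320.0f, 240.0f}, {420.0f, 238.0f},
        {418.0f, 340.0f}, {322.0f, 342.0f}};

    // Camera intrinsics: if fx/fy are wrong by a factor k, the estimated
    // translation comes out scaled by roughly the same factor.
    cv::Mat K = (cv::Mat_<double>(3, 3) << 600, 0, 320,
                                             0, 600, 240,
                                             0,   0,   1);
    cv::Mat distCoeffs = cv::Mat::zeros(5, 1, CV_64F);

    // No initial pose is required: solvePnP estimates rvec/tvec directly
    // from the 3D-2D correspondences.
    cv::Mat rvec, tvec;
    cv::solvePnP(objectPoints, imagePoints, K, distCoeffs, rvec, tvec);

    // Optional non-linear refinement of the pose found above.
    cv::solvePnPRefineLM(objectPoints, imagePoints, K, distCoeffs, rvec, tvec);

    std::cout << "tvec = " << tvec.t() << std::endl; // Z is tvec.at<double>(2)
    return 0;
}
```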
Note:
The vpMbGenericTracker tracks visible object contours or planar, textured object faces. In your screen capture, the .wrl object model contains many triangulated faces that do not correspond to visible object contours.
For more recent works, see: