
How to fetch the absolute depth of hand 3D coordinate?

youngstu opened this issue 3 years ago • 10 comments

Considering that the absolute coordinates of the hand are not obtained, how can 3D hand tracking be applied to augmented reality scenes such as AR glasses?

youngstu avatar Jun 22 '21 07:06 youngstu

Hi @youngstu, could you please share your use case with augmented reality? And regarding the absolute coordinates of the hand not being obtained, could you elaborate with any video or error logs? Thanks!

sgowroji avatar Jun 22 '21 10:06 sgowroji

Using 3D joints for human-computer interaction control, such as with Nreal AR glasses.

https://www.youtube.com/watch?v=9LxOlsHu3r8&ab_channel=UploadVR

If there is no absolute depth, it is impossible to judge whether the button has been touched.

youngstu avatar Jun 22 '21 10:06 youngstu

I'd also be interested to see a formula for that.

gb2111 avatar Jun 23 '21 06:06 gb2111

You can find the world coordinates of hands here: https://google.github.io/mediapipe/solutions/hands.html#multi_hand_world_landmarks

sgowroji avatar Dec 16 '21 05:12 sgowroji

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler[bot] avatar Dec 23 '21 05:12 google-ml-butler[bot]

@sgowroji thanks for identifying that. Overall this is a great project.

I'm afraid, though, the origin of those world coordinates is only relative to the hand itself. This is interesting and meaningful if your interaction is confined only to motion of the fingers relative to the wrist (but not to motion of the entire hand, in 3D space, say relative to the body or the camera...). As soon as you try to measure translation of the entire hand in 3D space, it does not appear possible to do so in metric space... :(

It seems the best you can do is use screen coordinates for (x, y) and some sort of abstraction for the z coordinate. Upon close analysis and broad testing of the z depth coordinates from multi_hand_landmarks, I find they do not behave consistently with respect to either sign or magnitude. Sometimes I'm surprised to find the values of the index finger or wrist (for example) flip between positive and negative; it's clear that more negative is generally closer to the camera and more positive is generally farther, but unfortunately we also do not see consistent linearity here.

It seems like for Hand Landmarks we need something like the Metric 3D space in the Face Geometry module, as it pertains to the Face Landmarks model. To confirm, there is no true global 3D coordinate of the Hand landmarks, relative to the camera, in metric space, anywhere?

Simply put, we would like a model to input images, and output a 3D coordinate for every landmark, where (0,0,0) equals the camera, and (x,y,z) for each landmark are measured in meters from the camera.

Happy holidays!

legel avatar Dec 28 '21 22:12 legel

PS just to follow up with a helpful workaround, and also potentially a good pointer for implementing the requested global hand 3D coordinates (relative to camera and/or body), consider:

  • MediaPipe Holistic is able to simultaneously compute MediaPipe Hands and MediaPipe Pose, and provide outputs of both APIs.
  • MediaPipe Pose is able to (roughly) deliver global 3D metric coordinates of 4 keypoints per hand, including the wrist. Video: https://user-images.githubusercontent.com/1915466/147910871-b83a0c83-5bd9-4ac9-b547-6e75d3f861fd.mov

So, for tracking hand motion relative to the body, MediaPipe Pose is today the best bet.

However, in testing rotations of the hands, as in the above video, we see apparent failure by MediaPipe Pose to effectively track the thumb vs. the pinky. Here, MediaPipe Hands is the clear winner, by far.

Going forward, it makes sense to do more on the backend to combine these coordinate systems. I would basically propose that MediaPipe Holistic (at least) exposes to the user a way to get body-relative metric 3D coordinates for all MediaPipe Hands landmarks.

So far, it seems like that can be done by (pseudocode):

MediaPipeHolistic:BODY_CENTERED_HAND_COORDINATES = 
MediaPipePose:POSE_WORLD_LANDMARKS:AverageValueOfHandCoordinates + MediaPipeHands:MULTI_HAND_WORLD_LANDMARKS

I think the above is a crude approximation.

MediaPipeHands:MULTI_HAND_WORLD_LANDMARKS apparently has as its origin "the hand’s approximate geometric center"... Intuitively, what this means is, the relative positioning of fingers to the hand could remain constant, even while the hand is moving relative to the body, and MediaPipeHands:MULTI_HAND_WORLD_LANDMARKS would record no motion. Alternatively, if all of the landmarks were tracked in a body-centered coordinate system, for that same action, one could still recognize that the relative positioning of fingers was constant, but they could also easily track the motion of the hand in 3D space (key objective here).
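For illustration, here is a minimal Python sketch of the crude approximation above. It assumes the hand-centered world landmarks come from a MediaPipe Hands run on the same frame (Holistic does not currently expose hand world landmarks), that pose world-landmark indices 15/17/19/21 are the left wrist/pinky/index/thumb, and that both outputs share axis orientation, which is only approximately true; the function and variable names are illustrative:

import numpy as np

# MediaPipe Pose world-landmark indices for the left hand: wrist, pinky, index, thumb
LEFT_HAND_POSE_IDX = [15, 17, 19, 21]

def body_centered_hand_coordinates(pose_world_landmarks, hand_world_landmarks):
    # approximate the hand's center in body space by averaging the pose hand keypoints
    pose_pts = np.array([[pose_world_landmarks.landmark[i].x,
                          pose_world_landmarks.landmark[i].y,
                          pose_world_landmarks.landmark[i].z]
                         for i in LEFT_HAND_POSE_IDX])
    hand_center_in_body = pose_pts.mean(axis=0)

    # shift the hand-centered metric landmarks by that center
    hand_pts = np.array([[lm.x, lm.y, lm.z] for lm in hand_world_landmarks.landmark])
    return hand_pts + hand_center_in_body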

legel avatar Jan 03 '22 08:01 legel

To get the coordinates with the camera as a reference point, consider using cvzone. It's a Python library which uses MediaPipe and OpenCV at its core.

paulnegz avatar Feb 09 '22 10:02 paulnegz

@legel Thanks for the clear description of the problem; calling this "metric 3D hand coordinates" makes it easy to discuss the problem the OP stated.

You are correct that the hand landmarks can be used to achieve the 3D model, e.g. the x, y can be unprojected into 3D space with respect to some camera/view. The unproject function is usually part of your 3D rendering engine. The problem is with the z coordinate, which is confusing for many users; see https://github.com/google/mediapipe/issues/742

The positive/negative values of the z coords are OK, as those are relative to the wrist. You can notice that the wrist's value is different in magnitude: in my runtime it is really small, e.g. 0.9e-7, while the rest are around ±0.002. So if we can denormalise this wrist z coordinate, then we should be able to unproject properly into 3D space.

Also, the camera is not necessarily at (0,0,0). In my case the world origin is there and the camera is at -100 units on the z axis.
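For reference, a minimal sketch of the pinhole unprojection mentioned above, assuming a simple camera matrix with focal length f and principal point (cx, cy), and assuming a metric depth z for the landmark is already known (which is exactly the missing piece discussed in this thread); all names are illustrative:

def unproject(u, v, z, f, cx, cy):
    # back-project a pixel (u, v) at metric depth z into camera-space (x, y, z)
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    return x, y, z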

MiroslavPetrik avatar Apr 21 '22 08:04 MiroslavPetrik

Considering that the absolute coordinates of the hand are not obtained, how can 3D hand tracking be applied to augmented reality scenes such as AR glasses?

To get the 3D coordinates in world space ("real" 3D coordinates) you need to perform two steps.

Take a look at the code from the Python Solution API for webcam input - I'll be referring to that.

  1. Solve the Perspective-n-Point problem like so: success, rotation_vector, translation_vector = cv2.solvePnP(model_points, image_points, camera_matrix, distortion, flags=cv2.SOLVEPNP_SQPNP) where:
  • model_points you can get from results.multi_hand_world_landmarks
  • image_points you can get from results.multi_hand_landmarks
  • to get the camera matrix and the distortion you would ideally calibrate your camera, but you can use approximate values like so (just substitute your frame width and height):
# pseudo camera internals
import numpy as np

frame_height, frame_width, channels = (720, 1280, 3)
focal_length = frame_width
center = (frame_width / 2, frame_height / 2)
camera_matrix = np.array(
    [[focal_length, 0, center[0]],
     [0, focal_length, center[1]],
     [0, 0, 1]], dtype="double"
)
distortion = np.zeros((4, 1))  # assuming no lens distortion
  • you can set various flags, but SQPNP seemed to perform the best
  2. Apply the found transformation to the model coordinates like so:
transformation = np.eye(4)  # needs to be 4x4 because you have to use homogeneous coordinates
transformation[0:3, 3] = translation_vector.squeeze()
# the transformation consists only of the translation, because the rotation is already accounted
# for in the model coordinates. Take a look at https://codepen.io/mediapipe/pen/RwGWYJw to see how
# the model coordinates behave - the hand rotates, but doesn't translate

# transform model coordinates into homogeneous coordinates
model_points_hom = np.concatenate((model_points, np.ones((21, 1))), axis=1)

# apply the transformation
world_points = model_points_hom.dot(np.linalg.inv(transformation).T)

and then in world_points you have the 'real' 3D coordinates of the hand
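For completeness, a minimal sketch of how model_points and image_points might be assembled from the results object before the solvePnP call above, assuming a single detected hand and that the normalized landmark x/y map linearly to pixel coordinates; frame_width/frame_height, camera_matrix and distortion are taken from the snippet above, and the variable names are illustrative rather than part of the MediaPipe API:

import cv2
import numpy as np

# hand-centered metric landmarks and normalized image landmarks for the first detected hand
world_lms = results.multi_hand_world_landmarks[0].landmark
image_lms = results.multi_hand_landmarks[0].landmark

# 21 x 3 model points (meters, hand-centered) and 21 x 2 pixel points for the same landmarks
model_points = np.float32([[lm.x, lm.y, lm.z] for lm in world_lms])
image_points = np.float32([[lm.x * frame_width, lm.y * frame_height] for lm in image_lms])

success, rotation_vector, translation_vector = cv2.solvePnP(
    model_points, image_points, camera_matrix, distortion, flags=cv2.SOLVEPNP_SQPNP)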

koegl avatar Jul 02 '22 22:07 koegl

(quoting @koegl's two-step solvePnP answer above)

Can this two-step method be applied to get real-world 3D coordinates for the pose model as well? @koegl @sgowroji

ridasaleem0 avatar Nov 16 '22 11:11 ridasaleem0

(quoting @legel's comment above on combining MediaPipe Pose and MediaPipe Hands coordinates)

Hi @legel, just wondering: I have MediaPipe Pose landmarks from mp_pose and hand landmarks from MediaPipe Hands. I want to convert or map the hand landmarks into the body-centric coordinate system so that I can build a full BVH, but since the hand landmarks from Hands and Pose are computed with different geometric centers, do you know any way to do the mapping? Thanks

UtsaChattopadhyay avatar Nov 25 '22 08:11 UtsaChattopadhyay

Thanks to the excellent reply from @koegl I've been able to get quite good 3D tracking using opencv solvePnP. Thanks!

I've been able to use it to create an interactive 3D demo.


Video of me grabbing a ball in 3D space.

https://user-images.githubusercontent.com/367139/221155956-9b4b0b7b-d78e-418f-8904-1bf7cd7d1066.mp4

A visualisation of how well aligned the points are to the 2D MediaPipe image points. The green spheres are world points in OpenGL, and the coloured circles with white outlines are drawn using the MediaPipe drawing utils. The alignment is pretty good in my opinion.

https://user-images.githubusercontent.com/367139/221160498-58e577c5-e632-4114-a562-33ecec232306.mp4

Here's the Python code for the above video, rendering in OpenGL with pygame. You may need to adjust the focal_length variable for your own camera. Maybe one could use the iris distance measurement to estimate the focal length? Who knows?

https://gist.github.com/eldog/9012ce957be26934044131daffc25c73

(You can press m to show the 2d marker positions and compare the difference, b resets the ball)
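On the focal-length question, one rough option (an assumption on my part, not something from the gist) is to invert the pinhole relation behind MediaPipe Iris's depth estimate, which assumes an average human iris diameter of about 11.7 mm: hold the camera at a known distance once, measure the iris diameter in pixels from the iris landmarks, and solve for the focal length. The numeric values below are illustrative:

# one-off focal-length calibration from a known camera-to-eye distance
IRIS_DIAMETER_MM = 11.7       # average human iris diameter assumed by MediaPipe Iris
known_distance_mm = 500.0     # measured distance from camera to eye during calibration
iris_width_px = 24.0          # iris diameter in pixels, measured from the iris landmarks
focal_length_px = known_distance_mm * iris_width_px / IRIS_DIAMETER_MM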

eldog avatar Feb 24 '23 10:02 eldog

Really cool @eldog! Indeed, the perspective-n-point mapping from @koegl is the right way to solve this, and the approximation he provides (for a camera matrix) should be good enough for most applications.

I was delighted to play with the code from @eldog and get it running locally on a MacBook. To install (coming from a computer with a mostly blank slate), all I had to do was:

pip install opencv-python
pip install pygame
pip install mediapipe    # or, for Macs with M1: pip install mediapipe-silicon
pip install PyOpenGL

Then, after downloading the mediapipe_hands_world_space.py file from the gist link of @eldog above, run it in a terminal: python mediapipe_hands_world_space.py

I can verify it works great.

legel avatar Feb 24 '23 15:02 legel

@legel Glad it works on Mac and with your camera; I'm on Linux using a Logitech C920. Thanks for providing the installation instructions too!

I'm working on a web implementation using OpenCV.js for solvePnP, and will hopefully be able to post a web demo when that's ready.

I think, for anyone else looking into this, perspective-n-point mapping is definitely the approach to take. Given that all the data needed to do it is provided in the results object, maybe it is the MediaPipe developers' intended way too, but it is possibly omitted from the official docs/API because you must have (or guess) the camera's intrinsics.

eldog avatar Feb 24 '23 18:02 eldog

(quoting @eldog's solvePnP demo comment above)

Has anyone succeeded in using this with results.pose_landmarks and results.pose_world_landmarks? The results I'm getting are not coherent... Thanks!

joansc avatar Mar 18 '23 13:03 joansc

Anybody tried using LiDAR depth and segmentation masks from ARKit to enhance 3D hand detection? I have an entirely heuristic-based algorithm here.

philipturner avatar Mar 30 '23 13:03 philipturner

@koegl @eldog @legel

(quoting @koegl's two-step solvePnP answer above)

I am wondering about the mathematical meaning of this code. Based on the definition of PnP (see the OpenCV docs), the camera-frame points should be derived by multiplying the model points by the transformation matrix, not by the inverse of the transformation matrix.

I think the PnP should be solved as follows:

model_points = np.float32([[-l.x, -l.y, -l.z] for l in hand_world_landmarks])
image_points = np.float32([[l.x * frame_width, l.y * frame_height] for l in hand_landmarks])
success, rvec, tvec, = cv2.solvePnP(
                    model_points,
                    image_points,
                    camera_matrix,
                    distortion,
                    flags=cv2.SOLVEPNP_SQPNP
                )
R, _ = cv2.Rodrigues(rvec)
transformation = np.eye(4, dtype=np.float32)
transformation[:3,3] = tvec.squeeze()
model_points_hom = np.concatenate((model_points, np.ones((21, 1))), axis=1)
world_points_in_camera = (model_points_hom @ transformation.T)[:,0:3]
# world_points_in_camera = model_points @ R.T + tvec.T

However, the resulting hand skeleton is not as correct as with the quoted method. This made me very confused. Could you help me explain it?

BTW, I still cannot understand why the rotation is not accounted for in your code.

rzy0901 avatar Apr 04 '23 07:04 rzy0901

Hello @youngstu, Are you still looking for resolution on this issue?

kuaashish avatar May 29 '23 09:05 kuaashish

This issue has been marked stale because it has had no recent activity for 7 days. It will be closed if no further activity occurs. Thank you.

github-actions[bot] avatar Jun 06 '23 02:06 github-actions[bot]

This issue was closed due to lack of activity after being marked stale for the past 7 days.

github-actions[bot] avatar Jun 14 '23 01:06 github-actions[bot]