HAND WORLD LANDMARKS collapse for the back of the hand
Have I written custom code (as opposed to using a stock example script provided in MediaPipe)
None
OS Platform and Distribution
Linux Ubuntu 20
MediaPipe Tasks SDK version
0.10.9
Task name (e.g. Image classification, Gesture recognition etc.)
Hand landmark detection
Programming Language and version (e.g. C++, Python, Java)
Python
Describe the actual behavior
In MediaPipe v0.10.9, when detecting and visualizing the back of the hand, the 3D world landmarks of the palm collapse, especially the finger MCPs, producing weird, unusable results. This happens consistently across different lighting conditions, poses, and hands.
Describe the expected behaviour
The 3D landmarks should be consistent even when the back of the hand is shown; they should certainly not collapse.
Standalone code/steps you may have used to try to get what you need
None -- simply run the hand landmarker as described in https://developers.google.com/mediapipe/solutions/vision/hand_landmarker/python and visualize the 3D output (world landmarks).
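For reference, a minimal reproduction sketch along the lines of that guide; the model path and image file below are placeholders:

```python
# Minimal repro sketch based on the hand landmarker Python guide linked above.
# "hand_landmarker.task" and "back_of_hand.jpg" are placeholder paths.
import matplotlib.pyplot as plt
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

options = vision.HandLandmarkerOptions(
    base_options=python.BaseOptions(model_asset_path="hand_landmarker.task"),
    num_hands=1)
detector = vision.HandLandmarker.create_from_options(options)

result = detector.detect(mp.Image.create_from_file("back_of_hand.jpg"))

# Plot the 3D world landmarks (metric coordinates, origin near the hand center).
wl = result.hand_world_landmarks[0]
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter([p.x for p in wl], [p.y for p in wl], [p.z for p in wl])
plt.show()
```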
Other info / Complete Logs
This issue is the same as https://github.com/google/mediapipe/issues/3994, reproduced with the new release (MediaPipe v0.10.9).
Attaching a couple of examples here; see https://github.com/google/mediapipe/issues/3994 for more (same issue).
Wondering if anyone has a workaround/solution for this!
Many thanks
Thanks for raising this. This looks like an issue with the model, but unfortunately we currently do not have plans to update our models.
Hello @schmidt-sebastian!
I've been digging a bit more into this topic and wanted to share some interesting findings:
1) The model generally struggles with world landmarks for the back of the hand.
2) However, the model successfully detects hand world landmarks when the fingers are mostly pointing up (the poses in Figure 1).
3) In other poses (see Figure 2) the world landmarks undergo a strange collapse.
4) I wonder how the model can be successful with the poses in Figure 1 but not with those in Figure 2. The inputs are conceptually the same; most computer vision models should be able to handle the same input when it is rotated.
5) With this intuition in mind, I tried rotating the input frames from Figure 2 so that the hand points in the same direction as in Figure 1 (see the sketch after this list). This produced almost exactly the same result as the non-rotated version (no improvement!).
6) Surprised by this behavior, I dug into the code and realized that MediaPipe already rotates the hand internally according to a similar logic here; this is why 5) produces no improvement: there is already a rotation/alignment step in the MediaPipe graph.
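The rotation experiment in 5) was roughly along these lines; the paths and the 90-degree angle are placeholders for the actual test frames:

```python
# Rough sketch of the rotation experiment in 5): rotate a Figure-2-style frame
# so the fingers point up, then compare world landmarks with the original.
# "hand_landmarker.task" and "fig2_pose.jpg" are placeholder paths.
import cv2
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

options = vision.HandLandmarkerOptions(
    base_options=python.BaseOptions(model_asset_path="hand_landmarker.task"))
detector = vision.HandLandmarker.create_from_options(options)

frame = cv2.cvtColor(cv2.imread("fig2_pose.jpg"), cv2.COLOR_BGR2RGB)
rotated = cv2.rotate(frame, cv2.ROTATE_90_COUNTERCLOCKWISE)

for name, rgb in [("original", frame), ("rotated", rotated)]:
    result = detector.detect(mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb))
    if result.hand_world_landmarks:
        wl = result.hand_world_landmarks[0]
        print(name, [(round(p.x, 3), round(p.y, 3), round(p.z, 3)) for p in wl])
```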
At this point, I suspect that there could be a bug in how this internal rotation/alignment is performed when the back of the hand is shown to the model! In my mind, this is the only explanation for points 4) and 5).
Wondering if there is a rather easy fix at inference time for this issue, instead of having to re-train the model.
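In the meantime, a crude inference-time check I can think of is to flag outputs where the MCP world landmarks bunch together. The helper below and its 1 cm threshold are purely illustrative guesses, not calibrated values:

```python
# Heuristic degeneracy check: on a plausible hand, the index/middle/ring/pinky
# MCP world landmarks span several centimeters; in the collapsed outputs they
# bunch together. The 1 cm threshold is an illustrative guess, not calibrated.
import numpy as np

MCP_INDICES = (5, 9, 13, 17)  # index, middle, ring, pinky MCPs

def looks_collapsed(world_landmarks, min_spread_m=0.01):
    mcps = np.array([(world_landmarks[i].x, world_landmarks[i].y,
                      world_landmarks[i].z) for i in MCP_INDICES])
    # Maximum pairwise distance between the four MCPs, in meters.
    spread = max(np.linalg.norm(a - b) for a in mcps for b in mcps)
    return spread < min_spread_m
```

This would at least let downstream code skip the collapsed frames instead of consuming garbage poses.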
Many thanks for your work; I hope this is useful and rings a bell in someone's mind for a quick inference-time fix!