mediapipe Avoiding predicting occluded (hand) landmarks

Currently, because the model is a neural network, it predicts every landmark regardless of whether that landmark is in view. This applies both when landmarks are self-occluded by the person's hand and when the hand is partially outside the image viewport. Given how it is trained, the model will always produce all of its outputs.

But since the model is trained on a very large proportion of synthetic data, it should be possible to also train it to tell whether a landmark is in sight or not, with respect to the synthesized viewport, by geometrically projecting the synthetic hand relative to the simulated camera position. One could then add a binary vector representing occlusion/visibility to the training regime, so that downstream consumers of the model outputs may opt to ignore occluded landmarks in those natural scenarios where relying on them is not desirable.
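A minimal sketch of how the out-of-viewport part of such labels could be derived for a synthetic sample. It assumes a simple pinhole camera with the principal point at the image centre and landmarks already expressed in the camera frame; the function name and parameters are hypothetical and not part of mediapipe. Self-occlusion would additionally need a depth test against the synthetic hand mesh, which is not shown here.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def in_viewport_flags(
    landmarks_cam: Sequence[Point3D],
    focal_px: float,
    width: int,
    height: int,
) -> List[int]:
    """Return 1 for each landmark that projects inside the image, else 0.

    landmarks_cam: 3D points in the camera frame (x right, y down, z forward).
    Assumption: pinhole camera, principal point at the image centre.
    """
    cx, cy = width / 2.0, height / 2.0
    flags = []
    for x, y, z in landmarks_cam:
        if z <= 0.0:
            # Behind the camera plane: never in view.
            flags.append(0)
            continue
        # Perspective projection to pixel coordinates.
        u = focal_px * x / z + cx
        v = focal_px * y / z + cy
        flags.append(1 if 0.0 <= u < width and 0.0 <= v < height else 0)
    return flags


# Three synthetic landmarks: centred, far to the right, behind the camera.
flags = in_viewport_flags(
    [(0.0, 0.0, 0.5), (0.4, 0.0, 0.5), (0.0, 0.0, -0.1)],
    focal_px=500.0, width=640, height=480,
)
print(flags)
```

The resulting 0/1 vector is the kind of extra training target the proposal has in mind, one entry per landmark.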

Although occluded landmarks do provide signal, that signal may reflect the training data distribution more than the input at hand. The suggestion is therefore to let downstream applications and/or downstream models curtail the impact of occluded landmarks.
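If the model exposed such a per-landmark visibility score, a consumer could filter on it roughly as follows. The visibility vector, the function name and the 0.5 threshold are all assumptions for illustration; the hand model does not currently output this.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def keep_visible(
    landmarks: Sequence[Point],
    visibility: Sequence[float],
    threshold: float = 0.5,
) -> List[Tuple[int, Point]]:
    """Return (index, point) pairs whose visibility score meets the threshold.

    `visibility` is a hypothetical per-landmark score in [0, 1].
    """
    return [
        (i, p)
        for i, (p, v) in enumerate(zip(landmarks, visibility))
        if v >= threshold
    ]


usable = keep_visible(
    [(0.1, 0.2), (0.3, 0.4), (0.5, 0.6)],
    [0.9, 0.1, 0.5],
)
print(usable)
```

Keeping the original indices lets the caller still know which landmarks survived the filter.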

A counter-argument is that occlusions can, to an extent, be interpolated by cross-frame (i.e. temporal) algorithms or architectures. That would mean extending mediapipe's landmark prediction to train and predict over the temporal domain, which it currently does not do: at present the hand landmarks are predicted with only an indirect temporal signal, via the predicted palm location.
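As a rough illustration of the simplest cross-frame approach, the sketch below linearly interpolates one landmark's 2D track across frames where it is flagged as not visible. It is a plain post-processing step, not a temporal model, and it still presupposes some per-frame visibility flag; the function name and inputs are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def interpolate_occluded(
    track: Sequence[Point],
    visible: Sequence[bool],
) -> List[Point]:
    """Fill occluded frames by linear interpolation between visible neighbours.

    Frames before the first or after the last visible frame are left unchanged.
    """
    filled = list(track)
    known = [i for i, v in enumerate(visible) if v]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            t = (i - a) / (b - a)  # fraction of the way from frame a to frame b
            filled[i] = tuple(
                pa + t * (pb - pa) for pa, pb in zip(track[a], track[b])
            )
    return filled


# The middle frame is occluded, so its unreliable prediction is replaced.
smoothed = interpolate_occluded(
    [(0.0, 0.0), (9.0, 9.0), (2.0, 4.0)],
    [True, False, True],
)
print(smoothed)
```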

Since the hand landmarks are very far from describing the shape of the palm region (unlike, e.g., the face mesh model), applying a geometric hand model as a post-processing step to remove occluded predictions would likely be rather noisy. Such an approach may still be feasible in applications that model a particular person's hand from multiple angles over time, leveraging only modest assumptions about palm biomechanics.

That said, having described those alternatives, teaching a neural model that feeds on synthetic data (as well as real-world human-annotated data) that a landmark is not observed seems like a solid path to improving the signal, perhaps at the cost of just a few more model parameters.
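One way such a training objective could look, written in plain Python for readability rather than in any particular framework. It adds a binary cross-entropy term for a hypothetical visibility output to the usual coordinate regression. Restricting the regression term to visible landmarks is a design choice made here for illustration, not something the proposal prescribes; with synthetic data the true positions of occluded landmarks are known, so regressing on all of them is equally possible.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def landmark_loss(
    pred_xy: Sequence[Point],
    pred_vis_prob: Sequence[float],
    true_xy: Sequence[Point],
    true_vis: Sequence[int],
    vis_weight: float = 1.0,
) -> float:
    """Squared error on visible landmarks plus weighted BCE on visibility."""
    eps = 1e-7
    # Binary cross-entropy over the (hypothetical) visibility output.
    bce = 0.0
    for p, t in zip(pred_vis_prob, true_vis):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        bce -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    bce /= len(true_vis)

    # Coordinate regression, counted only where the landmark is visible.
    sq_err, n_visible = 0.0, 0
    for (px, py), (tx, ty), t in zip(pred_xy, true_xy, true_vis):
        if t:
            sq_err += (px - tx) ** 2 + (py - ty) ** 2
            n_visible += 1
    regression = sq_err / n_visible if n_visible else 0.0

    return regression + vis_weight * bce


loss = landmark_loss(
    pred_xy=[(0.0, 0.0), (5.0, 5.0)],
    pred_vis_prob=[0.5, 0.5],
    true_xy=[(0.0, 0.0), (1.0, 1.0)],
    true_vis=[1, 0],
)
print(loss)
```

In this example the second landmark is occluded, so its large coordinate error does not contribute; only the uncertain visibility predictions are penalised.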

Caveat: this does require strict annotation rules for cases where enough of the finger segment is visible but the center point that the landmark refers to is not.

Jan 22 '22 11:01 matanox

It would be great to have a visibility vector in the output for cases when the face is occluded by any kind of object. It would then be possible to capture a photo of the face from a video stream only when it is fully visible.

Sep 14 '22 11:09 lukesolo

Hello @matanster, We are upgrading the MediaPipe Legacy Solutions to new MediaPipe solutions. However, the libraries, documentation, and source code for all the MediaPipe Legacy Solutions will continue to be available in our GitHub repository and through library distribution services, such as Maven and NPM.

You can continue to use those legacy solutions in your applications if you choose. However, we would ask you to check out the new MediaPipe solutions, which can help you more easily build and customize ML solutions for your applications. These new solutions will provide a superset of the capabilities available in the legacy solutions. Thank you.

May 04 '23 09:05 kuaashish

This issue has been marked stale because it has had no activity for 7 days. It will be closed if no further activity occurs. Thank you.

May 12 '23 01:05 github-actions[bot]

This issue was closed due to lack of activity after being marked stale for the past 7 days.

May 19 '23 01:05 github-actions[bot]

Are you satisfied with the resolution of your issue? Yes No

May 19 '23 01:05 google-ml-butler[bot]
