mediapipe How to process the output of pose detection and pose landmark models in a standalone c++ project

I am building a standalone C++ project for pose detection. The project uses two of the MediaPipe TFLite models (pose_detection and pose_landmark); their output dimensions are attached below.

The pose detection model has two outputs, (2254, 12) and (2254, 1). What do these values correspond to, and how do we post-process them? The MediaPipe webpage says that the output of the pose detector is similar to that of the face detector, plus human body center, radius, and rotation.

The pose landmark model has five outputs: (1,195), (1,1), (1,256,256,1), (1,64,54,39), and (1,117). We understand that (1,1) is a classifier and (1,256,256,1) is a segmentation mask, but the other three outputs are not clear. Here, it says that the pose landmark model detects 33 landmarks in pixel and world space, where each landmark has 4 values (x, y, z, visibility). I am assuming that the shape should correspond to a total of 4 (x, y, z, visibility) * 33 (landmarks) * 2 (pixel and world space). Can you please let me know how to make sense of these two model outputs, and what post-processing they need?

Pose Detection Output

pose_detection

Pose Landmark Output

pose_landmark
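
One possible reading of the two detector tensors, offered as a sketch rather than a confirmed answer: each of the 2254 rows belongs to one anchor, the (2254, 1) tensor holds a raw score that needs a sigmoid, and the 12 values split into 4 box values (centre x, centre y, width, height) plus 4 keypoints with 2 coordinates each. The scale of 224, the value ordering, and the unit-size anchors are assumptions here, and the `Anchor` table itself (one centre per row) is assumed to be generated elsewhere. The code needs C++17 for `std::clamp`.

```cpp
#include <algorithm>
#include <array>
#include <cmath>

// Hypothetical anchor record: centre in normalised [0, 1] image coordinates.
struct Anchor {
    float x_center;
    float y_center;
};

struct Detection {
    float score;                     // probability after sigmoid
    float cx, cy, w, h;              // box in normalised coordinates
    std::array<float, 8> keypoints;  // 4 (x, y) pairs, normalised
};

// Sigmoid with the logit clipped first, so std::exp stays well behaved.
inline float SigmoidClipped(float logit, float clip = 100.0f) {
    logit = std::clamp(logit, -clip, clip);
    return 1.0f / (1.0f + std::exp(-logit));
}

// Decodes one row of the (2254, 12) tensor plus its score from (2254, 1).
// Assumes offsets are expressed in input pixels (scale) relative to the anchor.
Detection DecodeDetection(const float* raw12, float raw_score,
                          const Anchor& anchor, float scale = 224.0f) {
    Detection d{};
    d.score = SigmoidClipped(raw_score);
    d.cx = raw12[0] / scale + anchor.x_center;
    d.cy = raw12[1] / scale + anchor.y_center;
    d.w = raw12[2] / scale;
    d.h = raw12[3] / scale;
    for (int k = 0; k < 4; ++k) {
        d.keypoints[2 * k] = raw12[4 + 2 * k] / scale + anchor.x_center;
        d.keypoints[2 * k + 1] = raw12[4 + 2 * k + 1] / scale + anchor.y_center;
    }
    return d;
}
```

After decoding, one would typically drop rows whose score is below a threshold and merge overlapping boxes with non-maximum suppression; neither step is shown here.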

Jul 20 '22 03:07 UtsaChattopadhyay

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

Aug 09 '22 14:08 google-ml-butler[bot]

Closing as stale. Please reopen if you'd like to work on this further.

Aug 16 '22 14:08 google-ml-butler[bot]

Are you satisfied with the resolution of your issue? Yes No

Aug 16 '22 14:08 google-ml-butler[bot]

Closing as stale. Please reopen if you'd like to work on this further.

Aug 27 '22 14:08 google-ml-butler[bot]

Are you satisfied with the resolution of your issue? Yes No

Aug 27 '22 14:08 google-ml-butler[bot]

Hi @UtsaChattopadhyay , did you figure this out?

Sep 22 '22 19:09 JesperStenberg

Yes we did

Sep 23 '22 06:09 UtsaChattopadhyay

Do you mind sharing what you found for the pose_landmark model? We get a [1, 195] output from "Identity". I've come to understand that this represents 39 arrays of [x, y, z, visibility, presence] (33 points on the body + 6 extra points for next-frame tracking).
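
A minimal sketch of that reading of the [1, 195] tensor, under stated assumptions: 39 consecutive groups of 5 floats, x and y in pixels of a 256 x 256 model input (so dividing by 256 normalises them), and visibility and presence stored as logits that need a sigmoid. `Landmark`, `DecodeLandmarks`, and the constants are hypothetical names used for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

struct Landmark {
    float x, y, z;     // normalised by the model input size
    float visibility;  // probability after sigmoid
    float presence;    // probability after sigmoid
};

constexpr std::size_t kNumLandmarks = 39;      // 33 body + 6 auxiliary
constexpr std::size_t kValuesPerLandmark = 5;  // x, y, z, visibility, presence

inline float Sigmoid(float v) { return 1.0f / (1.0f + std::exp(-v)); }

// Decodes the flat [1, 195] output into 39 landmarks.
// input_size = 256 is an assumption about the model's input resolution.
std::vector<Landmark> DecodeLandmarks(const std::vector<float>& raw,
                                      float input_size = 256.0f) {
    if (raw.size() != kNumLandmarks * kValuesPerLandmark) {
        throw std::invalid_argument("expected 195 values");
    }
    std::vector<Landmark> out;
    out.reserve(kNumLandmarks);
    for (std::size_t i = 0; i < kNumLandmarks; ++i) {
        const float* p = raw.data() + i * kValuesPerLandmark;
        out.push_back(Landmark{p[0] / input_size, p[1] / input_size,
                               p[2] / input_size, Sigmoid(p[3]), Sigmoid(p[4])});
    }
    return out;
}
```

The returned coordinates are relative to whatever image was fed to the model, so if the model ran on a crop they still have to be mapped back to the full frame.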

This works very well when the input scene is easy: x and y track the image perfectly. But if the person is partially off screen, the tracking fails completely, which does not happen in the MediaPipe examples.

Do you have any insights?

Sep 23 '22 10:09 JesperStenberg

@JesperStenberg or @UtsaChattopadhyay , did one of you figure it out for the pose_landmark model? Where did you find the information about 33 + 6 landmarks? Are those 6 at the end or the beginning of the array?

Jan 21 '23 18:01 EinePriseCode

It was a while ago and I don't have it available, but I'm pretty sure that those 6 are at the end of the array. The thing that threw me off was that the person needs to be centred in the image for the model to work.

If you haven't checked this link it might have some good info.

Jan 21 '23 19:01 JesperStenberg

Thanks @JesperStenberg, that was an important hint. Unfortunately, I can't find any documentation that explains the output in more detail, which makes the implementation harder and less clean.

Jan 21 '23 20:01 EinePriseCode

If we have a high enough frame rate, could we find the velocity and acceleration of not only the center but also the limbs? Random 3 am idea; I'm hoping for some feedback, and for a sense of the minimum frame rate needed for usable results.

Feb 09 '23 21:02 Lakshyadevelops

Hello @UtsaChattopadhyay, we are upgrading the MediaPipe Legacy Solutions to the new MediaPipe Solutions. However, the libraries, documentation, and source code for all the MediaPipe Legacy Solutions will continue to be available in our GitHub repository and through library distribution services, such as Maven and NPM.

You can continue to use those legacy solutions in your applications if you choose. However, we would ask you to check the new MediaPipe Solutions, which can help you build and customize ML solutions for your applications more easily. These new solutions will provide a superset of the capabilities available in the legacy solutions. Thank you.

May 05 '23 10:05 kuaashish

This issue has been marked stale because it has no recent activity since 7 days. It will be closed if no further activity occurs. Thank you.

May 13 '23 01:05 github-actions[bot]

This issue was closed due to lack of activity after being marked stale for past 7 days.

May 20 '23 01:05 github-actions[bot]

Are you satisfied with the resolution of your issue? Yes No

May 20 '23 01:05 google-ml-butler[bot]
