mediapipe
Two-stage architecture question about the paper
I'm reading the paper and the online documentation of Objectron.
There is a sentence in the paper that I don't understand. https://arxiv.org/pdf/2012.09988.pdf
About the two-stage architecture you wrote: "We also designed a new two-stage architecture for 3D object detection. The first stage estimates a 2D crop of the object of the size 224 × 224 using SSD model[20][16], followed by a second stage model using EfficientNet-Lite [27] architecture which uses the 2D crop to regress the key points of the 3D bounding box. !!! We use a similar EPnP algorithm as in [15] to lift the 2D predicted keypoints to 3D. !!!"
And this is the image of the EfficientNet-Lite network:
But on the online page there is this schema (where it looks like the 3D points are obtained with the EPnP algorithm):
I don't understand whether the 9 points are obtained with the EPnP algorithm (as in the one-stage architecture) or predicted directly by the EfficientNet architecture, as shown in the architecture image.
I would also like to know which architecture the JavaScript API uses: one-stage or two-stage? Is one of the two architectures deprecated and no longer used?
Hi @UX3D-mazzini, to build the Objectron architecture we used the two-stage architecture.
Thanks for the response. And what about the previous question: 'I don't understand whether the 9 points are obtained with the EPnP algorithm (as in the one-stage architecture) or predicted directly by the EfficientNet architecture, as shown in the architecture image.'
So you never used the one-stage architecture?
1st stage = 2D object detection with SSD
2nd stage = keypoint prediction + EPnP
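To make the data flow of the two stages concrete, here is a minimal NumPy sketch. The function bodies are stubs (my own assumptions, not Objectron code); only the shapes follow what is described in this thread: a 224×224 crop in, 18 regressed values out, reshaped into 9 (x, y) keypoints.

```python
import numpy as np

def stage1_detect(image):
    """Stage 1: the SSD detector returns a 224x224 crop containing the object.
    Stubbed here as a simple center crop."""
    h, w = image.shape[:2]
    top, left = (h - 224) // 2, (w - 224) // 2
    return image[top:top + 224, left:left + 224]

def stage2_keypoints(crop):
    """Stage 2: EfficientNet-Lite regresses 18 values, i.e. (x, y) for the
    9 keypoints (8 box corners + center). Stubbed with fake values."""
    assert crop.shape[:2] == (224, 224)
    flat = np.linspace(0.0, 1.0, 18)  # fake network output
    return flat.reshape(9, 2)         # 9 keypoints x (x, y)

image = np.zeros((480, 640, 3), dtype=np.uint8)
crop = stage1_detect(image)
kp2d = stage2_keypoints(crop)
print(kp2d.shape)  # (9, 2)
```

The EPnP step would then lift these nine 2D points to 3D using the camera intrinsics.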
The model above takes a 224x224 crop (containing the object) and produces 9 2D keypoints. (Note: after the embedding we have a fully connected layer and regress 18 values, i.e. the x, y values for the 9 keypoints.) The 9 keypoints are the 8 corners of the bounding box plus the center of the box. These points are the 2D projections of the 3D bounding box; we haven't estimated the depth Z yet. The EPnP algorithm lifts these points to 3D and produces the 9 3D keypoints (x, y, z values), which form the 3D bounding box.
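To illustrate the relationship, here is a small NumPy sketch of the forward direction: projecting the 9 3D box keypoints (center + 8 corners) into the 2D keypoints the network regresses. The intrinsics and box pose are made-up values, not from Objectron. EPnP solves the inverse problem, recovering the 3D points from these 2D projections given the camera intrinsics.

```python
import numpy as np

# Hypothetical pinhole intrinsics for a 224x224 crop (assumed values).
fx = fy = 600.0
cx = cy = 112.0  # principal point at the crop center
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# A unit 3D bounding box: center + 8 corners = 9 keypoints, as in the thread.
corners = np.array([[x, y, z] for x in (-0.5, 0.5)
                              for y in (-0.5, 0.5)
                              for z in (-0.5, 0.5)])
box_3d = np.vstack([[0.0, 0.0, 0.0], corners])  # shape (9, 3)

# Place the box 4 units in front of the camera (no rotation, for simplicity).
points_cam = box_3d + np.array([0.0, 0.0, 4.0])

# Project to 2D: these 9 (x, y) pairs are the 18 values the second stage
# regresses; depth Z is lost in this step, which is why EPnP is needed.
proj = (K @ points_cam.T).T
keypoints_2d = proj[:, :2] / proj[:, 2:3]  # shape (9, 2)

print(keypoints_2d.shape)  # (9, 2)
print(keypoints_2d[0])     # box center projects to the principal point
```

In practice a PnP solver (e.g. an EPnP implementation) takes `keypoints_2d` and `K` and recovers the object pose, giving back the 9 points in 3D.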
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.
Closing as stale. Please reopen if you'd like to work on this further.