
Two-stage architecture: question about the paper

Open UX3D-mazzini opened this issue 2 years ago • 3 comments

Solution: Objectron

I'm reading the paper and the online documentation of Objectron.

There is a sentence in the paper that I don't understand. https://arxiv.org/pdf/2012.09988.pdf

About the two-stage architecture, you wrote: "We also designed a new two-stage architecture for 3D object detection. The first stage estimates a 2D crop of the object of the size 224 × 224 using SSD model [20][16], followed by a second stage model using EfficientNet-Lite [27] architecture which uses the 2D crop to regress the key points of the 3D bounding box. We use a similar EPnP algorithm as in [15] to lift the 2D predicted keypoints to 3D." The last sentence, about EPnP, is the one that confuses me.

And this is the image of the EfficientNet-Lite network: [image: EfficientNet-Lite architecture]

But on the online page there is this schema, where it looks like the 3D points are obtained with the EPnP algorithm: [image: pipeline schema]

I don't understand whether the 9 points are obtained with the EPnP algorithm (as in the one-stage architecture) or predicted directly by the EfficientNet architecture, as shown in the architecture image.

Besides, I would like to know which architecture the JavaScript API uses: one-stage or two-stage? Is one of the two architectures deprecated and no longer used?

UX3D-mazzini avatar Jun 22 '22 15:06 UX3D-mazzini

Hi @UX3D-mazzini, we used the two-stage architecture to build Objectron.

sureshdagooglecom avatar Jul 13 '22 06:07 sureshdagooglecom

Thanks for the response. And what about the previous question: 'I don't understand whether the 9 points are obtained with the EPnP algorithm (as in the one-stage architecture) or predicted directly by the EfficientNet architecture, as shown in the architecture image.'

So the one-stage architecture was never used?

UX3D-mazzini avatar Jul 13 '22 07:07 UX3D-mazzini

1st stage = 2D object detection with SSD
2nd stage = keypoint prediction + EPnP
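The two stages above can be sketched as a minimal pipeline. This is only an illustration of the data flow: the function names, shapes, and dummy outputs below are placeholders, not the real MediaPipe API.

```python
# Hedged sketch of the two-stage pipeline described above.
# All names and values here are illustrative assumptions.

def detect_2d_crop(image):
    """Stage 1: an SSD-style 2D detector returns a crop around the object.
    Here we just fake a 224x224 single-channel crop."""
    return [[0.0] * 224 for _ in range(224)]

def regress_keypoints(crop):
    """Stage 2a: an EfficientNet-Lite-style regressor outputs 18 values,
    the (x, y) image coordinates of 9 keypoints (8 corners + center)."""
    return [0.5] * 18  # placeholder network output

def lift_to_3d(keypoints_2d):
    """Stage 2b: an EPnP-style solver lifts the 9 2D keypoints to 3D.
    Real lifting needs camera intrinsics; here we just attach a dummy z."""
    pts = [(keypoints_2d[2 * i], keypoints_2d[2 * i + 1]) for i in range(9)]
    return [(x, y, 1.0) for (x, y) in pts]

crop = detect_2d_crop(image=None)
kp2d = regress_keypoints(crop)      # 18 values = 9 (x, y) pairs
box3d = lift_to_3d(kp2d)            # 9 (x, y, z) points = the 3D box
assert len(kp2d) == 18 and len(box3d) == 9
```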

The model above takes a 224x224 crop (containing the object) and produces nine 2D keypoints. (Note: after the embedding we have a fully connected layer that regresses 18 values, the (x, y) values for the 9 keypoints.) The 9 keypoints are the 8 corners of the bounding box plus the center of the box. These points are the 2D projection of the 3D bounding box; at this point we haven't estimated the depth z yet. The EPnP algorithm lifts these points to 3D and produces nine 3D (x, y, z) keypoints, which form the 3D bounding box.
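To see why the regressed keypoints are only 2D projections and why depth must be recovered by a separate EPnP step, here is a minimal pinhole-camera sketch. The intrinsics and the unit cube below are assumed values for illustration, not Objectron's actual camera or box:

```python
# Minimal pinhole-camera sketch: the 9 keypoints the network predicts are
# the 2D projections of the 3D box; depth is lost by projection.
# fx, fy, cx, cy and the cube geometry are hypothetical values.

fx, fy, cx, cy = 500.0, 500.0, 112.0, 112.0  # assumed 224x224 camera

def project(point_3d):
    """Project a 3D point (x, y, z) in camera coordinates to 2D pixels."""
    x, y, z = point_3d
    return (fx * x / z + cx, fy * y / z + cy)

# 8 corners of a unit cube placed 4 units in front of the camera,
# plus the box center: the 9 keypoints, in 3D.
center = (0.0, 0.0, 4.0)
corners = [(sx * 0.5, sy * 0.5, 4.0 + sz * 0.5)
           for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]
keypoints_3d = [center] + corners
keypoints_2d = [project(p) for p in keypoints_3d]

# The second-stage network only predicts keypoints_2d; an EPnP-style
# solver then finds the 3D box whose projection matches those 9 points.
assert len(keypoints_2d) == 9
assert keypoints_2d[0] == (cx, cy)  # box center projects to the principal point
```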

ahmadyan avatar Aug 05 '22 20:08 ahmadyan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler[bot] avatar Jan 13 '23 06:01 google-ml-butler[bot]

Closing as stale. Please reopen if you'd like to work on this further.

google-ml-butler[bot] avatar Jan 20 '23 07:01 google-ml-butler[bot]
