mediapipe
Two-stage architecture question about the paper
I'm reading the paper and the online documentation of Objectron.
There is a sentence in the paper that I don't understand. https://arxiv.org/pdf/2012.09988.pdf
About the two-stage architecture you wrote: "We also designed a new two-stage architecture for 3D object detection. The first stage estimates a 2D crop of the object of the size 224 × 224 using SSD model[20][16], followed by a second stage model using EfficientNet-Lite [27] architecture which uses the 2D crop to regress the key points of the 3D bounding box. !!! We use a similar EPnP algorithm as in [15] to lift the 2D predicted keypoints to 3D. !!!"
And this is the image of the EfficientNet-Lite network:
But on the online page there is this schema (where it looks like the 3D points are obtained with the EPnP algorithm):
I don't understand whether the 9 points are obtained with the EPnP algorithm (as in the one-stage architecture) or predicted directly by the EfficientNet architecture, as shown in the architecture image.
I would also like to know which architecture the JavaScript API uses: one-stage or two-stage? Is one of the two architectures deprecated and no longer used?
Hi @UX3D-mazzini, to build the Objectron architecture we used the two-stage architecture.
Thanks for the response. And what about the previous question: 'I don't understand whether the 9 points are obtained with the EPnP algorithm (as in the one-stage architecture) or predicted directly by the EfficientNet architecture, as shown in the architecture image.'
So you never used the one-stage architecture?
1st stage = 2D object detection with SSD
2nd stage = keypoint prediction + EPnP
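To make the data flow of the two stages concrete, here is a minimal NumPy sketch. The function bodies are stubs (my own assumptions, not Objectron code); only the shapes follow what is described in this thread: a 224×224 crop in, 18 regressed values out, reshaped into 9 (x, y) keypoints.

```python
import numpy as np

def stage1_detect(image):
    """Stage 1: the SSD detector returns a 224x224 crop containing the object.
    Stubbed here as a simple center crop."""
    h, w = image.shape[:2]
    top, left = (h - 224) // 2, (w - 224) // 2
    return image[top:top + 224, left:left + 224]

def stage2_keypoints(crop):
    """Stage 2: EfficientNet-Lite regresses 18 values, i.e. (x, y) for the
    9 keypoints (8 box corners + center). Stubbed with fake values."""
    assert crop.shape[:2] == (224, 224)
    flat = np.linspace(0.0, 1.0, 18)  # fake network output
    return flat.reshape(9, 2)         # 9 keypoints x (x, y)

image = np.zeros((480, 640, 3), dtype=np.uint8)
crop = stage1_detect(image)
kp2d = stage2_keypoints(crop)
print(kp2d.shape)  # (9, 2)
```

The EPnP step would then lift these nine 2D points to 3D using the camera intrinsics.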
The model above takes a 224x224 crop (containing the object) and produces 9 2D keypoints. (Note: after the embedding we have a fully connected layer and regress 18 values, i.e. the x, y values for the 9 keypoints.) The 9 keypoints are the 8 corners of the bounding box plus the center of the box. These points are the 2D projections of the 3D bounding box; we haven't estimated the depth Z yet. The EPnP algorithm lifts these points to 3D and produces the 9 3D keypoints (x, y, z values), which form the 3D bounding box.
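To illustrate the relationship, here is a small NumPy sketch of the forward direction: projecting the 9 3D box keypoints (center + 8 corners) into the 2D keypoints the network regresses. The intrinsics and box pose are made-up values, not from Objectron. EPnP solves the inverse problem, recovering the 3D points from these 2D projections given the camera intrinsics.

```python
import numpy as np

# Hypothetical pinhole intrinsics for a 224x224 crop (assumed values).
fx = fy = 600.0
cx = cy = 112.0  # principal point at the crop center
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# A unit 3D bounding box: center + 8 corners = 9 keypoints, as in the thread.
corners = np.array([[x, y, z] for x in (-0.5, 0.5)
                              for y in (-0.5, 0.5)
                              for z in (-0.5, 0.5)])
box_3d = np.vstack([[0.0, 0.0, 0.0], corners])  # shape (9, 3)

# Place the box 4 units in front of the camera (no rotation, for simplicity).
points_cam = box_3d + np.array([0.0, 0.0, 4.0])

# Project to 2D: these 9 (x, y) pairs are the 18 values the second stage
# regresses; depth Z is lost in this step, which is why EPnP is needed.
proj = (K @ points_cam.T).T
keypoints_2d = proj[:, :2] / proj[:, 2:3]  # shape (9, 2)

print(keypoints_2d.shape)  # (9, 2)
print(keypoints_2d[0])     # box center projects to the principal point
```

In practice a PnP solver (e.g. an EPnP implementation) takes `keypoints_2d` and `K` and recovers the object pose, giving back the 9 points in 3D.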
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.
Closing as stale. Please reopen if you'd like to work on this further.