dinov2

3D object detection

Open hoangsep opened this issue 1 year ago • 5 comments

What do you guys think about using multiple cameras with dinov2 for 3D object detection for robotics? Does it make sense?

hoangsep avatar Jan 28 '24 03:01 hoangsep

The model takes one image as input. You can process your multiple images sequentially, but then they wouldn't share any information. There are probably better models out there for that, but it could still be interesting to try.

ccharest93 avatar Jan 30 '24 17:01 ccharest93

That's certainly possible. @hoangsep We can work on this together.

dingkwang avatar Jan 31 '24 06:01 dingkwang

@ccharest93 are you aware of any better model for this task? I am a total noob so I am not sure how this can be done. I wonder how companies like Tesla do 3D object detection.

I am thinking of something like stitching multiple camera images together (maybe side by side) and running them through the network? Or having multiple networks run in parallel, then taking all the outputs (from one of the top layers) and passing them through a second network.
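For a rough sense of how these two options compare in cost, here is a back-of-the-envelope sketch. It assumes a ViT with 14-pixel patches (as in the DINOv2 ViT-*/14 models) and 224x224 inputs per camera, and it counts only the token pairs scored by one full self-attention layer; the MLP blocks, which scale linearly with token count, are ignored. The helper names (`patch_tokens`, `attention_pairs`, `compare`) are made up for this illustration.

```python
def patch_tokens(height: int, width: int, patch: int = 14) -> int:
    """Number of patch tokens a ViT produces for one image (no CLS/register tokens)."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return (height // patch) * (width // patch)


def attention_pairs(num_tokens: int) -> int:
    """Token pairs scored by one full self-attention layer (quadratic in tokens)."""
    return num_tokens * num_tokens


def compare(num_cameras: int, height: int = 224, width: int = 224, patch: int = 14) -> dict:
    """Compare stitching all views into one wide image vs. one pass per view."""
    per_view = patch_tokens(height, width, patch)
    # Option A: stitch side by side, then run a single forward pass.
    stitched = attention_pairs(patch_tokens(height, width * num_cameras, patch))
    # Option B: one forward pass per camera, with no attention across views.
    separate = num_cameras * attention_pairs(per_view)
    return {
        "tokens_per_view": per_view,
        "stitched_pairs": stitched,
        "separate_pairs": separate,
        "ratio": stitched / separate,
    }


if __name__ == "__main__":
    for n in (1, 2, 4, 6):
        print(n, compare(n))
```

Under these assumptions the stitched image costs `num_cameras` times more attention pairs than running the views separately, which is the price of letting every patch attend to every other view. A second network over the per-view outputs can add cross-view interaction more selectively.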

hoangsep avatar Jan 31 '24 09:01 hoangsep

@dingkwang I would love to. I am a total noob so I probably won't be able to do much, but I would love to explore this.

hoangsep avatar Jan 31 '24 09:01 hoangsep

I haven't looked at 3D models, but you would probably need something more than stitching. Models are great at learning, but you want to give them as much prior information as possible. Stitching two images together kind of defeats that purpose, since the model would first have to learn to unstitch them (not to mention the poor scaling as the number of images increases: self-attention cost in a transformer grows roughly quadratically with the number of input tokens, not linearly). I do like the idea of first passing each image through a normal model like DINO and then doing something with the resulting patch embeddings to create information channels between similar patches. As for the exact architecture, that's something you'd have to figure out yourself. A good starting point would be setting up this model in inference mode, passing your image sets through it, and then doing statistical analysis on the resulting patch embeddings.
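A minimal sketch of that starting point, under some assumptions: the backbone is loaded through `torch.hub` (`dinov2_vits14` is just one of the published variants), inputs are ImageNet-normalized with sides that are multiples of 14, and "statistical analysis" is taken to mean nearest-neighbour cosine similarity between the patch embeddings of two views. That statistic is only one possible choice, and the helper names (`load_dinov2`, `extract_patch_tokens`, `cross_view_stats`) are hypothetical, not part of the DINOv2 repository.

```python
import numpy as np


def load_dinov2(name: str = "dinov2_vits14"):
    """Load a pretrained DINOv2 backbone from torch.hub and switch it to eval mode."""
    import torch  # imported lazily so the analysis helpers below work without PyTorch

    model = torch.hub.load("facebookresearch/dinov2", name)
    return model.eval()


def extract_patch_tokens(model, images):
    """images: float tensor (N, 3, H, W), normalized, H and W multiples of 14.

    Returns a NumPy array of shape (N, num_patches, dim) with the patch embeddings.
    """
    import torch

    with torch.inference_mode():
        out = model.forward_features(images)
    return out["x_norm_patchtokens"].cpu().numpy()


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between patch embeddings a (P, D) and b (Q, D)."""
    a = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1e-12)
    b = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), 1e-12)
    return a @ b.T


def cross_view_stats(a: np.ndarray, b: np.ndarray) -> dict:
    """For each patch of view a, find its most similar patch in view b."""
    sim = cosine_similarity_matrix(a, b)
    best = sim.max(axis=1)
    return {
        "best_match_index": sim.argmax(axis=1),
        "best_match_similarity": best,
        "mean_best": float(best.mean()),
    }


# Usage sketch (downloads weights on first call; `views` is a placeholder batch
# holding two camera images, shape (2, 3, 224, 224)):
#   model = load_dinov2()
#   tokens = extract_patch_tokens(model, views)
#   print(cross_view_stats(tokens[0], tokens[1])["mean_best"])
```

Looking at where the best matches land, and how confident they are, gives a first indication of whether the embeddings of overlapping camera views already line up well enough to build a cross-view module on top of them.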

ccharest93 avatar Jan 31 '24 11:01 ccharest93