Suggestions for Small Object Detection
Any suggestions for configuring the model to detect objects as small as 5x5 pixels? With YOLOv8 I can add a high-resolution P2 detection head that works really well, and on top of that the "mosaic" augmentation additionally helps. Any suggestions would be greatly appreciated.
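For reference, a minimal sketch of that setup with the Ultralytics API (the `yolov8s-p2.yaml` model config and the `mosaic` train argument are part of that library; `data.yaml` is a placeholder for your own dataset):

```python
from ultralytics import YOLO

# Build YOLOv8-small with the extra stride-4 P2 detection head,
# which keeps a higher-resolution feature map for very small objects.
model = YOLO("yolov8s-p2.yaml")

# Mosaic stitches four training images into one, so small objects show up
# at varied positions and scales; it is on by default, set explicitly here.
model.train(data="data.yaml", imgsz=1280, epochs=100, mosaic=1.0)
```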
are you asking about ways to do it within the existing interface, or approaches that could be used to modify the model? mosaic and other augmentations will also work with this model; we just didn't need them to beat yolo on the metrics we care about, so we didn't include them in the repo. you can also upscale your image either during or post training and expect better results on small objects, but that comes with a latency hit.
can you say a bit more about your goal?
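A minimal sketch of that resolution knob, assuming the `rfdetr` pip package's `RFDETRBase` class (its README notes the resolution must be divisible by 56, and larger values trade latency for more pixels per object):

```python
from rfdetr import RFDETRBase

# The base model defaults to 560; a larger, 56-divisible resolution gives
# small objects more pixels at the cost of slower inference.
model = RFDETRBase(resolution=728)  # 728 = 13 * 56

detections = model.predict("frame.jpg", threshold=0.3)
```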
The goal is to use this in a production NVIDIA DeepStream application on real-time RTSP video streams from PTZ cameras. Currently we are using YOLOv8 from Ultralytics, but we want to steer away from anything Ultralytics, as their AGPL-3.0 license puts the models we generate into a gray area and we aren't allowed to share the training code. We use our model to detect small flying objects such as drones or birds, as well as larger targets such as vehicles and humans. Because we are using PTZ cameras, the model needs to handle objects across multiple scales. Our current application detects objects (YOLOv8), automatically takes control of the PTZ camera to start tracking the object, and then auto-zooms in to get enough pixels on target for a secondary image classifier (EfficientNet) to classify the type of object (multi-rotor, bird, fixed-wing, truck, car, human). The model is used on both EO (color) and IR cameras (SWIR, MWIR, LWIR) and works well enough, but we like the license of RF-DETR. As an example, we get a radar hit on a target, slew the camera, and the target will be small, like so (16x12 pixels):
The NVIDIA DeepStream application will automatically take control of the PTZ camera and start to track and zoom in to get enough pixels (24x16 pixels):
Once enough pixels are on target, the algorithm will try to maintain that pixel count until it gets a solid classification (37x23 pixels):
Unfortunately we aren't able to use any slicing techniques, like SAHI, as we have a real-time requirement and slicing is too compute-intensive. Thus, we are looking for a model architecture that can handle objects from very small (5x5 to 10x10 pixels) up to large scales. Any tips or tricks for training RF-DETR to achieve this would be greatly appreciated.
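As a sketch of that control loop (every name here, the detector/classifier interfaces and the PTZ helpers alike, is a hypothetical stand-in, not a DeepStream or RF-DETR API):

```python
# Hypothetical detect -> track -> zoom -> classify loop; illustrative only.
MIN_CLASSIFY_PIXELS = 24 * 16  # rough pixels-on-target before classifying

def process_frame(frame, detector, classifier, ptz):
    detections = detector.predict(frame, threshold=0.3)
    if not detections:
        return None  # nothing found; keep searching
    target = max(detections, key=lambda d: d.confidence)
    ptz.slew_to(target.center)  # keep the object centered
    if target.width * target.height < MIN_CLASSIFY_PIXELS:
        ptz.zoom_in()  # not enough pixels on target yet
        return None
    crop = frame[target.y1:target.y2, target.x1:target.x2]
    return classifier.classify(crop)  # e.g. multi-rotor vs. bird
```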
neat!
what size yolo are you using and at what resolution?
there's some reason to believe that DETR detectors in general are particularly weak on small objects, and we run our detector at a lower resolution, which may compound that. you can of course run ours at a higher resolution, but you'll get a latency hit.
it may be that our model is not a good fit for you but this seems like a fairly niche case and a good opportunity to optimize our model. have you tried training an rf-detr yet?
We are using a few different resolutions of YOLOv8-P2 (768x768, 1024x1024, and 1280x1280). Depending on the hardware available we use the nano (n), small (s), and medium (m) sized models: nano or small for our embedded Jetson deployments, and medium for server (A40/L40) deployments. It also depends on the resolution of the PTZ camera modules (as an example, we have a thermal/visible PTZ unit that outputs 768x576 for all three of its cameras (EO/MWIR/SWIR), and for it we use the 768x768 model). We are able to run these higher resolutions and model sizes in real time by converting to TensorRT with FP16/INT8 precision, with minimal hits to accuracy.
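For reference, a minimal sketch of that export path with the Ultralytics API (the `format="engine"`, `half`, and `int8` arguments exist in recent Ultralytics versions; the checkpoint path is a placeholder, and INT8 additionally needs calibration data):

```python
from ultralytics import YOLO

model = YOLO("your_trained_model.pt")  # placeholder checkpoint

# FP16 TensorRT engine at the deployment resolution.
model.export(format="engine", imgsz=1280, half=True)

# INT8 needs a calibration dataset; `data` points at a dataset yaml.
model.export(format="engine", imgsz=1280, int8=True, data="data.yaml")
```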
The Deformable DETR developers were able to enhance the DETR architecture to handle small objects (https://github.com/fundamentalvision/Deformable-DETR), but it's rather compute-intensive.
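For context, the key mechanism there is multi-scale deformable attention, where each query attends to only $K$ sampled points per feature level instead of every pixel. From the paper:

$$
\mathrm{MSDeformAttn}\left(z_q, \hat{p}_q, \{x^l\}_{l=1}^{L}\right) = \sum_{m=1}^{M} W_m \left[ \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} \cdot W_m' \, x^l\!\left(\phi_l(\hat{p}_q) + \Delta p_{mlqk}\right) \right]
$$

where $z_q$ is the query feature, $\hat{p}_q$ its normalized reference point, $x^l$ the $l$-th feature level, $\Delta p_{mlqk}$ are learned sampling offsets, and $A_{mlqk}$ are learned attention weights. The cost scales with the $M \cdot L \cdot K$ sampled points rather than quadratically with the number of pixels, which is what makes high-resolution multi-scale feature maps tractable.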
I haven't tried training an RF-DETR model yet; I wanted to ask the small-object-detection question before deep-diving. My first thought is to do a base model training and then augment the dataset using Mosaic to see if that helps.
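If it helps, a minimal offline 2x2 mosaic sketch for COCO-style boxes (a simplified version of the YOLO-style augmentation, not something RF-DETR ships; it assumes all four images share the same size and omits the usual random scaling and cropping):

```python
import numpy as np

def mosaic_2x2(images, boxes_per_image):
    """Stitch four equally sized HxWx3 images into one 2Hx2W mosaic.

    boxes_per_image: four lists of [x, y, w, h] COCO boxes.
    Returns the mosaic image and the shifted boxes.
    """
    h, w = images[0].shape[:2]
    canvas = np.zeros((2 * h, 2 * w, 3), dtype=images[0].dtype)
    offsets = [(0, 0), (0, w), (h, 0), (h, w)]  # top-left corner of each tile
    out_boxes = []
    for img, boxes, (dy, dx) in zip(images, boxes_per_image, offsets):
        canvas[dy:dy + h, dx:dx + w] = img
        for x, y, bw, bh in boxes:
            out_boxes.append([x + dx, y + dy, bw, bh])  # shift into the tile
    return canvas, np.array(out_boxes)
```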
we use deformable attention :) it's less compute-intensive with modern model compilation.
try it and see what happens! the rf-detr base at 560 should have similar runtime to yolov8m at 640, and we're hoping to release faster variants soon. curious how it does!
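For a first experiment, a minimal fine-tuning sketch following the `rfdetr` package's documented interface (`dataset_dir` expects a COCO-format dataset; the hyperparameter values below are placeholders, not tuned recommendations):

```python
from rfdetr import RFDETRBase

model = RFDETRBase()

# dataset_dir should contain COCO-format train/valid/test splits.
model.train(
    dataset_dir="path/to/coco_dataset",
    epochs=50,
    batch_size=4,
    grad_accum_steps=4,  # effective batch size of 16
    lr=1e-4,
)
```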
Hi Travis! I have experimented with RF-DETR on small object detection, and RF-DETR's performance is better, especially in cases where object scales differ or some objects were missed by annotators.
The YOLO architectures I used for comparison were YOLOv11n and the YOLOv8-P2 variant. Due to confidentiality concerns I cannot show you my results, but I highly recommend training an RF-DETR model on your own datasets to determine whether it works for your application.
Perhaps all that is required here is a documentation page/tutorial on selecting the best parameters to achieve good performance on small objects. I'm interested in objects that could be as small as 2x2 pixels; I would like something like a `mean_pixels` arg that would lead to the selection of the optimum parameters for targets of that size. I guess this might become a workflow on Roboflow.
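A utility like that could start from something as simple as the mean annotated object size. A minimal sketch against COCO-format annotations (the size-to-resolution mapping here is a made-up heuristic, purely illustrative, as is the `mean_pixels` argument itself):

```python
import json
import math

def mean_object_pixels(annotation_file):
    """Mean sqrt(box area) over a COCO-format annotation file."""
    with open(annotation_file) as f:
        coco = json.load(f)
    sizes = [math.sqrt(a["bbox"][2] * a["bbox"][3]) for a in coco["annotations"]]
    return sum(sizes) / len(sizes)

def suggest_resolution(mean_pixels, target_pixels=16, base=640, divisor=56):
    """Toy heuristic: scale the input so the mean object spans ~target_pixels,
    then snap to a multiple of `divisor` (RF-DETR's resolution constraint)."""
    scale = max(1.0, target_pixels / mean_pixels)
    return divisor * round(base * scale / divisor)
```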
@robmarkcole do other libraries have a utility like that?
@isaacrob-roboflow no I've not seen it elsewhere
ok. in that case we'll keep it in mind but not prioritize :) I am working on something that may significantly help in this case as well
@whittenator did you end up having success here?
Found a relevant paper, along with a dataset for eval:
- https://arxiv.org/abs/2404.03507
- https://github.com/hoiliu-0801/DQ-DETR
What sizes are possible for RF-DETR base? I receive NaN predictions as soon as I increase the size parameter. I also work on a small-object-detection task, and my image resolution is 1942x2590, so it would be ideal to input the maximum width possible.
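One thing worth checking (an assumption based on the RF-DETR README, which states the resolution must be divisible by 56): a size that breaks that constraint is a plausible cause of NaN outputs. A snippet to snap a desired size to the nearest valid one:

```python
def nearest_valid_resolution(desired, divisor=56):
    """Snap a desired input size to the nearest multiple of `divisor`."""
    return divisor * max(1, round(desired / divisor))

print(nearest_valid_resolution(1942))  # -> 1960
print(nearest_valid_resolution(2590))  # -> 2576
```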
Sorry Isaac for the late reply; I got pulled away to support another effort, but I am now back on the computer vision task. I will actually start a training run this week and let you know what I find. Thanks!