
Detecting small objects in big images

Open chi8411 opened this issue 7 years ago • 68 comments

Hi, I want to detect traffic signs in big images. What can I do to improve the accuracy? I also want to try changing the network. Should I just change the convolution layers in the cfg? And if I want to reduce the number of downsampling layers in YOLOv3, how can I do that? Thank you.

chi8411 avatar Apr 05 '19 19:04 chi8411

@chi8411 Hi,

  • You should use width=832 height=832 or width=1024 height=1024 in your cfg-file

  • Also you can try to use these modified cfg-files:

    • https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3_5l.cfg
    • https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3-tiny_3l.cfg

AlexeyAB avatar Apr 05 '19 19:04 AlexeyAB

ok, thank you. I will try.

chi8411 avatar Apr 05 '19 20:04 chi8411

How big, width/height in pixels, are the items you are trying to detect? And how big, resolution-wise, is your source image?

kooscode avatar Apr 06 '19 05:04 kooscode

How big, width/height in pixels, are the items you are trying to detect? And how big, resolution-wise, is your source image?

Hi, my image is 2048x2048. What I want to detect are traffic signs, which are not big.

chi8411 avatar Apr 06 '19 17:04 chi8411

How many pixels in width and height is 'not big'?

kooscode avatar Apr 06 '19 18:04 kooscode

How many pixels in width and height is 'not big'? The width and height are roughly in the range [0, 300], but most fall within [30, 60].

chi8411 avatar Apr 06 '19 20:04 chi8411

If the network resizes 2048 down to 1024, you get an effective size of 15-30 pixels per object, which should be plenty to detect. But with a 1024x1024 network input size you will need a LOT of GPU memory to train, or very small batch sizes, and both training and inference will be very slow.

I suggest a network size of 512x512: chop your main image into 512x512 blocks/tiles and run inference per block. That way you use the full resolution of the source image and you won't lose any detail to resizing. You should also train on the same kind of tiles.

We do this for aerial imagery, and we detect 15x15 pixel objects in 5000x8000 pixel full-resolution images with very high accuracy.
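The chopping step above can be sketched in a few lines. This is a minimal illustration, not the poster's actual pipeline: the numpy array stands in for a decoded image, `chop_into_tiles` is an invented name, and the per-tile detector call is omitted.

```python
# Sketch of the tiling idea: cut a large image into fixed-size blocks
# and run the detector on each block separately.
import numpy as np

def chop_into_tiles(image, tile=512):
    """Yield (x0, y0, block) for non-overlapping tile x tile blocks."""
    h, w = image.shape[:2]
    for y0 in range(0, h, tile):
        for x0 in range(0, w, tile):
            yield x0, y0, image[y0:y0 + tile, x0:x0 + tile]

# A 2048x2048 source image produces 16 blocks of 512x512.
img = np.zeros((2048, 2048, 3), dtype=np.uint8)
blocks = list(chop_into_tiles(img))
print(len(blocks))  # 16
```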

kooscode avatar Apr 07 '19 00:04 kooscode

If the network resizes 2048 down to 1024, you get an effective size of 15-30 pixels per object, which should be plenty to detect. But with a 1024x1024 network input size you will need a LOT of GPU memory to train, or very small batch sizes, and both training and inference will be very slow.

I suggest a network size of 512x512: chop your main image into 512x512 blocks/tiles and run inference per block. That way you use the full resolution of the source image and you won't lose any detail to resizing. You should also train on the same kind of tiles.

We do this for aerial imagery, and we detect 15x15 pixel objects in 5000x8000 pixel full-resolution images with very high accuracy.

So you mean a 2048x2048 picture is cut into 16 pieces of 512x512, is that right? Could an object be cut in two so that it cannot be detected? Sorry, I don't understand "inference per block" and "train with same tiles". Can you explain in detail? Or do you have a paper for reference? Thanks for your help.

chi8411 avatar Apr 07 '19 15:04 chi8411

@kooscode, I'm trying to train YOLOv3 on a drone dataset with different resolutions: HD, Full HD, and 4K. Do I have to change the image width and height, and the anchor boxes, or not? What do you suggest to improve accuracy?

thanks,

wahid18benz avatar Apr 07 '19 17:04 wahid18benz

@chi8411 - Yes, you can cut the 2048x2048 image into 16 images of 512x512.

You can use a sliding window of 512x512 with a stride of 480 pixels or something similar, meaning all the 512x512 squares will have at least 30 pixels of overlap, so you won't miss anything. You will need to remove duplicates, though.
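The overlap arithmetic can be made concrete with a small helper. This is a sketch; `tile_origins` is an illustrative name, not part of darknet, and the defaults assume the 512/480 tile/stride discussed above.

```python
def tile_origins(size, tile=512, stride=480):
    """Top-left origins for a 1-D sliding window; overlap = tile - stride."""
    origins = list(range(0, max(size - tile, 0) + 1, stride))
    if origins[-1] + tile < size:  # make the last window reach the image edge
        origins.append(size - tile)
    return origins

# For a 2048-pixel edge: windows at 0, 480, 960, 1440, plus a final one
# at 1536 so the right/bottom edge is fully covered.
print(tile_origins(2048))  # [0, 480, 960, 1440, 1536]
```

Using the same function for both axes gives the full 2-D grid of window positions.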

@wahid18benz - same questions: what are you trying to detect, how big is it physically, and what GSD (ground sample distance) are you using? Are you flying at constant AGL (altitude above ground level), with terrain following?

kooscode avatar Apr 07 '19 17:04 kooscode

@wahid18benz, the anchor boxes should match the shapes of your objects for better alignment and shape of the predicted boxes. For example, if you use a pre-defined network whose anchor boxes were meant for things like pedestrians and traffic signs (i.e. long rectangles), it will have a hard time accurately aligning perfectly square boxes around objects.

kooscode avatar Apr 07 '19 17:04 kooscode

@kooscode Excuse me - is the picture processed beforehand, or is it cut after the picture enters YOLOv3? Is your sliding window a separate program, or is it added to YOLOv3? I want to know more about this method! I think it's a good way to detect small objects. Thank you.

chi8411 avatar Apr 07 '19 17:04 chi8411

@chi8411 - It is not part of YOLO; I wrote it myself. And yes, the image is processed into tiles of 512x512 (or whatever your network input size is) and then inferenced.

We use a multi-threaded, multi-GPU inferencing system on aerial images: we cut the 512x512 blocks out of the image using a sliding window with a stride, inference the blocks in parallel across multiple GPUs, then remove any duplicates and map the detections back to the original image coordinates.

You essentially end up with a neural net of any size being able to inference an image of any size, and it is very fast and very accurate since the mapping from source image to network input is 1:1 in resolution.

We are working on modifying YOLO so we can use a similar region-proposal algorithm to identify which tiles contain objects of interest and then only run the full neural net on those tiles, but right now the sliding window works well.
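The "remove duplicates and restore original coordinates" steps can be sketched like this. These are illustrative helpers, not the poster's multi-GPU code; the de-duplication here is plain greedy IoU suppression with an assumed 0.5 threshold.

```python
def to_image_coords(box, x0, y0):
    """Shift a tile-space (x1, y1, x2, y2) box back into full-image space."""
    x1, y1, x2, y2 = box
    return (x1 + x0, y1 + y0, x2 + x0, y2 + y0)

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def dedupe(dets, thresh=0.5):
    """Greedy NMS over [(score, box), ...] gathered from all tiles."""
    kept = []
    for score, box in sorted(dets, reverse=True):
        if all(iou(box, k) < thresh for _, k in kept):
            kept.append((score, box))
    return kept

# The same object seen by two overlapping tiles collapses to one detection.
dets = [(0.9, to_image_coords((20, 20, 50, 50), 480, 0)),  # from tile at x0=480
        (0.8, (502, 21, 531, 49))]                          # from tile at x0=0
print(len(dedupe(dets)))  # 1
```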

kooscode avatar Apr 07 '19 17:04 kooscode

@kooscode So suppose you want to train on one picture. Is your YOLO input then 16 pictures with 16 new labels? Or is your input one image of the object with new labels? And what about testing - do you cut, or not? If the test image is not cut, can objects still be detected?

Thank you.

chi8411 avatar Apr 07 '19 18:04 chi8411

@kooscode
I don't have information about GSD and AGL. I'm using the VisDrone dataset: http://aiskyeye.com/views/getInfo?loc=3 I have ten classes: pedestrian, person, car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle.

wahid18benz avatar Apr 07 '19 18:04 wahid18benz

Gotcha. In this case, I would suggest you compute anchor boxes for this particular application.
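AlexeyAB's darknet fork has a built-in `calc_anchors` command for exactly this. The clustering idea behind it can be sketched as a toy Euclidean k-means over (width, height) pairs; note the real tool clusters by IoU distance, which handles scale better, and `kmeans_anchors` is an illustrative name.

```python
def kmeans_anchors(boxes, k=9, iters=50):
    """Toy k-means over (w, h) box sizes, deterministically seeded
    from the first k boxes. darknet's calc_anchors uses IoU distance."""
    centers = boxes[:k]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for w, h in boxes:
            i = min(range(len(centers)),
                    key=lambda c: (w - centers[c][0]) ** 2 + (h - centers[c][1]) ** 2)
            clusters[i].append((w, h))
        centers = [(sum(w for w, _ in cl) / len(cl),
                    sum(h for _, h in cl) / len(cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return sorted(centers)

# Two obvious size groups collapse to two anchors.
boxes = [(10, 10), (100, 100), (11, 9), (102, 98), (9, 11), (98, 102)]
print(kmeans_anchors(boxes, k=2))  # [(10.0, 10.0), (100.0, 100.0)]
```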

kooscode avatar Apr 07 '19 19:04 kooscode

If the network resizes 2048 down to 1024, you get an effective size of 15-30 pixels per object, which should be plenty to detect. But with a 1024x1024 network input size you will need a LOT of GPU memory to train, or very small batch sizes, and both training and inference will be very slow. I suggest a network size of 512x512: chop your main image into 512x512 blocks/tiles and run inference per block. That way you use the full resolution of the source image and you won't lose any detail to resizing. You should also train on the same kind of tiles. We do this for aerial imagery, and we detect 15x15 pixel objects in 5000x8000 pixel full-resolution images with very high accuracy.

@kooscode Do you crop the images before training, or only when detecting objects? I want to detect cars in video filmed from a UAV. The video resolution is 1920x1080, and the target objects are about 15x15 pixels. I tried making my training dataset by cutting the images, and set width/height to 832/832 in the cfg file, but the result is not very good.
Thank you!

Erissonleo avatar Apr 09 '19 10:04 Erissonleo

You have to train and inference with the same full-resolution 512x512 crops from the original image, yes.

kooscode avatar Apr 09 '19 14:04 kooscode

@chi8411 Are you working on the Tsinghua-Tencent 100K dataset? How is the project going? I'm also working on this dataset and am stuck on small-sign detection. I think the approach kooscode advised would still not work well on this dataset, because of the differences between the two problems:

Traffic signs are highly similar to each other, so more detail is needed to distinguish among them. kooscode, however, might only need to detect one class from the air, or classes with significant differences, so fewer details of the object are needed (@kooscode is that the case? Sorry if I'm mistaken). For more examples of this kind of problem, see the paper "Finding Tiny Faces".

What's more, the first two convs in YOLOv3 downsample the image 4x even if you feed in the original resolution. Traffic signs of size [16, 32] (which account for nearly 25% of the dataset) still suffer from the lack of detail.

So I think YOLOv3 may not be the best recipe in this case. A two-stage method with region proposal (RP) and classification might be a better choice, with classification working on regions of the original-resolution input image. The only problems I foresee are:

  1. ROI pooling is not implemented in darknet, so RP and classification may have to run separately.
  2. The two parts of the CNN have to be trained separately, and the classification training set needs to be created from the Tsinghua-Tencent 100K dataset yourself.

ZHI-ANG avatar Apr 23 '19 08:04 ZHI-ANG

Is it possible to start training YOLO with one set of parameters (width and height), stop the training, change these parameters, and then continue training? More precisely, will it negatively affect the decreasing loss curve?

YKritet avatar Jun 21 '19 07:06 YKritet

If the network resizes 2048 down to 1024, you get an effective size of 15-30 pixels per object, which should be plenty to detect. But with a 1024x1024 network input size you will need a LOT of GPU memory to train, or very small batch sizes, and both training and inference will be very slow.

I suggest a network size of 512x512: chop your main image into 512x512 blocks/tiles and run inference per block. That way you use the full resolution of the source image and you won't lose any detail to resizing. You should also train on the same kind of tiles.

We do this for aerial imagery, and we detect 15x15 pixel objects in 5000x8000 pixel full-resolution images with very high accuracy.

I don't understand the "512x512 blocks/tiles" very well - how do I make them? Do we have to use multiple GPUs? I have images of different sizes, ranging from 800x600 to 1270x720, and I want to detect very small objects, like a dash on a car number plate. I'm using a Tesla V100.

faybak avatar Sep 02 '19 12:09 faybak

You cut up the image into 512x512 squares and inference each square.

If you are finding number plates, I would recommend you instead detect the number plate, extract that bounding box as a region of interest, and then feed that into a different network at full resolution.
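Cropping the detected plate as a full-resolution region of interest can be as simple as the sketch below; `crop_roi` and the padding value are illustrative, not part of darknet.

```python
import numpy as np

def crop_roi(image, box, pad=8):
    """Cut a detected (x1, y1, x2, y2) box out of the full-resolution
    frame, with a small margin, clamped to the image bounds."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    x1, y1 = max(0, x1 - pad), max(0, y1 - pad)
    x2, y2 = min(w, x2 + pad), min(h, y2 + pad)
    return image[y1:y2, x1:x2]

# A plate detected at (100, 200)-(300, 260) in an 800x600 frame comes out
# as a 216x76 full-resolution crop, ready for the second network.
frame = np.zeros((600, 800, 3), dtype=np.uint8)
plate = crop_roi(frame, (100, 200, 300, 260))
print(plate.shape)  # (76, 216, 3)
```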


kooscode avatar Sep 03 '19 02:09 kooscode

Is it possible to start training YOLO with one set of parameters (width and height), stop the training, change these parameters, and then continue training? More precisely, will it negatively affect the decreasing loss curve?

Yes, you can. I'd recommend you extract the convolutional weights from your first training run and use them for a transfer-learning job with a bigger input size.

Likewise, once you have trained, you can inference with any network size.

kooscode avatar Sep 03 '19 02:09 kooscode

then feed that into a different network at full resolution.

How can I feed that into a different network at full resolution? When I crop the detected plate and run detection on it for the number plate, it doesn't recognize anything. But if I use the original image without cropping, it can recognize some characters, though not all. How can I fix that?

faybak avatar Sep 03 '19 12:09 faybak

how can I feed that into a different network at full resolution?

Well, if you have the full-resolution image and you crop out the number plate as an ROI, that cropped image is at full resolution, so just feed it into a number plate reader...

https://blog.yellowant.com/automate-license-plate-recognition-in-3-simple-steps-f50886177d2e

kooscode avatar Sep 04 '19 05:09 kooscode

My idea was to use only YOLO for the whole process. Is that possible?

faybak avatar Sep 10 '19 08:09 faybak

Well, I guess you can force a square peg into a round hole with a huge hammer and enough force..

Why not use the right tool for the right job?

https://nanonets.com/blog/attention-ocr-for-text-recogntion/

kooscode avatar Sep 10 '19 15:09 kooscode

We use a multi-threaded, multi-GPU inferencing system on aerial images: we cut the 512x512 blocks out of the image using a sliding window with a stride, inference the blocks in parallel across multiple GPUs, then remove any duplicates and map the detections back to the original image coordinates.

@kooscode does your training set consist of images of 512x512?

My training dataset consists of 256x256 patches, where the objects I want to detect cover 10% to 40% of the image. But then I want to detect these objects in 20000x20000 images!

Should I train using a width and height of 256 and then test with the same configuration, on 256x256 patches split from the original image?

matteoguidi avatar Nov 20 '19 16:11 matteoguidi

@matteoguidi - yes, we trained on 512x512

In your case, you should train on full-resolution 256x256 patches using a network with a matching 256x256x3 input size.

Then during inference you can cut your 20k x 20k image into whatever size you want, as long as you also set your inference network input size to that same size. For example, if you have good hardware and can handle a 928x928 input size, cut your images into blocks of that size and feed them into the network at a 1:1 ratio.

I would also suggest that when you tile your image you use an overlapping stride, so you don't miss objects cut in half (or more), and then de-duplicate after mapping the detected object locations back into the 20k x 20k image.

does that make sense ?

kooscode avatar Nov 20 '19 17:11 kooscode

@kooscode yep, this totally makes sense. I will try to do that and see what results I obtain.

Just one last question: did you connect the layers in the .cfg file the way Alexey did?

for training for small objects (smaller than 16x16 after the image is resized to 416x416) - set layers = -1, 11 instead of https://github.com/AlexeyAB/darknet/blob/6390a5a2ab61a0bdf6f1a9a6b4a739c16b36e0d7/cfg/yolov3.cfg#L720 and set stride=4 instead of https://github.com/AlexeyAB/darknet/blob/6390a5a2ab61a0bdf6f1a9a6b4a739c16b36e0d7/cfg/yolov3.cfg#L717

Thank you very much for your help!

matteoguidi avatar Nov 21 '19 08:11 matteoguidi