super-gradients YOLO-NAS L fails to detect small objects

Describe the bug

Hi all! 👋🏻 I'm currently making a YouTube tutorial where I show people how to train YOLO-NAS on custom datasets and how to use it in custom applications. I wanted to focus on the model's ability to detect small objects, and because I have some experience with football ⚽ detection, that is the use case I went for. Unfortunately, for now I'm stuck, as the model fails to detect the ball most of the time. I tried both a model pre-trained on COCO and a model fine-tuned on my own dataset; here are the results:

model pre-trained on COCO

https://github.com/Deci-AI/super-gradients/assets/26109316/582a9cf1-339c-4330-a91f-7e17c8014635

model fine-tuned on my own dataset

https://github.com/Deci-AI/super-gradients/assets/26109316/673facd3-00a5-42a7-80d4-a28cd9fd9e6a

As you can see in both cases I'm doing much worse than in your example video:

https://github.com/Deci-AI/super-gradients/assets/26109316/801fed65-0c76-45fc-bde2-e452de8455ed

Could you give me some guidelines on how to replicate your result? Or take a look at my code and point out any mistakes I made?

To Reproduce

Here is my training notebook that I plan to share as part of the tutorial: https://colab.research.google.com/drive/1-qwFVhis4wMAv3if_L9v-OJNQP7gErLW?usp=sharing
Here is my inference code:

import numpy as np
import supervision as sv
import torch
from super_gradients.training import models

# HOME, CLASSES and SOURCE_VIDEO_PATH are assumed to be defined earlier in the notebook.
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
MODEL_ARCH = 'yolo_nas_l'
CHECKPOINT_PATH = f'{HOME}/models/yolo_nas_football_players_detection.pth'
CONFIDENCE_THRESHOLD = 0.3

yolo_nas_model = models.get(
    MODEL_ARCH,
    num_classes=len(CLASSES),
    checkpoint_path=CHECKPOINT_PATH
).to(DEVICE)

box_annotator = sv.BoxAnnotator()


def process_frame(scene: np.ndarray, index: int) -> np.ndarray:
    # Run YOLO-NAS on a single frame and convert the result to supervision detections.
    result = list(yolo_nas_model.predict(scene, conf=CONFIDENCE_THRESHOLD))[0]
    detections = sv.Detections.from_yolo_nas(result)

    labels = [
        f"{CLASSES[class_id]} {confidence:0.2f}"
        for _, _, confidence, class_id, _
        in detections]

    annotated_frame = box_annotator.annotate(scene=scene.copy(), detections=detections, labels=labels)
    return annotated_frame


sv.process_video(
    source_path=SOURCE_VIDEO_PATH,
    target_path='result.mp4',
    callback=process_frame
)

Additional context

Here is the test video I use during inference:

https://github.com/Deci-AI/super-gradients/assets/26109316/8147bb17-47f6-45f5-9b54-41af75bf5a82

May 10 '23 16:05 SkalskiP

I think it is just the video you chose. You see, the model is predicting in 640x640 (you can train it on higher resolution, of course). In our example, the ball is approximately 5-10 pixels, while in your video, it is around 2-4 pixels. When downscaling the image to 640, the 2-4 pixels become less than one pixel in some cases. I suggest choosing a video where the camera is less zoomed out, or training on a higher resolution (1024x1024 maybe). Hope that helps
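To make the downscaling argument concrete, here is a small arithmetic sketch. It assumes a 1920x1080 source frame and an aspect-preserving resize to 640x640; both are assumptions for illustration, and `size_after_resize` is a hypothetical helper, not part of super-gradients.

```python
def size_after_resize(object_px, frame_wh, target_wh=(640, 640)):
    """Return an object's size in pixels after an aspect-preserving resize."""
    frame_w, frame_h = frame_wh
    target_w, target_h = target_wh
    # With an aspect-preserving resize, the tighter of the two ratios applies.
    scale = min(target_w / frame_w, target_h / frame_h)
    return object_px * scale


# Ball sizes of 2-4 px, as estimated above for the zoomed-out video.
for ball_px in (2, 3, 4):
    resized = size_after_resize(ball_px, frame_wh=(1920, 1080))
    print(f"{ball_px} px ball -> {resized:.2f} px at 640x640")
```

Under these assumptions the scale factor is 1/3, so a 2 px ball ends up smaller than a single pixel after resizing.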

May 10 '23 18:05 ofrimasad

Hi @ofrimasad 👋🏻 I did some testing. I took the same video and zoomed in on it by 10%, 25%, and 50%. Then I ran inference on all of those videos. Here are the results. I removed every other class to make the ball more visible. I don't see much improvement.

https://github.com/Deci-AI/super-gradients/assets/26109316/513dd852-f982-4bd0-ab36-aff59de74c8f

Here is also 100% zoom. It is a noticeable improvement in detection quality, but at this point, we are running inference on 1/4 of the original frame.

https://github.com/Deci-AI/super-gradients/assets/26109316/0e63dc02-cfdf-49ec-91b5-9fbc8bc4dce6
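For anyone who wants to repeat the zoom experiment, a minimal sketch of a center crop is below. It assumes that "zoom by X%" means magnifying by a factor of 1 + X/100 around the frame center; `center_crop` is a hypothetical helper and the 1920x1080 frame size is an assumption.

```python
import numpy as np


def center_crop(frame: np.ndarray, zoom_percent: float) -> np.ndarray:
    """Crop the central region that a given zoom level would keep."""
    factor = 1.0 + zoom_percent / 100.0
    height, width = frame.shape[:2]
    crop_h = int(round(height / factor))
    crop_w = int(round(width / factor))
    top = (height - crop_h) // 2
    left = (width - crop_w) // 2
    return frame[top:top + crop_h, left:left + crop_w]


# Dummy frame standing in for one video frame (assumed 1920x1080).
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
for zoom in (10, 25, 50, 100):
    crop = center_crop(frame, zoom)
    kept = crop.shape[0] * crop.shape[1] / (frame.shape[0] * frame.shape[1])
    print(f"{zoom}% zoom keeps {crop.shape[1]}x{crop.shape[0]} ({kept:.0%} of the frame)")
```

With this definition, 100% zoom keeps a 960x540 region, which is the 1/4 of the original frame mentioned above.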

May 10 '23 20:05 SkalskiP

@ofrimasad, could you give me some hints on 1024x1024 resolution training? I think it's worth trying.

May 10 '23 20:05 SkalskiP

I'd suggest increasing the image resolution during fine-tuning to match the aspect ratio of the input video. For COCO we used 640x640, which may not be enough if your input image has 1920x1080 resolution. Why? Because for inference we resize (and pad if necessary) to the resolution used during training. As you can imagine, resizing 1920x1080 -> 640x640 leads to the loss of small objects. Tile-based inference is on the roadmap, but it is not here yet.

So what you can do now is the following:

Increase resolution during training:

val_dataset_params.input_dim = (720, 1280)
train_dataset_params.input_dim = (720, 1280)

Override inference params to run on full-resolution image:

from super_gradients.training.processing import ComposeProcessing, StandardizeImage, ImagePermute
image_processor = ComposeProcessing(
    [
        StandardizeImage(max_value=255.0),
        ImagePermute(permutation=(2, 0, 1)),
    ]
)
yolo_nasmodel.set_dataset_processing_params(image_processor=image_processor)

Or set processing image size from 640x640 to 1280x720:

yolo_nasmodel._image_processor.processings[0].output_shape = (720, 1280)
yolo_nasmodel._image_processor.processings[1].output_shape = (720, 1280)

Play with threshold parameters:

yolo_nasmodel.predict(image, conf=0.25)

Looking forward to hearing whether it helps.
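Since tile-based inference is not available in the library yet, one possible workaround is to split the frame into overlapping tiles yourself. The sketch below only computes tile coordinates; `tile_boxes`, the 640 px tile size and the 64 px overlap are all hypothetical choices, and running the model on each tile and merging the detections is left out.

```python
def tile_boxes(frame_w, frame_h, tile=640, overlap=64):
    """Return (x1, y1, x2, y2) tiles that together cover the whole frame."""
    if overlap >= tile:
        raise ValueError("overlap must be smaller than the tile size")
    stride = tile - overlap

    def starts(length):
        # A frame side no longer than one tile needs a single tile.
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile, stride))
        # The last tile is placed flush with the frame edge.
        positions.append(length - tile)
        return positions

    return [
        (x, y, min(x + tile, frame_w), min(y + tile, frame_h))
        for y in starts(frame_h)
        for x in starts(frame_w)
    ]


tiles = tile_boxes(1920, 1080)
print(len(tiles), "tiles, first:", tiles[0], "last:", tiles[-1])
```

Each tile could then be cropped from the frame, passed to `predict`, and the resulting boxes shifted back by the tile's `(x1, y1)` offset before a final non-maximum suppression step. The cost is one forward pass per tile, so latency grows with the tile count.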

May 10 '23 22:05 BloodAxe

Hi @BloodAxe 👋🏻! Thanks a lot for all those ideas. I was a bit busy recording the first part of my YOLO-NAS tutorial. I'll make sure to try all your suggestions tomorrow.

May 11 '23 21:05 SkalskiP

@SkalskiP Also looking forward to your progress and great videos :) @BloodAxe, can you please elaborate on steps 1 & 3 you mentioned? Where do I make the change for step 1? I tried adding these keys to the coco_detection_yolo_format_train and coco_detection_yolo_format_val dictionaries, but I don't know whether I also need to modify the augmentations, or whether the model architecture needs to be modified to accept such images. Finally, on step 3, which processing do you refer to? Thanks!

May 16 '23 09:05 matanj83

I'm also having issues with changing image size from the default of 640, 640 and have tried the suggestions above @BloodAxe.

Setting input_dim as follows results in a visualisation of the transformed training data in which the annotations don't seem to match the resized images:

train_data.dataset.input_dim = (896, 1280)
val_data.dataset.input_dim = (896, 1280)
test_data.dataset.input_dim = (896, 1280)

If every input_dim in the transforms is also updated to (896, 1280), the transformed images are padded to half the total image. If it is set to (1280, 1280), the images look normal but the aspect ratio is lost.

im_dims = (896, 1280)
transforms = [
    {'DetectionMosaic': {'input_dim': im_dims, 'prob': 1.0}},
    {'DetectionRandomAffine': {'degrees': 10.0, 'translate': 0.1, 'scales': [0.1, 2], 'shear': 2.0, 'target_size': im_dims,
                               'filter_box_candidates': True, 'wh_thr': 2, 'area_thr': 0.1, 'ar_thr': 20}},
    {'DetectionMixup': {'input_dim': im_dims, 'mixup_scale': [0.5, 1.5], 'prob': 1, 'flip_prob': 0.5}},
    {'DetectionHSV': {'prob': 1.0, 'hgain': 5, 'sgain': 30, 'vgain': 30}},
    {'DetectionHorizontalFlip': {'prob': 0.5}},
    {'DetectionPaddedRescale': {'input_dim': im_dims, 'max_targets': 120}},
]

train_data = coco_detection_yolo_format_train(
    dataset_params={
        'data_dir': dataset_params['data_dir'],
        'images_dir': dataset_params['train_images_dir'],
        'labels_dir': dataset_params['train_labels_dir'],
        'classes': dataset_params['classes'],
        'input_dim': im_dims,
        'transforms': transforms
    },
    dataloader_params={
        'batch_size': BATCH_SIZE,
        'num_workers': 2
    }
)

What is the best way to increase the size of the image when training? I have quite a few small objects and was hoping to test different image sizes without tiling.

May 17 '23 21:05 geezacoleman

Can someone please figure out a concise way to modify either the Deci notebook or the Roboflow notebook to change the image size during training?

May 18 '23 18:05 jamesquags

@jamesquags I'm happy to add this to Roboflow's notebook, but I'd need some guidelines from @BloodAxe. Do I need to add steps 1. and 2. to make it work?

Training stage:

val_dataset_params.input_dim = (720, 1280)
train_dataset_params.input_dim = (720, 1280)

Inference stage:

yolo_nasmodel._image_processor.processings[0].output_shape=(720, 1280)
yolo_nasmodel._image_processor.processings[1].output_shape=(720, 1280)

May 18 '23 23:05 SkalskiP

Thanks for putting together that notebook/video @SkalskiP - was very helpful.

Part of the issue is that changing only step 1 doesn't seem to change the dimensions.

This is what I did:

val_dataset_params.input_dim = (720, 1280)
train_dataset_params.input_dim = (720, 1280)

model._image_processor.processings[0].output_shape=(720, 1280)
model._image_processor.processings[1].output_shape=(720, 1280)

This is the log file for the run:

    "dataset_params": {
        "train_dataset_params": {
            "data_dir": "data",
            "images_dir": "images/train",
            "labels_dir": "labels/train",
            "classes": [
                "PA-1",
                "PA-2",
                "PA-3",
                "PA-4",
                "PA-5",
                "SPA-1",
                "SPA-2",
                "SPA-3"
            ],
            "input_dim": "[640, 640]",
            "cache_dir": null,
            "cache": false,
            "transforms": "

If I run train_data.dataset.input_dim = (720, 1280), the result is transformed data that is still 640 x 640 and doesn't fit the annotations:

Have I used this method incorrectly?

I wonder if super-gradients could adopt the YOLO v3 to v8 approach of using an --img-size (or similar) flag that sets the image size and rescales + pads to keep the aspect ratio.
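For reference, the "rescale + pad to keep aspect ratio" behaviour can be sketched in plain NumPy as below. This is an illustrative stand-in, not how super-gradients or any YOLO repository implements it: `letterbox` is a hypothetical helper, it uses nearest-neighbour resizing to avoid extra dependencies, pads on the bottom and right only, and the grey pad value of 114 is an assumed convention.

```python
import numpy as np


def letterbox(image: np.ndarray, target_hw, pad_value=114):
    """Resize keeping aspect ratio, then pad bottom/right to target_hw."""
    target_h, target_w = target_hw
    src_h, src_w = image.shape[:2]
    # One scale for both axes keeps the aspect ratio intact.
    scale = min(target_h / src_h, target_w / src_w)
    new_h, new_w = int(round(src_h * scale)), int(round(src_w * scale))

    # Nearest-neighbour resize via index lookup.
    rows = np.minimum((np.arange(new_h) / scale).astype(int), src_h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), src_w - 1)
    resized = image[rows][:, cols]

    canvas = np.full((target_h, target_w) + image.shape[2:], pad_value, dtype=image.dtype)
    canvas[:new_h, :new_w] = resized
    return canvas, scale


# Assumed 1920x1080 frame, padded into the (896, 1280) shape used above.
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
padded, scale = letterbox(frame, (896, 1280))
print(padded.shape, round(scale, 4))
```

Because the padding is added only on the bottom and right, box annotations just need to be multiplied by the returned `scale` to stay aligned with the padded image.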

May 19 '23 07:05 geezacoleman

@geezacoleman great to hear you found it helpful. I'll keep my eye on that issue and update the notebook with custom input image size whenever we come to concrete conclusions.

@ofrimasad and @BloodAxe, if I can suggest something, @geezacoleman is probably right. Right now, changing the input resolution is not intuitive enough. :(

May 19 '23 21:05 SkalskiP

@BloodAxe any updates on changing the image size? I'm in the process of comparing all recent YOLO versions for weed detection in agriculture and would like to include YOLO-NAS but the image size needs to be consistent among them all.

I tried training the YOLO-NAS L model using the approach above but the model did not train well (effectively 0.00 mAP@0.5). With 640 x 640 resolution the mAP@0.5 was still quite low (~0.3) - so it would be nice to test on higher resolution and benchmark with the others.

May 24 '23 06:05 geezacoleman

Hello @SkalskiP @geezacoleman @ofrimasad @BloodAxe,

It seems the above methods did not work for you, and they did not work for me either. Any updates on how to change the input size?

Jun 11 '23 08:06 AlimTleuliyev

@AlimTleuliyev nope it didn’t work for me :/

Jun 11 '23 16:06 SkalskiP

I also tried training the three YOLO-NAS model versions to detect basketball players, but the results were quite low (~0.4). By the way, if I use an image size of 720x1280, is the image automatically resized to 640x640?

Jun 14 '23 12:06 vodanhday

They didn't work for me either, unfortunately. Performance on my dataset was quite low with 640 x 640 and with the attempted 1280 x 1280, so I ended up dropping it from the comparison and didn't look into it further. I think DetectionMixup can decrease performance, so setting its probability to 0 might be an opportunity to improve it.
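A minimal sketch of that DetectionMixup idea, using the transforms-list format shown earlier in the thread. `disable_transform` is a hypothetical helper, and it is an assumption that a `prob` of 0 effectively turns the transform off; whether this improves accuracy has not been tested here.

```python
import copy


def disable_transform(transforms, name):
    """Return a copy of `transforms` with the named transform's prob set to 0."""
    updated = copy.deepcopy(transforms)
    for entry in updated:
        if name in entry:
            entry[name]["prob"] = 0.0
    return updated


# Shortened example list; the values are illustrative only.
transforms = [
    {"DetectionMosaic": {"input_dim": (640, 640), "prob": 1.0}},
    {"DetectionMixup": {"input_dim": (640, 640), "mixup_scale": [0.5, 1.5], "prob": 1.0, "flip_prob": 0.5}},
    {"DetectionHorizontalFlip": {"prob": 0.5}},
]
no_mixup = disable_transform(transforms, "DetectionMixup")
print(no_mixup[1])
```

The resulting list could then be passed as the 'transforms' entry of dataset_params, in the same way as in the training snippet above.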

Jun 14 '23 12:06 geezacoleman

I think we should make a Colab notebook showing how to train YOLO-NAS on bigger images, as quite a lot of people are facing this issue.

Aug 10 '23 13:08 BloodAxe

I think we should make a Colab notebook showing how to train YOLO-NAS on bigger images, as quite a lot of people are facing this issue.

Any work done on this, @BloodAxe?

Aug 23 '23 06:08 Dev-Devesu

Hello, I join the request to add to the notebook the option to change the dimensions of the images :v

Has there been any news about this?

Aug 26 '23 22:08 leandrotorrent

@BloodAxe Increasing the image resolution certainly helps with detecting small objects, but it also has a heavy impact on inference latency. I wonder if some tweaks could be made to the model itself that help with small objects, but don't cost as much latency as increasing the resolution?

Jan 04 '24 17:01 vilkkiE

@vilkkiE in the context of inference or training?

When running inference you clearly don't want to scale the image down, as it would lead to losing image details. So that is not an option. So we introduced a skip_image_resizing option in model.predict() to tell the prediction pipeline to skip image resizing and use the original image resolution.

Jan 09 '24 14:01 BloodAxe

in the context of inference or training?

@BloodAxe If your question refers to the latency, I mean inference latency.

Naturally you can skip the resize and infer on the original image resolution, and the higher the resolution the better the accuracy, especially on small objects, but higher resolution also causes higher inference latency. I'm wondering if there could be sort of an alternative/tweaked architecture of the model that performs better on small objects but doesn't increase the inference latency as much as higher resolution does.
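As a rough back-of-the-envelope view of that trade-off, the sketch below compares pixel counts against the 640x640 baseline. It assumes that the cost of a convolutional forward pass grows roughly in proportion to the number of input pixels; real latency depends on hardware and batch size, and nothing here was measured. `relative_pixel_count` is a hypothetical helper.

```python
def relative_pixel_count(resolution_wh, baseline_wh=(640, 640)):
    """Return how many times more pixels a resolution has than the baseline."""
    width, height = resolution_wh
    base_w, base_h = baseline_wh
    return (width * height) / (base_w * base_h)


for resolution in [(640, 640), (1280, 720), (1920, 1080)]:
    ratio = relative_pixel_count(resolution)
    print(f"{resolution[0]}x{resolution[1]}: {ratio:.2f}x the pixels of 640x640")
```

Under that assumption, 1280x720 processes 2.25 times as many pixels as 640x640, and full 1920x1080 about 5 times as many.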

Jan 09 '24 14:01 vilkkiE

Hello there! Any updates on this? I'm also struggling a lot with small object detection on my model

Feb 10 '24 01:02 icaroryan

We support a mode that disables image resizing: model.predict(image, skip_image_resizing=True). Relevant: https://github.com/Deci-AI/super-gradients/issues/1402
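Put together, a full-resolution prediction call might look like the sketch below. `predict_full_resolution` is a hypothetical wrapper; loading COCO weights via `pretrained_weights="coco"` and the 0.25 confidence threshold are assumptions for illustration, and the installed super-gradients version must already include the `skip_image_resizing` argument.

```python
import numpy as np


def predict_full_resolution(image: np.ndarray, model_arch: str = "yolo_nas_l", conf: float = 0.25):
    """Run YOLO-NAS on an image at its original resolution."""
    # Imported lazily so this module can be loaded without super-gradients installed.
    from super_gradients.training import models

    # Assumption: COCO pre-trained weights; swap in checkpoint_path for a fine-tuned model.
    model = models.get(model_arch, pretrained_weights="coco")
    # skip_image_resizing=True tells the pipeline to keep the original image size.
    return model.predict(image, conf=conf, skip_image_resizing=True)
```

For video, the model should be created once outside the per-frame function rather than on every call as in this simplified wrapper.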

Feb 12 '24 14:02 BloodAxe
