Add Additional Transformations for Image Augmentation, better Finetuning on Small Datasets

Open mkrupczak3 opened this issue 8 months ago • 24 comments

While RF-DETR performs excellently out-of-the-box with large labeled datasets, I have been hitting a wall on fine-tuning runs for RF-DETR on smaller datasets (~1000 images with 1-5 objects per image). I suspect that training in such cases may be significantly enhanced by the introduction of additional image augmentations to the training pipeline for this framework, such as random application of color hue shift, rotation, translation, scale, shear, and mosaic.

Without additional data augmentation, DETR-based frameworks in particular may struggle to finetune on small datasets.

In my past training runs with YOLO-based frameworks, color hue shift data augmentation has been particularly important for helping models generalize to the color variation of real-world objects.

@WongKinYiu's YOLOv7 repo has a good implementation of such data augmentation (though under the copyleft GPL-3.0 license): https://github.com/WongKinYiu/yolov7/blob/a207844b1ce82d204ab36d87d496728d3d2348e7/train.py#L636
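
For reference, the core of that HSV augmentation is just a random per-channel scaling in HSV space. A rough, untested sketch of the idea (not the YOLOv7 code itself; the gain values are only typical defaults):

import cv2
import numpy as np

def random_hsv_shift(img_bgr, h_gain=0.015, s_gain=0.7, v_gain=0.4):
    """Randomly scale hue, saturation and value, similar in spirit to the
    HSV augmentation used by YOLO-style trainers."""
    # draw a random gain around 1.0 for each channel
    r = np.random.uniform(-1, 1, 3) * [h_gain, s_gain, v_gain] + 1
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 0] = (hsv[..., 0] * r[0]) % 180           # hue wraps around (OpenCV range 0-179)
    hsv[..., 1] = np.clip(hsv[..., 1] * r[1], 0, 255)  # saturation
    hsv[..., 2] = np.clip(hsv[..., 2] * r[2], 0, 255)  # value / brightness
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)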

mkrupczak3 avatar Mar 30 '25 18:03 mkrupczak3

I’ve also tried fine-tuning on a small dataset (~200 images for a single class), but so far, I haven’t been able to successfully train a good model. I’m noticing a significant gap between training and validation mAP metrics, which likely points to overfitting. I believe augmentations could help a lot, especially if they are integrated into the DETR-based framework. So far, YOLOv11n and YOLOv11s have achieved great mAP performance for my task, but I have no experience with using transformers. Any suggestions? Maybe for these small datasets, even pre-trained transformers may not generalize well enough.

panagiotamoraiti avatar Mar 31 '25 11:03 panagiotamoraiti

Hi @mkrupczak3 and @panagiotamoraiti 👋🏻

Adding more advanced augmentations is definitely on our radar. Right now, we’re pretty much only doing horizontal flips, which in some cases might even hurt the model. We’re planning to start building a fully configurable augmentation set based on the albumentations library.

SkalskiP avatar Mar 31 '25 11:03 SkalskiP

There are some interesting reasons why YOLO models tend to use more augmentations on COCO, which also happens to be helpful for small datasets. We don't need lots of augmentations to be stable on COCO or to win on RF100-VL, but it would likely make us win by more :) we'll write support soon

isaacrob-roboflow avatar Mar 31 '25 15:03 isaacrob-roboflow

It makes more sense to simply provide a yaml template that defines albumentations augmentations, with the path passed as an argument. Everyone can then decide their own augmentations (it is use-case specific anyway). Otherwise, you will get a million requests from all sorts of people asking why you don't have this or that augmentation.

ogencoglu avatar Mar 31 '25 20:03 ogencoglu

makes sense! want to contribute that? otherwise we will get to it when we have bandwidth :)

isaacrob-roboflow avatar Mar 31 '25 22:03 isaacrob-roboflow

I'm interested in contributing! Could you provide more details on the intended functionality or any specific requirements?

panagiotamoraiti avatar Apr 01 '25 08:04 panagiotamoraiti

Something like: model.train(..., augmentations_val=None, augmentations_train='path/to/aug_train.yaml')

Claude 3.7, with the albumentations docs added as context in Cursor, could probably one-shot this.

ogencoglu avatar Apr 01 '25 09:04 ogencoglu

@ogencoglu what would that yaml look like? Is that a practice used in any open-source model training library?

SkalskiP avatar Apr 01 '25 16:04 SkalskiP

Something like

Dataset_1:
  HorizontalFlip:
      p: 0.5

  VerticalFlip:
      p: 0.5

  Affine:
    scale:
      - 0.4
      - 1
    balanced_scale: true
    keep_ratio: true
    cval:
      - 1
      - 1
      - 1
    cval_mask: 0
    interpolation: 3
    mask_interpolation: 0
    p: 0.5

  Rotate:
    limit:
      - -180
      - 180
    border_mode: 0
    value:
      - 1
      - 1
      - 1
    mask_value: 0
    p: 1

...

Dataset_2:
...

then you load the definitions:

        # load the YAML definitions above (e.g. with yaml.safe_load) into aug_dict first
        augmentations_train = {
            key: A.Compose([getattr(A, tran)(**kwargs) for tran, kwargs in value.items()])
            for key, value in aug_dict.items()
        }

and use in your DataLoader:

        dataloaders = {
            "train": torch.utils.data.DataLoader(
                YourDatasetClass(
                    transform=augmentations_train,
                     ...
                ),
                shuffle=True,
            ),
            "val": torch.utils.data.DataLoader(
...
        }

Nothing fancy. There are different ways to do it, but the idea is the same. This way, you can treat the yaml config as a set of hyperparameters instead of having them in code.

ogencoglu avatar Apr 01 '25 16:04 ogencoglu

@ogencoglu do you have a reference to a different open source library that uses this kind of framework?

isaacrob-roboflow avatar Apr 01 '25 22:04 isaacrob-roboflow

Currently, I'm using this script to generate augmented images for my datasets.

import os
import cv2
import albumentations as A
from tqdm import tqdm

# Directories
input_dir = r'train'
image_dir = os.path.join(input_dir, 'images')
label_dir = os.path.join(input_dir, 'labels')
output_dir = r'augmented'
output_image_dir = os.path.join(output_dir, 'images')
output_label_dir = os.path.join(output_dir, 'labels')

# Create the output directories if they do not exist
os.makedirs(output_image_dir, exist_ok=True)
os.makedirs(output_label_dir, exist_ok=True)

# Define augmentation pipeline using Albumentations
transform = A.Compose([
    # A.VerticalFlip(p=0.5),  # Flip vertically
    A.RandomBrightnessContrast(p=0.3),  # Adjust brightness and contrast
    A.RandomRain(p=0.1, rain_type="heavy", slant_range=(-30, -30)),
    A.Affine(p=0.3, scale=0.8, shear=5, translate_percent=0.05, rotate=15),
    # A.GaussianBlur(blur_limit=5, p=0.3),  # Apply Gaussian blur
    A.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.2, hue=0.1, p=0.3),  # Adjust color properties
    A.RandomShadow(p=0.3),  # Add random shadows
    A.HueSaturationValue(hue_shift_limit=20, sat_shift_limit=30, val_shift_limit=20, p=0.3),  # Adjust hue, saturation, and value
    # A.MotionBlur(blur_limit=5, p=0.2),  # Apply motion blur
    A.CLAHE(clip_limit=4.0, p=0.35),  # Apply CLAHE
    # A.Perspective(scale=(0.05, 0.4), p=0.2),  # Apply perspective transformation
    # A.ElasticTransform(alpha=1, sigma=50, p=0.2),  # Apply elastic deformation
    # A.RandomCrop(height=640, width=640, p=0.1),  # Random crop to a specific size
    # A.InvertImg(p=0.2),
    # A.Rotate(p=0.3, limit=(-20, 20))
], bbox_params=A.BboxParams(format='yolo', label_fields=['class_labels']))

# YOLO format assumes normalized coordinates between 0 and 1, so we need the image dimensions
def read_image_and_labels(image_path, label_dir):
    image_name = os.path.basename(image_path)
    label_name = image_name.replace('.jpg', '.txt').replace('.png', '.txt')
    label_path = os.path.join(label_dir, label_name)
    
    image = cv2.imread(image_path)
    h, w, _ = image.shape
    labels = []
    
    if os.path.exists(label_path):
        with open(label_path, 'r') as f:
            for line in f.readlines():
                parts = line.strip().split()
                cls = parts[0]  # class_id
                bbox = list(map(float, parts[1:]))  # YOLO format: (center_x, center_y, width, height) normalized
                labels.append((cls, bbox))
    return image, labels, h, w

def write_augmented_data(output_image_path, output_label_path, image, labels, img_h, img_w):
    # Save the augmented image
    cv2.imwrite(output_image_path, image)
    
    # Write augmented labels
    with open(output_label_path, 'w') as f:
        for cls, bbox in labels:
            # YOLO format: normalized (center_x, center_y, width, height)
            f.write(f"{cls} " + " ".join(map(str, bbox)) + "\n")

def apply_augmentation(image, labels, img_h, img_w):
    # Prepare bounding boxes and class labels for augmentation
    bboxes = []
    class_labels = []
    for cls, bbox in labels:
        bboxes.append(bbox)
        class_labels.append(cls)

    try:
        # Apply augmentation
        augmented = transform(image=image, bboxes=bboxes, class_labels=class_labels)
        aug_image = augmented['image']
        aug_bboxes = augmented['bboxes']
        aug_labels = list(zip(class_labels, aug_bboxes))  # Re-pair class labels with augmented bboxes

        return aug_image, aug_labels

    except Exception as e:
        print(f"An error occurred during augmentation: {e}")
        return None, None

# Process the images and apply augmentations
image_files = [f for f in os.listdir(image_dir) if f.endswith(('.jpg', '.png'))]

# Progress bar with tqdm
for image_file in tqdm(image_files, desc="Processing Images", unit="file"):
    image_path = os.path.join(image_dir, image_file)
    
    # Define output paths for augmented images and labels
    output_image_path = os.path.join(output_image_dir, 'aug_' + image_file)
    output_label_path = os.path.join(output_label_dir, 'aug_' + image_file.replace('.jpg', '.txt').replace('.png', '.txt'))
    
    # Read image and its YOLO bbox labels
    image, labels, img_h, img_w = read_image_and_labels(image_path, label_dir)
    
    # Apply augmentation
    aug_image, aug_labels = apply_augmentation(image, labels, img_h, img_w)

    if aug_labels is None:
        continue
    
    # Save augmented image and labels
    write_augmented_data(output_image_path, output_label_path, aug_image, aug_labels, img_h, img_w)

sctrueew avatar Apr 09 '25 07:04 sctrueew

I would recommend against augmenting images in an offline fashion. It is likely to be strictly less effective than doing it online
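
"Online" here means the transform is re-sampled inside the Dataset every time an image is loaded, so each epoch sees a different augmented version of every image, whereas offline augmentation freezes one augmented copy per image on disk. A rough, untested sketch of the online version (the dataset layout and transform are placeholders):

import albumentations as A
from torch.utils.data import Dataset

class AugmentedDetectionDataset(Dataset):
    """Placeholder dataset: a new random transform is drawn on every __getitem__ call."""

    def __init__(self, samples, transform):
        self.samples = samples      # list of (image, bboxes, class_labels) tuples
        self.transform = transform  # an A.Compose with bbox_params declared

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image, bboxes, class_labels = self.samples[idx]
        # re-sampled on every access, so each pass over the data sees new augmentations
        out = self.transform(image=image, bboxes=bboxes, class_labels=class_labels)
        return out["image"], out["bboxes"], out["class_labels"]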

isaacrob-roboflow avatar Apr 09 '25 15:04 isaacrob-roboflow

Good afternoon @ogencoglu, @isaacrob-roboflow, @SkalskiP, and @panagiotamoraiti,

Training RF-DETR-B/L on quite a large dataset yielded promising results, even after only 10 epochs! For my research, I'm planning to add more data augmentations, specifically tailored to object detection under occlusion.

Is this feature already "Work-In-Progress", or is it possible to work on it (in the fashion @ogencoglu suggested) in the upcoming months?

I first need to annotate quite a lot of new data before I can potentially start working on this feature.

DatSplit avatar Apr 14 '25 15:04 DatSplit

we're currently studying how certain augmentations affect large-scale training dynamics for the model, with the intention of releasing new and improved checkpoints using those and other changes in the near future.

however, from that perspective we don't necessarily intend to build a generic suite of augmentations. if you @DatSplit or @ogencoglu point us to another open source library that has a very convenient abstraction for piping in arbitrary augmentations, we can start working on shipping that. otherwise we're mainly interested in augmentations that we have reason to believe should meaningfully improve the model in general.

what are the augmentations that you're interested in?

isaacrob-roboflow avatar Apr 14 '25 16:04 isaacrob-roboflow

@isaacrob-roboflow I think we should add support for configurable augmentations in this repo.

SkalskiP avatar Apr 14 '25 16:04 SkalskiP

there are an infinite number of augmentations that could be of interest. I think we either need to pipe those augmentations from a third party, or limit the scope of what 'configurable augmentations' means

isaacrob-roboflow avatar Apr 14 '25 16:04 isaacrob-roboflow

I guess an assumption that I've had is that I want to use the same infrastructure other repos use to pipe in augmentations, and I was hoping someone could point me to such infrastructure. that may be an incorrect assumption. do you think we should spend time building in-house infra to pipe albumentations augmentations with some kind of configurable tooling @SkalskiP ?

my specific concern with supporting albumentations naively, by accepting a composed transform set, is that augmentations can change the correct labels for detection and, while albumentations has support for that, it is not immediately clear to me how to enforce that it is being done correctly. it feels non-trivial to ship something truly plug-and-play here
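
For what it's worth, albumentations does propagate boxes through its geometric transforms once bbox_params is declared, and it can drop boxes that end up mostly cropped away; whether that behaviour is correct enough for every transform is exactly the open question. A rough sketch of the relevant knobs (values here are arbitrary):

import albumentations as A

# Geometric transforms update the boxes; min_visibility / min_area drop boxes that are
# mostly cropped away, which is where silent label corruption tends to happen.
transform = A.Compose(
    [
        A.Affine(scale=(0.8, 1.2), rotate=(-15, 15), p=0.5),
        A.HorizontalFlip(p=0.5),
    ],
    bbox_params=A.BboxParams(
        format="coco",                 # or "yolo" / "pascal_voc"; must match your labels
        label_fields=["class_labels"],
        min_visibility=0.3,            # drop boxes with less than 30% of their area still visible
        min_area=16,                   # drop boxes smaller than 16 square pixels after the transform
    ),
)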

isaacrob-roboflow avatar Apr 14 '25 17:04 isaacrob-roboflow

Good evening @SkalskiP and @isaacrob-roboflow,

@SkalskiP That is an interesting and challenging study! For the current research project, the focus is on data augmentations that might improve robustness to increasing levels of occlusion, for example techniques such as CutOut, MixUp, and TransMix.

Unfortunately, I do not know of an open-source library that has such a convenient abstraction for piping in many types of augmentations, nor have I looked into it deeply enough. I'll look into such a library a bit during, but most likely after, the "data annotation" part of this project is done (end of April / start of May).

@isaacrob-roboflow That is indeed a valid point about the non-triviality of composed augmentations.

Lastly, I would like to thank you and the rest of the Roboflow team for the amazing work that you did on developing RF-DETR.

DatSplit avatar Apr 14 '25 21:04 DatSplit

I’ve worked with the mmengine library and specifically with mmdetection, which has a flexible pipeline system for augmentations. It also allows integrating custom transformations from the Albumentations library pretty smoothly.

If you're interested, here are a couple of useful links you might want to check out: https://mmengine.readthedocs.io/en/latest/advanced_tutorials/data_transform.html https://mmpretrain.readthedocs.io/en/dev/api/generated/mmpretrain.datasets.transforms.Albumentations.html

panagiotamoraiti avatar Apr 18 '25 15:04 panagiotamoraiti

I think what we're going to do is allow passing an albumentations pipeline, and also have a constructor for one that we know works out of the box. Am I remembering correctly @SkalskiP?

isaacrob-roboflow avatar Apr 21 '25 14:04 isaacrob-roboflow

Most training wrappers I've seen (huggingface, PyTorch Lightning) allow you to pass in a custom collate_fn, where you would define transforms. Then it's up to the end-user which libraries they pull in and how they're composed. So exposing collate_fn to the trainer function would be great.
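
For example, something roughly like this (untested sketch; the per-sample tuple layout is an assumption, and the trainer would only need to accept and forward the callable):

import albumentations as A

train_tf = A.Compose(
    [A.HorizontalFlip(p=0.5), A.RandomBrightnessContrast(p=0.3)],
    bbox_params=A.BboxParams(format="coco", label_fields=["class_labels"]),
)

def train_collate_fn(batch):
    """Apply the transform per sample, then batch; images and targets stay as lists
    because detection images can differ in size and box count."""
    images, targets = [], []
    for image, bboxes, class_labels in batch:
        out = train_tf(image=image, bboxes=bboxes, class_labels=class_labels)
        images.append(out["image"])
        targets.append({"boxes": out["bboxes"], "labels": out["class_labels"]})
    return images, targets

# loader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=train_collate_fn)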

Otherwise, for CLI, I don't think there's a standard way. The original RT-DETR repo has this interface for specifying transforms in a yaml file: https://github.com/lyuwenyu/RT-DETR/blob/a80dee362b5444c1af3e8bfcb965a8892112bfe2/rtdetrv2_pytorch/configs/rtdetrv2/include/dataloader.yml#L2-L38

The transforms are defined in a .py file and need to be registered to be picked up by the training script.

cduong-a avatar Jun 22 '25 00:06 cduong-a

Hello, I just opened a pull request about this issue. I implemented a custom wrapper to support augmentations from the Albumentations library. These augmentations are specified in a config file.

panagiotamoraiti avatar Jul 16 '25 12:07 panagiotamoraiti

Hi @mkrupczak3 and @panagiotamoraiti,

Following up on this discussion, I was wondering if you eventually succeeded in fine-tuning a good model on your small datasets?

I'm facing a similar challenge and would appreciate any advice or suggestions you might have for fine-tuning on small datasets.

Thanks a lot!

benny566 avatar Oct 12 '25 04:10 benny566

Hi @benny566,

I've done testing with a bunch of different training configs for RF-DETR; unfortunately, none have made a significant difference for small datasets. You may get better results by increasing the size of your dataset (I recommend the Tator tool by @stephansturges for labeling) or by using another framework with additional built-in augmentations, such as YOLOv9 or YOLOv11: https://github.com/WongKinYiu/yolov9 https://github.com/ultralytics/ultralytics

Also, if you are able to find or train weights on a similar object detection task beforehand, it will be much easier to fine-tune a model on your small dataset, in a process known as 'transfer learning'. Ideally, you want to find a model or dataset with objects that appear very similar to the ones in your own small dataset.
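
For example, something like the following (the checkpoint and dataset paths are placeholders, and the pretrain_weights argument name should be double-checked against the rfdetr docs for your version):

#!/usr/bin/env python3
from rfdetr import RFDETRNano

# Start from a checkpoint trained on a similar detection task instead of the
# default pretrained weights, then fine-tune on the small target dataset.
model = RFDETRNano(
    resolution=1120,
    pretrain_weights='/hpcuser/checkpoints/similar_task_checkpoint.pth',  # placeholder path
)

model.train(
    dataset_dir='/hpcuser/my-small-dataset-coco',  # placeholder path
    epochs=100,
    batch_size=2,
    grad_accum_steps=8,    # effective batch of 16 on a single GPU
    lr=1e-4,
    lr_scheduler='cosine',
    output_dir='/hpcuser/rf-detr/transfer_run',
)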

The only recommendation I have for RF-DETR specifically on small datasets is to try the cosine scheduler instead of the default step scheduler. With small datasets and the smaller variants of RF-DETR, the cosine scheduler helps training by smoothly transitioning to a lower LR after the initial exploration phase. This helps the model converge better in later epochs.
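
Concretely, with lr_min_factor=0.03 the LR decays smoothly from its initial value down to 3% of it over the course of training. The standard cosine-decay formula looks roughly like this (warmup omitted; this is the general formula, not necessarily RF-DETR's exact implementation):

import math

def cosine_lr(step, total_steps, base_lr, lr_min_factor=0.03):
    """Cosine decay from base_lr down to lr_min_factor * base_lr."""
    min_lr = lr_min_factor * base_lr
    progress = step / max(total_steps, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# e.g. base_lr=1.3e-4 starts at 1.3e-4 and ends near 3.9e-6 at the final step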

You also want to set the following values to True to enable the basic image augmentations RF-DETR is already capable of:

    multi_scale=True,
    expanded_scales=True,
    do_random_resize_via_padding=True,

In the screenshots below, the alfa models were trained with the step scheduler and the bravo models with cosine. The training dataset is the UAV Detection dataset from droneforge, with roughly 10,000 images:

(screenshots: training metric curves for the alfa and bravo runs)

As you can see from the graphs, cosine provides a slightly more stable and predictable training process for the smaller variants of RF-DETR on smaller datasets.

Below are some of my training scripts. They were written for a training rig with 4 GPUs; if you have fewer, you should bump up the grad_accum_steps value accordingly.

uav_detection_nano_bravo.py

#!/usr/bin/env python3
from rfdetr import RFDETRNano

model = RFDETRNano(resolution=1120)

model.train(
    dataset_dir='/hpcuser/uav-detection-v3i-coco',
    epochs=250,
    batch_size=2,
    grad_accum_steps=2,    # 2*2 * 4 GPU's = 16 effective batch size
    lr=1.3e-4,
    lr_encoder=8e-5,
    weight_decay=3e-4,
    dropout=0.05,
    #drop_path=0.02,
    lr_scheduler='cosine',
    warmup_epochs=5,
    lr_min_factor=0.03,
    multi_scale=True,
    expanded_scales=True,
    do_random_resize_via_padding=True,
    use_ema=True,
    ema_decay=0.9998,
    output_dir='/hpcuser/rf-detr/uav_detection_nano_bravo',
    tensorboard=True
)

uav_detection_small_bravo.py

#!/usr/bin/env python3
from rfdetr import RFDETRSmall

model = RFDETRSmall(resolution=1120)

model.train(
    dataset_dir='/hpcuser/uav-detection-v3i-coco',
    epochs=250,
    batch_size=2,
    grad_accum_steps=2,    # total ~16 with 4 GPU's
    lr=1.2e-4,             # small heads tolerate a touch higher LR
    lr_encoder=7e-5,
    weight_decay=3e-4,
    dropout=0.05,
    #drop_path=0.02,
    lr_scheduler='cosine',
    warmup_epochs=5,
    lr_min_factor=0.03,
    multi_scale=True,
    expanded_scales=True,
    do_random_resize_via_padding=True,
    use_ema=True,
    ema_decay=0.9998,
    output_dir='/hpcuser/rf-detr/uav_detection_small_bravo',
    tensorboard=True
)

uav_detection_medium_bravo.py

#!/usr/bin/env python3
from rfdetr import RFDETRMedium

# Slightly larger per-iter batch is OK on Medium
model = RFDETRMedium(resolution=1344)

model.train(
    dataset_dir='/hpcuser/uav-detection-v3i-coco',
    epochs=250,
    batch_size=2,
    grad_accum_steps=2,
    lr=1e-4,
    lr_encoder=6e-5,
    weight_decay=3e-4,
    dropout=0.05,
    lr_scheduler='cosine',
    warmup_epochs=5,
    lr_min_factor=0.03,
    multi_scale=True,
    expanded_scales=True,
    do_random_resize_via_padding=True,
    use_ema=True,
    ema_decay=0.9998,
    output_dir='/hpcuser/rf-detr/uav_detection_medium_bravo',
    tensorboard=True
)

uav_detection_large_bravo.py

#!/usr/bin/env python3
from rfdetr import RFDETRLarge

model = RFDETRLarge(resolution=1344)

model.train(
    dataset_dir='/hpcuser/uav-detection-v3i-coco',
    epochs=250,
    batch_size=1,
    grad_accum_steps=4,
    lr=1e-4,               # head & decoder
    lr_encoder=5e-5,       # backbone slower
    weight_decay=3e-4,
    dropout=0.05,
    lr_scheduler='cosine',
    warmup_epochs=5,
    lr_min_factor=0.03,    # 3% of initial at the tail
    multi_scale=True,
    expanded_scales=True,
    do_random_resize_via_padding=True,
    use_ema=True,
    ema_decay=0.9998,
    output_dir='/hpcuser/rf-detr/uav_detection_large_bravo',
    tensorboard=True
)

mkrupczak3 avatar Oct 13 '25 19:10 mkrupczak3