
Issue with training RF-DETR: Inference Tensors cannot be saved for backwards.

Open DanielCruz09 opened this issue 3 months ago • 10 comments

Search before asking

  • [x] I have searched the RF-DETR issues and found no similar bug report.

Bug

I am training the RF-DETR model on my custom dataset. I noticed two issues when I do this:

  1. The model seems to detect 0 classes from my dataset. I made sure to convert my dataset to COCO format and added annotations. Is there anything else I need to do to my dataset? (I noticed some of the other posts have also asked about this, but I haven't found a solution for this.)
  2. I also get this error when I train the model:
RuntimeError                              Traceback (most recent call last)
Cell In[42], line 7
      3 from rfdetr import RFDETRBase
      5 model = RFDETRBase(num_classes=4)
----> 7 model.train(
      8     dataset_dir='datasets/images_resized',
      9     train_ann_file='datasets/images_resized/train/_annotations.coco.json',
     10     valid_ann_file='datasets/images_resized/valid/_annotations.coco.json',
     11     epochs=10,
     12     batch_size=4,
     13     grad_accum_steps=4,
     14     lr=1e-4
     15 )

File ~\AppData\Roaming\Python\Python312\site-packages\rfdetr\detr.py:81, in RFDETR.train(self, **kwargs)
     77 """
     78 Train an RF-DETR model.
     79 """
     80 config = self.get_train_config(**kwargs)
---> 81 self.train_from_config(config, **kwargs)

File ~\AppData\Roaming\Python\Python312\site-packages\rfdetr\detr.py:187, in RFDETR.train_from_config(self, config, **kwargs)
    179     early_stopping_callback = EarlyStoppingCallback(
    180         model=self.model,
...
--> 549 return F.conv2d(
    550     input, weight, bias, self.stride, self.padding, self.dilation, self.groups
    551 )

RuntimeError: Inference tensors cannot be saved for backward. To work around you can make a clone to get a normal tensor and use it in autograd.

I cannot figure out how to fix this issue.
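
As a sanity check on the first point, here is a small sketch (using the same annotation path as the training call above) that prints what the COCO annotation file actually contains:

import json

# Sketch: confirm the COCO annotation file lists categories, images, and annotations,
# since the model appears to detect 0 classes.
with open('datasets/images_resized/train/_annotations.coco.json') as f:
    coco = json.load(f)

print('categories: ', [c['name'] for c in coco.get('categories', [])])
print('images:     ', len(coco.get('images', [])))
print('annotations:', len(coco.get('annotations', [])))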

Environment

  • RF-DETR: 1.2.1
  • OS: Microsoft Windows 11 Pro (64-bit)
  • Python: 3.13.5
  • PyTorch: 2.7.1
  • GPU: AMD Radeon Graphics

Minimal Reproducible Example

Note that I am using a custom dataset.

from rfdetr import RFDETRBase

model = RFDETRBase(num_classes=4)

model.train(
    dataset_dir='datasets/images_resized',
    train_ann_file='datasets/images_resized/train/_annotations.coco.json',
    valid_ann_file='datasets/images_resized/valid/_annotations.coco.json',
    epochs=10,
    batch_size=4,
    grad_accum_steps=4,
    lr=1e-4
)

Additional

No response

Are you willing to submit a PR?

  • [ ] Yes, I'd like to help by submitting a PR!

DanielCruz09 avatar Jul 31 '25 23:07 DanielCruz09

Weird! We haven't tested on Windows or AMD. Can you use an earlier version and see if it works?

isaacrob-roboflow avatar Aug 01 '25 15:08 isaacrob-roboflow

We also don't have access to those environments to test 😅

isaacrob-roboflow avatar Aug 01 '25 21:08 isaacrob-roboflow

@DanielCruz09 can you confirm it's actually using the GPU?

isaacrob-roboflow avatar Aug 06 '25 16:08 isaacrob-roboflow

@DanielCruz09 can you confirm it's actually using the GPU?

I do not think it was using the GPU; I encountered issues with using my GPU (this might be a Windows issue). Does the GPU affect backprop?

DanielCruz09 avatar Aug 06 '25 22:08 DanielCruz09

It shouldn't, but we're seeing other issues where this bug pops up in CPU training. I'm just trying to figure out whether you're seeing this bug because of a different issue or whether it's likely the same root cause.

isaacrob-roboflow avatar Aug 06 '25 22:08 isaacrob-roboflow

I think the issue may be in engine.py, where 'with torch.inference_mode():' is used during training under 'if args.multi_scale and not args.do_random_resize_via_padding'. I'm not 100% sure why, but changing it to 'with torch.no_grad():' seems to have fixed that error for me.
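
For anyone else hitting this, here is a minimal sketch (not the actual rf-detr engine.py code) of the difference: tensors created under torch.inference_mode() can never participate in autograd, while torch.no_grad() only disables gradient tracking inside the block, so the forward pass through a Conv2d reproduces the exact error from the traceback:

import torch

conv = torch.nn.Conv2d(3, 8, 3)  # parameters require grad, like the model during training

with torch.inference_mode():
    x = torch.randn(1, 3, 32, 32)  # x is an "inference tensor"

try:
    conv(x)  # fails during forward: autograd cannot save an inference tensor for backward
except RuntimeError as e:
    print(e)  # "Inference tensors cannot be saved for backward. ..."

with torch.no_grad():
    y = torch.randn(1, 3, 32, 32)  # y is a normal tensor, just created without grad tracking

conv(y).sum().backward()  # works: conv.weight.grad is populated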

S-Mahoney avatar Aug 17 '25 17:08 S-Mahoney

Good catch @S-Mahoney! Seems like a PyTorch bug. Can you submit a PR and explain how you tested it?

isaacrob-roboflow avatar Aug 18 '25 14:08 isaacrob-roboflow

I get the same issue trying to run it locally with MPS and CPU on a MacBook Pro M4.

JJjulle avatar Sep 11 '25 10:09 JJjulle

I ran into the same problem. The first epoch trains correctly, then the validation evaluation runs and the crash happens on the second epoch.

S-Mahoney's solution of changing 'with torch.inference_mode():' to 'with torch.no_grad():' fixes it. Strangely, the error happens even when args.multi_scale is False. I have torch==2.7.0+cu128 and am running the latest develop branch commit 9fd9789.

qrmt avatar Oct 10 '25 06:10 qrmt

Hi, we also have this issue. We ran it on a Windows machine as well as on JupyterLab instances on AWS (Linux), using cheaper CPU instances for initial testing. It crashes already on the first epoch. We tried to train both the Small and Medium models; both hit this error.

We used the inputs from the Google Colab documentation (using the classes from the module, e.g. rfdetr.RFDETRMedium()) and the same parameters for epochs and batch_size.

rfdetr=1.3.0 torch=2.8.0

Related issues which appear to be duplicates: #368 #367 #366

We'd love to test whether this is a feasible replacement for our own YOLO implementation to improve our AI performance and quality :)

edit

The deployment with a GPU attached works, so this is a CPU-only problem. For efficient training a GPU is needed anyway, but the error is very confusing, so even an additional error message stating that training on CPU is not supported would already help; a sketch of such a check is below.
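
Something like the following (purely a hypothetical sketch, not rf-detr code; the function name is made up) in the training entry point would turn the confusing autograd error into an explicit message when no supported GPU is found:

import torch

def assert_training_device():
    # Hypothetical guard: fail fast with a clear message instead of the
    # "Inference tensors cannot be saved for backward" error seen in CPU-only runs.
    if not torch.cuda.is_available():
        raise RuntimeError(
            "No CUDA device detected. CPU-only training currently fails with "
            "'Inference tensors cannot be saved for backward' (see this issue); "
            "please train on a machine with a supported GPU."
        )

assert_training_device()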

DaMuBo avatar Oct 14 '25 14:10 DaMuBo