rf-detr
Issue with training RF-DETR: Inference Tensors cannot be saved for backwards.
Search before asking
- [x] I have searched the RF-DETR issues and found no similar bug report.
Bug
I am training the RF-DETR model on my custom dataset. I noticed two issues when I do this:
- The model seems to detect 0 classes from my dataset. I made sure to convert my dataset to COCO format and add annotations. Is there anything else I need to do to my dataset? (I noticed some other posts have asked about this too, but I haven't found a solution.)
- I also get this error when I train the model:
RuntimeError                              Traceback (most recent call last)
Cell In[42], line 7
      3 from rfdetr import RFDETRBase
      5 model = RFDETRBase(num_classes=4)
----> 7 model.train(
      8     dataset_dir='datasets/images_resized',
      9     train_ann_file='datasets/images_resized/train/_annotations.coco.json',
     10     valid_ann_file='datasets/images_resized/valid/_annotations.coco.json',
     11     epochs=10,
     12     batch_size=4,
     13     grad_accum_steps=4,
     14     lr=1e-4
     15 )

File ~\AppData\Roaming\Python\Python312\site-packages\rfdetr\detr.py:81, in RFDETR.train(self, **kwargs)
     77 """
     78 Train an RF-DETR model.
     79 """
     80 config = self.get_train_config(**kwargs)
---> 81 self.train_from_config(config, **kwargs)

File ~\AppData\Roaming\Python\Python312\site-packages\rfdetr\detr.py:187, in RFDETR.train_from_config(self, config, **kwargs)
    179 early_stopping_callback = EarlyStoppingCallback(
    180     model=self.model,
    ...

File ~\anaconda3\Lib\site-packages\torch\nn\modules\conv.py:549
--> 549 return F.conv2d(
    550     input, weight, bias, self.stride, self.padding, self.dilation, self.groups
    551 )

RuntimeError: Inference tensors cannot be saved for backward. To work around you can make a clone to get a normal tensor and use it in autograd.
I cannot figure out how to fix this issue.
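For what it's worth, the underlying PyTorch message can be reproduced outside rf-detr with a tiny standalone snippet (my own sketch, not code from rf-detr):

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=3)  # weights require grad, as during training

with torch.inference_mode():
    x = torch.randn(1, 3, 32, 32)      # x becomes an "inference tensor"

try:
    y = conv(x)                        # autograd must save x for backward -> RuntimeError
except RuntimeError as e:
    print(e)                           # "Inference tensors cannot be saved for backward. ..."

y = conv(x.clone())                    # a clone made outside inference_mode is a normal tensor
y.sum().backward()                     # works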
Environment
- RF-DETR: 1.2.1
- OS: Microsoft Windows 11 Pro (64-bit)
- Python: Python 3.13.5
- PyTorch: 2.7.1
- GPU: AMD Radeon Graphics
Minimal Reproducible Example
Note that I am using a custom dataset.
from rfdetr import RFDETRBase
model = RFDETRBase(num_classes=4)
model.train(
dataset_dir='datasets/images_resized',
train_ann_file='datasets/images_resized/train/_annotations.coco.json',
valid_ann_file='datasets/images_resized/valid/_annotations.coco.json',
epochs=10,
batch_size=4,
grad_accum_steps=4,
lr=1e-4
)
Additional
No response
Are you willing to submit a PR?
- [ ] Yes, I'd like to help by submitting a PR!
Weird! We haven't tested on Windows or AMD. Can you use an earlier version and see if it works?
We also don't have access to those environments to test 😅
@DanielCruz09 can you confirm it's actually using the GPU?
> @DanielCruz09 can you confirm it's actually using the GPU?
I do not think it was using the GPU; I encountered issues with using my GPU (this might be a Windows issue). Does the GPU affect backprop?
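For reference, this is the quick check I'd run to see what device PyTorch can actually use (plain PyTorch, nothing rf-detr specific):

import torch

print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # False here means training falls back to CPU
print("Training device:", "cuda" if torch.cuda.is_available() else "cpu")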
It shouldn't, but we're seeing other issues where this bug pops up in CPU training. Just trying to figure out whether you're seeing this bug because of a different issue or whether it's likely the same root cause.
I think the issue may be in engine.py, where `with torch.inference_mode():` is used during training under `if args.multi_scale and not args.do_random_resize_via_padding:`. I'm not 100% sure why, but changing it to `with torch.no_grad():` seems to have fixed that error for me.
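Roughly, the change looks like this (a sketch of the affected region from memory, not the verbatim source; resize_batch stands in for whatever the multi-scale resize call is actually named):

# rfdetr/engine.py -- sketch, not verbatim
if args.multi_scale and not args.do_random_resize_via_padding:
    # Before: torch.inference_mode() marks everything created inside as an
    # "inference tensor", which later crashes autograd with
    # "Inference tensors cannot be saved for backward."
    #
    # After: torch.no_grad() also skips gradient tracking for the resize,
    # but the resulting tensors are normal tensors that autograd can save.
    with torch.no_grad():
        samples = resize_batch(samples)  # hypothetical name for the resize step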
Good catch @S-Mahoney! Seems like a PyTorch bug. Can you submit a PR and explain how you tested it?
I get the same issue trying to run it locally with MPS and CPU on a MacBook Pro M4.
I ran into the same problem. The first epoch trains correctly, then val evaluation runs and the crash happens on the second epoch.
S-Mahoney's solution of changing `with torch.inference_mode():` to `with torch.no_grad():` fixes it. Strangely, the error happens even if args.multi_scale is False. I have torch==2.7.0+cu128 and am running the latest develop branch, commit 9fd9789.
Hi, we also have this issue. We ran it on a Windows machine as well as on JupyterLab instances on AWS (Linux), using cheaper CPU instances for testing first. It crashes already in the first epoch. We tried to train the Small and Medium models; both got this error.
We used the inputs as in the documentation on Google Colab (using the classes from the module, like rfdetr.RFDETRMedium()) and the same input parameters for epochs and batch_size.
rfdetr==1.3.0, torch==2.8.0
Also, related issues which seem to be duplicates: #368 #367 #366
We'd love to test whether this is a feasible replacement for our own YOLO implementation and could increase our AI performance and quality :)
Edit
The deployment with a GPU attached works, so this is a CPU-only problem. A GPU is needed for efficient training anyway, but the error is very confusing, so an additional error message stating that training on CPU is not supported would already help, something along the lines of the sketch below.
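Purely as an illustration (this is not the actual rf-detr API, just the kind of guard we mean):

import torch

def assert_training_device_available():
    # Hypothetical guard: fail early with a clear message instead of the
    # opaque "Inference tensors cannot be saved for backward" crash.
    if not torch.cuda.is_available():
        raise RuntimeError(
            "RF-DETR training currently requires a GPU; "
            "CPU-only training is not supported."
        )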