Error when resuming training from a checkpoint
Search before asking
- [x] I have searched the RF-DETR issues and found no similar bug report.
Bug
When using `model.train` with the `resume` argument and a checkpoint path, an error occurs at line 483 of `main.py`:

```
  File "/usr/local/lib/python3.12/site-packages/rfdetr/detr.py", line 83, in train
    self.train_from_config(config, **kwargs)
  File "/usr/local/lib/python3.12/site-packages/rfdetr/detr.py", line 191, in train_from_config
    self.model.train(
  File "/usr/local/lib/python3.12/site-packages/rfdetr/main.py", line 483, in train
    results = test_stats["results_json"]
              ^^^^^^^^^^
UnboundLocalError: cannot access local variable 'test_stats' where it is not associated with a value
```
Environment
- RF-DETR: 1.3.0
- OS: Modal / debian slim
- Python: 3.12.12
- GPU: Nvidia H100
Minimal Reproducible Example
Run a short training first; I only tried this with the segmentation model:
```python
from rfdetr import RFDETRSegPreview

model = RFDETRSegPreview()
model.train(
    dataset_dir="DATASET_DIR",
    epochs=2,
    batch_size=1,
    grad_accum_steps=1,
    lr=1e-4,
    output_dir="OUTPUT_DIR"
)
```
Then try to continue the training from the latest checkpoint. Note that the same happens if you try to continue from an epoch checkpoint.
```python
model = RFDETRSegPreview()
model.train(
    dataset_dir="DATASET_DIR",
    epochs=2,
    batch_size=1,
    grad_accum_steps=1,
    lr=1e-4,
    output_dir="OUTPUT_DIR",
    resume="OUTPUT_DIR/checkpoint.pth"
)
```
Additional
It looks like the issue might be with lines 297 to 304 of `main.py`:
```python
if not args.eval and 'optimizer' in checkpoint and 'lr_scheduler' in checkpoint and 'epoch' in checkpoint:
    optimizer.load_state_dict(checkpoint['optimizer'])
    lr_scheduler.load_state_dict(checkpoint['lr_scheduler'])
    args.start_epoch = checkpoint['epoch'] + 1
if args.eval:
    test_stats, coco_evaluator = evaluate(
        model, criterion, postprocess, data_loader_val, base_ds, device, args)
```
`args.eval` is `False` by default, so the `test_stats` variable never gets populated. If I pass `True`, the error goes away, but that is not correct either: the optimizer state is not loaded and no training happens at all; as far as I can tell it only evaluates the model on the test set.
Maybe the `evaluate(...)` call and assignments should happen regardless of `args.eval`?
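For illustration, here is a minimal sketch of what I mean, reusing the variable names from the snippet above (this is not a tested patch): bind `test_stats` up front so the later access cannot fail, and keep the optimizer/scheduler restore on the non-eval path.

```python
# Sketch only -- assumes the surrounding names from rfdetr/main.py, not a tested patch.
test_stats = None  # make sure the name is always bound

if not args.eval and 'optimizer' in checkpoint and 'lr_scheduler' in checkpoint and 'epoch' in checkpoint:
    optimizer.load_state_dict(checkpoint['optimizer'])
    lr_scheduler.load_state_dict(checkpoint['lr_scheduler'])
    args.start_epoch = checkpoint['epoch'] + 1

if args.eval:
    test_stats, coco_evaluator = evaluate(
        model, criterion, postprocess, data_loader_val, base_ds, device, args)

# ... training loop runs here and may reassign test_stats ...

# then, around line 483, guard the access instead of assuming it exists
results = test_stats["results_json"] if test_stats is not None else None
```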
Are you willing to submit a PR?
- [x] Yes, I'd like to help by submitting a PR!
I encountered the same bug, did you find any solution?
I add "device=device" in model.train() to resume training and the error has gone.
After digging into this, the root cause is not actually `args.eval`, but a resume + `epochs` logic bug.
When resuming, RF-DETR does:

```python
args.start_epoch = checkpoint["epoch"] + 1
```
If the user resumes from a checkpoint whose epoch is ≥ `epochs`, the training loop never runs even once. For example:

`checkpoint epoch = 1`, `epochs = 2` → `start_epoch = 2` → `range(start_epoch, epochs)` is empty
In this case:
- no training step runs
- no validation/test runs
- `test_stats` is never created
However, at the end of `train()` RF-DETR still unconditionally executes:

```python
results = test_stats["results_json"]
```

which causes:

```
UnboundLocalError: cannot access local variable 'test_stats' where it is not associated with a value
```
This is why:
- the first training run works
- resuming with a small `epochs` value crashes
- setting `eval=True` “fixes” it (because it forces `evaluate()` to run), but that is not correct since it skips training and optimizer loading
In RF-DETR, `epochs` currently means total epochs, not epochs to run after resume. So when resuming from a checkpoint saved at epoch N, `epochs` must be at least N + 2 for any training to run (e.g. resume from a checkpoint at epoch 115 → `epochs` ≥ 117).
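The failure mode is easy to see in isolation. The toy script below is not RF-DETR code; it just mirrors the structure of `train()` in `main.py` with made-up names:

```python
# Toy reproduction: when the epoch range is empty and eval is off,
# test_stats is never assigned before it is read.
def train(start_epoch, epochs, eval_only=False):
    for epoch in range(start_epoch, epochs):              # empty when start_epoch >= epochs
        test_stats = {"results_json": {"epoch": epoch}}   # stands in for evaluate()
    if eval_only:
        test_stats = {"results_json": {"eval": True}}
    return test_stats["results_json"]                     # UnboundLocalError if never assigned

train(start_epoch=0, epochs=2)   # fine: the loop runs, so test_stats is bound
train(start_epoch=2, epochs=2)   # UnboundLocalError, same as resuming from the last epoch
```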
Suggested fixes
Either of the following would resolve this cleanly:
1. Guard access to `test_stats`:

```python
if "test_stats" in locals():
    results = test_stats["results_json"]
```
2. Detect the invalid resume configuration early:

```python
if args.start_epoch >= args.epochs:
    logger.warning("start_epoch >= epochs, no training will run")
    return
```
Either change would prevent the crash and make the behavior much clearer for users.
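In the meantime, the practical workaround that follows from the above is to set `epochs` past the checkpoint's epoch when resuming. A sketch based on the reproduction earlier in this issue (dataset and output paths are placeholders, and the exact epoch count is up to you):

```python
from rfdetr import RFDETRSegPreview

# The first run used epochs=2, so the checkpoint was saved at epoch 1 and
# start_epoch becomes 2 on resume; epochs must exceed start_epoch to train.
model = RFDETRSegPreview()
model.train(
    dataset_dir="DATASET_DIR",
    epochs=4,                 # total epochs, not additional epochs after resume
    batch_size=1,
    grad_accum_steps=1,
    lr=1e-4,
    output_dir="OUTPUT_DIR",
    resume="OUTPUT_DIR/checkpoint.pth",
)
```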