
Error in resuming training from checkpoint

Open ivdorelian opened this issue 1 month ago • 2 comments

Search before asking

  • [x] I have searched the RF-DETR issues and found no similar bug report.

Bug

When using model.train with the resume argument and a checkpoint path, an error occurs at line 483 of main.py:

  File "/usr/local/lib/python3.12/site-packages/rfdetr/detr.py", line 83, in train
    self.train_from_config(config, **kwargs)
  File "/usr/local/lib/python3.12/site-packages/rfdetr/detr.py", line 191, in train_from_config
    self.model.train(
  File "/usr/local/lib/python3.12/site-packages/rfdetr/main.py", line 483, in train
    results = test_stats["results_json"]
              ^^^^^^^^^^
UnboundLocalError: cannot access local variable 'test_stats' where it is not associated with a value

Environment

  • RF-DETR: 1.3.0
  • OS: Modal / debian slim
  • Python: 3.12.12
  • GPU: Nvidia H100

Minimal Reproducible Example

Run some training (I only tried with the segmentation model):

from rfdetr import RFDETRSegPreview

model = RFDETRSegPreview()

model.train(
    dataset_dir="DATASET_DIR",
    epochs=2,
    batch_size=1,
    grad_accum_steps=1,
    lr=1e-4,
    output_dir="OUTPUT_DIR"
    )

Then try to continue the training from the latest checkpoint. Note that the same happens if you try to continue from an epoch checkpoint.

model = RFDETRSegPreview()

model.train(
    dataset_dir="DATASET_DIR",
    epochs=2,
    batch_size=1,
    grad_accum_steps=1,
    lr=1e-4,
    output_dir="OUTPUT_DIR",
    resume="OUTPUT_DIR/checkpoint.pth"
    )

Additional

It looks like the issue might be with lines 297 to 304 of main.py:

            if not args.eval and 'optimizer' in checkpoint and 'lr_scheduler' in checkpoint and 'epoch' in checkpoint:                
                optimizer.load_state_dict(checkpoint['optimizer'])
                lr_scheduler.load_state_dict(checkpoint['lr_scheduler'])
                args.start_epoch = checkpoint['epoch'] + 1

        if args.eval:
            test_stats, coco_evaluator = evaluate(
                model, criterion, postprocess, data_loader_val, base_ds, device, args)

args.eval is False by default, so the test_stats variable never gets populated. If I pass True, the error goes away, but that's not correct either: the optimizer state isn't loaded and no training happens at all; as far as I can tell it only evaluates the model on the test set.

Maybe the evaluate(...) call and assignments should happen regardless of args.eval?
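For illustration, a minimal sketch of that idea, using only the names that appear in the excerpt above (whether this is the right place for the call, and what eval-only mode should then do, are assumptions on my part):

if not args.eval and 'optimizer' in checkpoint and 'lr_scheduler' in checkpoint and 'epoch' in checkpoint:
    optimizer.load_state_dict(checkpoint['optimizer'])
    lr_scheduler.load_state_dict(checkpoint['lr_scheduler'])
    args.start_epoch = checkpoint['epoch'] + 1

# sketch: run the evaluation unconditionally so test_stats is always bound,
# instead of only when args.eval is set
test_stats, coco_evaluator = evaluate(
    model, criterion, postprocess, data_loader_val, base_ds, device, args)
if args.eval:
    return  # eval-only mode would stop here (assumption about intended behavior)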

Are you willing to submit a PR?

  • [x] Yes, I'd like to help by submitting a PR!

ivdorelian avatar Nov 16 '25 18:11 ivdorelian

I encountered the same bug, did you find any solution?

kkkkken33 avatar Nov 18 '25 06:11 kkkkken33

I encountered the same bug, did you find any solution?

I add "device=device" in model.train() to resume training and the error has gone.

kkkkken33 avatar Nov 18 '25 06:11 kkkkken33

After digging into this, the root cause is not actually args.eval, but a resume + epochs logic bug.

When resuming, RF-DETR does:

args.start_epoch = checkpoint["epoch"] + 1

If the user resumes from a checkpoint whose epoch is ≥ epochs, the training loop never runs even once. For example:

checkpoint epoch = 1, epochs = 2 → start_epoch = 2 → range(start_epoch, epochs) is empty

In this case:

  • no training step runs
  • no validation/test runs
  • test_stats is never created

However, at the end of train() RF-DETR still unconditionally executes:

results = test_stats["results_json"]

which causes:

UnboundLocalError: cannot access local variable 'test_stats' where it is not associated with a value
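The failure mode is easy to reproduce in isolation (a standalone sketch, not RF-DETR code; the names are made up):

def train_like(start_epoch, epochs):
    # test_stats is only assigned inside the epoch loop, mirroring main.py,
    # then read unconditionally after the loop
    for epoch in range(start_epoch, epochs):
        test_stats = {"results_json": f"stats for epoch {epoch}"}
    return test_stats["results_json"]

train_like(0, 2)  # fine: the loop runs, so test_stats is bound
train_like(2, 2)  # UnboundLocalError: the loop body never executes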

This is why:

  • the first training run works
  • resuming with a small epochs value crashes
  • setting eval=True “fixes” it (because it forces evaluate() to run), but that is not correct since it skips training and optimizer loading

In RF-DETR, epochs currently means total epochs, not epochs to run after resume. So when resuming from a checkpoint saved at epoch N, epochs must be greater than N + 1 for any training to happen (e.g. resume from a checkpoint at epoch 115 → epochs ≥ 117).
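Under those semantics, a working resume of the MRE above might look like this (a sketch; epochs=4 is just an illustrative value larger than checkpoint epoch + 1):

from rfdetr import RFDETRSegPreview

model = RFDETRSegPreview()

# The first run (epochs=2) trained epochs 0 and 1, so checkpoint.pth stores
# epoch=1 and start_epoch becomes 2 on resume. A larger total such as
# epochs=4 keeps range(2, 4) non-empty, so two more epochs actually run.
model.train(
    dataset_dir="DATASET_DIR",
    epochs=4,
    batch_size=1,
    grad_accum_steps=1,
    lr=1e-4,
    output_dir="OUTPUT_DIR",
    resume="OUTPUT_DIR/checkpoint.pth",
    )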

Suggested fixes

Any of the following would resolve this cleanly:

Guard access to test_stats:

if "test_stats" in locals(): results = test_stats["results_json"]

Detect invalid resume configuration early:

if args.start_epoch >= args.epochs:
    logger.warning("start_epoch >= epochs, no training will run")
    return

This would prevent the crash and make the behavior much clearer for users.

AhaggachHamid avatar Dec 12 '25 15:12 AhaggachHamid