Distributed Training Fails at End: FileNotFoundError and State Dict Mismatch Issues
Bug Report: Distributed Training Crashes at End with File Access and State Dict Issues
Description
When running distributed training with multiple GPUs, the training completes successfully but crashes at the very end during the test phase (run_test=True) with two critical errors:
- FileNotFoundError: checkpoint_best_total.pth file not found
- RuntimeError: state dict key mismatch when loading the checkpoint into the DistributedDataParallel model
Error Messages
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/hehao/桌面/rf-detr/train.py", line 5, in <module>
[rank1]: model.train(
[rank1]: File "/home/hehao/桌面/rf-detr/rfdetr/detr.py", line 81, in train
[rank1]: self.train_from_config(config, **kwargs)
[rank1]: File "/home/hehao/桌面/rf-detr/rfdetr/detr.py", line 187, in train_from_config
[rank1]: self.model.train(
[rank1]: File "/home/hehao/桌面/rf-detr/rfdetr/main.py", line 483, in train
[rank1]: best_state_dict = torch.load(output_dir / 'checkpoint_best_total.pth', map_location='cpu', weights_only=False)['model']
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/torch/serialization.py", line 1484, in load
[rank1]: with _open_file_like(f, "rb") as opened_file:
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/torch/serialization.py", line 759, in _open_file_like
[rank1]: return _open_file(name_or_buffer, mode)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/torch/serialization.py", line 740, in __init__
[rank1]: super().__init__(open(name, mode))
[rank1]: ^^^^^^^^^^^^^^^^
[rank1]: FileNotFoundError: [Errno 2] No such file or directory: 'output/checkpoint_best_total.pth'
Training time 7:02:31
Results saved to output/results.json
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/hehao/桌面/rf-detr/train.py", line 5, in <module>
[rank0]: model.train(
[rank0]: File "/home/hehao/桌面/rf-detr/rfdetr/detr.py", line 81, in train
[rank0]: self.train_from_config(config, **kwargs)
[rank0]: File "/home/hehao/桌面/rf-detr/rfdetr/detr.py", line 187, in train_from_config
[rank0]: self.model.train(
[rank0]: File "/home/hehao/桌面/rf-detr/rfdetr/main.py", line 484, in train
[rank0]: model.load_state_dict(best_state_dict)
[rank0]: File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 2624, in load_state_dict
[rank0]: raise RuntimeError(
[rank0]: RuntimeError: Error(s) in loading state_dict for DistributedDataParallel:
[rank0]: Missing key(s) in state_dict: "module.transformer.decoder.layers.0.self_attn.in_proj_weight", ...(all of them)
[rank0]: Unexpected key(s) in state_dict: ...(all of them)
[rank0]:[W813 02:04:34.428568154 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0813 02:04:35.395000 366050 .venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 366140 closing signal SIGTERM
E0813 02:04:35.663000 366050 .venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 1 (pid: 366141) of binary: /home/hehao/桌面/rf-detr/.venv/bin/python
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/torch/distributed/launch.py", line 207, in <module>
main()
File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/typing_extensions.py", line 2956, in wrapper
return arg(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^
File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/torch/distributed/launch.py", line 203, in main
launch(args)
File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/torch/distributed/launch.py", line 188, in launch
run(args)
File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-08-13_02:04:35
host : hehao-Precision-7920-Tower
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 366141)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Root Cause Analysis
1. Race Condition in Distributed Training
The issue occurs because in distributed training:
- Only the main process (rank 0) creates and writes the checkpoint_best_total.pth file
- The other processes immediately proceed to the run_test phase without waiting
- Non-main processes try to read the file before it exists, causing the FileNotFoundError (see the sketch after this list)
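A minimal sketch of the kind of synchronization that appears to be missing here, assuming torch.distributed has already been initialized by the launcher; the helper name is made up for illustration and is not the actual rf-detr code:

```python
import torch
import torch.distributed as dist

def load_best_checkpoint_for_test(output_dir):
    """Wait for rank 0 to finish writing checkpoint_best_total.pth, then load it."""
    if dist.is_available() and dist.is_initialized():
        # Rank 0 should only reach this barrier after it has saved the
        # checkpoint; the other ranks block here until that happens, so the
        # file is guaranteed to exist when they call torch.load below.
        dist.barrier()
    ckpt = torch.load(
        output_dir / "checkpoint_best_total.pth",
        map_location="cpu",
        weights_only=False,
    )
    return ckpt["model"]
```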
2. DistributedDataParallel State Dict Mismatch
When loading the state dict in run_test phase:
- The live model is wrapped in DistributedDataParallel during distributed training
- The checkpoint contains model weights without the module. prefix
- Loading fails because the code doesn't account for the DDP wrapper when calling load_state_dict (a sketch of a possible fix follows)
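One way to handle the wrapper is to load the plain (un-prefixed) state dict into model.module when the model is DDP-wrapped. This is an illustrative sketch, not the code rf-detr ships:

```python
from torch.nn.parallel import DistributedDataParallel as DDP

def load_into_possibly_wrapped_model(model, state_dict):
    """Load an un-prefixed state dict whether or not `model` is DDP-wrapped."""
    if isinstance(model, DDP):
        # DDP keeps the real network under .module, whose parameter names
        # match the un-prefixed keys saved in the checkpoint.
        model.module.load_state_dict(state_dict)
    else:
        model.load_state_dict(state_dict)
```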
Environment
System Information:
- OS: Ubuntu 22.04.4 LTS (Linux 5.15.0-151-generic)
- Architecture: x86_64
- Python: 3.12.11
- Package Manager: uv 0.7.17
Hardware:
- GPUs: 2x NVIDIA GeForce RTX 3090 (24GB each)
- NVIDIA Driver: 535.247.01
- CUDA: 12.8.90
Key Dependencies:
- torch: 2.8.0
- torchvision: 0.23.0
- transformers: 4.55.0
- peft: 0.17.0
- numpy: 2.3.2
- opencv-python: 4.11.0.86
- pillow: 11.3.0
Steps to Reproduce
- Set up a distributed training environment with multiple GPUs
- Run training with run_test=True (the default), e.g. via the sketch below
- Wait for training to complete
- Observe the crash during the final test phase
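For context, a rough reconstruction of the setup implied by the traceback (train.py calls model.train at line 5 and the job is started via torch.distributed.launch); the class name, argument names, and values are assumptions, not the reporter's exact script:

```python
# train.py -- approximate reconstruction, placeholders only.
# Launched with something like:
#   python -m torch.distributed.launch --nproc_per_node=2 train.py
from rfdetr import RFDETRBase  # model class name assumed

model = RFDETRBase()
model.train(
    dataset_dir="path/to/coco-style/dataset",
    epochs=50,
    batch_size=4,
    output_dir="output",
    run_test=True,  # the default; the crash happens in this final test phase
)
```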
Impact
This bug prevents successful completion of distributed training runs, making it impossible to:
- Get final test results after training
- Properly validate model performance on test set
- Complete the full training pipeline
Additional Notes
The training itself completes successfully and produces valid checkpoints. The issue only occurs during the final test evaluation phase, suggesting the core training logic is sound but the post-training evaluation needs better distributed coordination.
Nice debugging! Yeah I think in our research repo there's a sync that happens, guess it didn't make it here. Feel free to submit a PR?
A workaround is to run the eval after training. Seems like that should work?
I have been getting the same error message when I do distributed training on AWS and can't get the validation result. It would be nice if there were support for validation only, i.e.
model.eval(.....)
where you pass the directory to your test/valid split.
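Something along these lines (purely hypothetical; none of these methods or arguments exist in rf-detr today, this only illustrates the request):

```python
# Hypothetical interface sketch -- not an existing rf-detr API.
from rfdetr import RFDETRBase  # class name assumed

model = RFDETRBase(pretrain_weights="output/checkpoint_best_total.pth")  # argument name assumed
metrics = model.evaluate(            # hypothetical method
    dataset_dir="path/to/dataset",   # directory containing the test/valid split
    split="test",
)
print(metrics)
```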