Distributed Training Fails at End: FileNotFoundError and State Dict Mismatch Issues
Bug Report: Distributed Training Crashes at End with File Access and State Dict Issues
Description
When running distributed training with multiple GPUs, the training completes successfully but crashes at the very end during the test phase (run_test=True) with two critical errors:
- FileNotFoundError: checkpoint_best_total.pth file not found
- RuntimeError: state dict key mismatch when loading the checkpoint into the DistributedDataParallel model
Error Messages
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/hehao/桌面/rf-detr/train.py", line 5, in <module>
[rank1]: model.train(
[rank1]: File "/home/hehao/桌面/rf-detr/rfdetr/detr.py", line 81, in train
[rank1]: self.train_from_config(config, **kwargs)
[rank1]: File "/home/hehao/桌面/rf-detr/rfdetr/detr.py", line 187, in train_from_config
[rank1]: self.model.train(
[rank1]: File "/home/hehao/桌面/rf-detr/rfdetr/main.py", line 483, in train
[rank1]: best_state_dict = torch.load(output_dir / 'checkpoint_best_total.pth', map_location='cpu', weights_only=False)['model']
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/torch/serialization.py", line 1484, in load
[rank1]: with _open_file_like(f, "rb") as opened_file:
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/torch/serialization.py", line 759, in _open_file_like
[rank1]: return _open_file(name_or_buffer, mode)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/torch/serialization.py", line 740, in __init__
[rank1]: super().__init__(open(name, mode))
[rank1]: ^^^^^^^^^^^^^^^^
[rank1]: FileNotFoundError: [Errno 2] No such file or directory: 'output/checkpoint_best_total.pth'
Training time 7:02:31
Results saved to output/results.json
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/hehao/桌面/rf-detr/train.py", line 5, in <module>
[rank0]: model.train(
[rank0]: File "/home/hehao/桌面/rf-detr/rfdetr/detr.py", line 81, in train
[rank0]: self.train_from_config(config, **kwargs)
[rank0]: File "/home/hehao/桌面/rf-detr/rfdetr/detr.py", line 187, in train_from_config
[rank0]: self.model.train(
[rank0]: File "/home/hehao/桌面/rf-detr/rfdetr/main.py", line 484, in train
[rank0]: model.load_state_dict(best_state_dict)
[rank0]: File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 2624, in load_state_dict
[rank0]: raise RuntimeError(
[rank0]: RuntimeError: Error(s) in loading state_dict for DistributedDataParallel:
[rank0]: Missing key(s) in state_dict: "module.transformer.decoder.layers.0.self_attn.in_proj_weight", ...(all of them)
[rank0]: Unexpected key(s) in state_dict: ...(all of them)
[rank0]:[W813 02:04:34.428568154 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0813 02:04:35.395000 366050 .venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 366140 closing signal SIGTERM
E0813 02:04:35.663000 366050 .venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 1 (pid: 366141) of binary: /home/hehao/桌面/rf-detr/.venv/bin/python
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/torch/distributed/launch.py", line 207, in <module>
main()
File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/typing_extensions.py", line 2956, in wrapper
return arg(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^
File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/torch/distributed/launch.py", line 203, in main
launch(args)
File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/torch/distributed/launch.py", line 188, in launch
run(args)
File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hehao/桌面/rf-detr/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-08-13_02:04:35
host : hehao-Precision-7920-Tower
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 366141)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Root Cause Analysis
1. Race Condition in Distributed Training
The issue occurs because in distributed training:
- Only the main process (rank 0) creates and writes the checkpoint_best_total.pth file
- The other processes immediately proceed to the run_test phase without waiting
- Non-main processes try to read the file before it exists, causing the FileNotFoundError (see the sketch after this list)
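A minimal sketch of the kind of synchronization that appears to be missing here, assuming torch.distributed has already been initialized by the launcher; the helper name is made up for illustration and is not the actual rf-detr code:

```python
import torch
import torch.distributed as dist

def load_best_checkpoint_for_test(output_dir):
    """Wait for rank 0 to finish writing checkpoint_best_total.pth, then load it."""
    if dist.is_available() and dist.is_initialized():
        # Rank 0 should only reach this barrier after it has saved the
        # checkpoint; the other ranks block here until that happens, so the
        # file is guaranteed to exist when they call torch.load below.
        dist.barrier()
    ckpt = torch.load(
        output_dir / "checkpoint_best_total.pth",
        map_location="cpu",
        weights_only=False,
    )
    return ckpt["model"]
```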
2. DistributedDataParallel State Dict Mismatch
When loading the state dict in run_test phase:
- The live model is wrapped in DistributedDataParallel during distributed training
- The checkpoint contains model weights without the module. prefix
- Loading fails because the code doesn't account for the DDP wrapper when calling load_state_dict (a sketch of a possible fix follows)
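One way to handle the wrapper is to load the plain (un-prefixed) state dict into model.module when the model is DDP-wrapped. This is an illustrative sketch, not the code rf-detr ships:

```python
from torch.nn.parallel import DistributedDataParallel as DDP

def load_into_possibly_wrapped_model(model, state_dict):
    """Load an un-prefixed state dict whether or not `model` is DDP-wrapped."""
    if isinstance(model, DDP):
        # DDP keeps the real network under .module, whose parameter names
        # match the un-prefixed keys saved in the checkpoint.
        model.module.load_state_dict(state_dict)
    else:
        model.load_state_dict(state_dict)
```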
Environment
System Information:
- OS: Ubuntu 22.04.4 LTS (Linux 5.15.0-151-generic)
- Architecture: x86_64
- Python: 3.12.11
- Package Manager: uv 0.7.17
Hardware:
- GPUs: 2x NVIDIA GeForce RTX 3090 (24GB each)
- NVIDIA Driver: 535.247.01
- CUDA: 12.8.90
Key Dependencies:
- torch: 2.8.0
- torchvision: 0.23.0
- transformers: 4.55.0
- peft: 0.17.0
- numpy: 2.3.2
- opencv-python: 4.11.0.86
- pillow: 11.3.0
Steps to Reproduce
- Set up a distributed training environment with multiple GPUs
- Run training with run_test=True (the default), e.g. via the sketch below
- Wait for training to complete
- Observe the crash during the final test phase
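For context, a rough reconstruction of the setup implied by the traceback (train.py calls model.train at line 5 and the job is started via torch.distributed.launch); the class name, argument names, and values are assumptions, not the reporter's exact script:

```python
# train.py -- approximate reconstruction, placeholders only.
# Launched with something like:
#   python -m torch.distributed.launch --nproc_per_node=2 train.py
from rfdetr import RFDETRBase  # model class name assumed

model = RFDETRBase()
model.train(
    dataset_dir="path/to/coco-style/dataset",
    epochs=50,
    batch_size=4,
    output_dir="output",
    run_test=True,  # the default; the crash happens in this final test phase
)
```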
Impact
This bug prevents successful completion of distributed training runs, making it impossible to:
- Get final test results after training
- Properly validate model performance on test set
- Complete the full training pipeline
Additional Notes
The training itself completes successfully and produces valid checkpoints. The issue only occurs during the final test evaluation phase, suggesting the core training logic is sound but the post-training evaluation needs better distributed coordination.
Nice debugging! Yeah I think in our research repo there's a sync that happens, guess it didn't make it here. Feel free to submit a PR?
A workaround is to run the eval after training. Seems like that should work?
I have been getting the same error message when I do distributed training on AWS and can't get the validation result. It would be nice if there were support for validation only, i.e.
model.eval(.....)
where you pass the directory to your test/valid split.
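Something along these lines (purely hypothetical; none of these methods or arguments exist in rf-detr today, this only illustrates the request):

```python
# Hypothetical interface sketch -- not an existing rf-detr API.
from rfdetr import RFDETRBase  # class name assumed

model = RFDETRBase(pretrain_weights="output/checkpoint_best_total.pth")  # argument name assumed
metrics = model.evaluate(            # hypothetical method
    dataset_dir="path/to/dataset",   # directory containing the test/valid split
    split="test",
)
print(metrics)
```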