Has anyone encountered this problem when running the training command
./tools/dist_train.sh ./projects/configs/bevformer/bevformer_base.py 1
on a single GPU? What should be done to resolve the error below?
Traceback (most recent call last):
File "./tools/train.py", line 263, in
main()
File "./tools/train.py", line 252, in main
custom_train_model(
File "/root/autodl-tmp/BEVFormer-master/projects/mmdet3d_plugin/bevformer/apis/train.py", line 27, in custom_train_model
custom_train_detector(
File "/root/autodl-tmp/BEVFormer-master/projects/mmdet3d_plugin/bevformer/apis/mmdet_train.py", line 107, in custom_train_detector
assert cfg.total_epochs == cfg.runner.max_epochs
AssertionError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 901) of binary: /root/miniconda3/envs/bev/bin/python
Traceback (most recent call last):
File "/root/miniconda3/envs/bev/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/miniconda3/envs/bev/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/root/miniconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/root/miniconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/root/miniconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/root/miniconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/root/miniconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./tools/train.py FAILED
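
The actual failure is the check assert cfg.total_epochs == cfg.runner.max_epochs in mmdet_train.py, so the two epoch settings in the config file no longer agree (the distributed launcher errors afterwards are just the worker exiting). A minimal sketch of what the relevant lines of a consistent config would look like; the value 24 is only an assumption, any number works as long as both fields match:

# Epoch settings must agree, otherwise the assert in
# custom_train_detector() (mmdet_train.py, line 107) fails.
total_epochs = 24  # assumed value; use whatever you actually train for
runner = dict(type='EpochBasedRunner', max_epochs=total_epochs)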