UniAD
Dataloader worker killed with runtime error.
Hello,
While training the stage-two network, I'm seeing the following error.
Is anyone seeing the same error?
Traceback (most recent call last):
File "./tools/train.py", line 256, in <module>
CHILD PROCESS FAILED WITH NO ERROR_FILE
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 1099909 (local_rank 1) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:
from torch.distributed.elastic.multiprocessing.errors import record
@record
def trainer_main(args):
    # do train
warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/torch/distributed/run.py", line 702, in
./tools/train.py FAILED
=======================================
Root Cause:
[0]:
time: 2023-07-07_12:12:31
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 1099909)
error_file: <N/A>
msg: "Process failed with exitcode 1"
Other Failures:
<NO_OTHER_FAILURES>
Thanks for your attention. I'm training this on an AWS EC2 instance (g5-12x) with 4 A10 GPUs.
Regards, Venkat
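The warning in the log above recommends decorating the top-level entrypoint with `record` so that a failing rank writes its traceback to an error file. A minimal sketch of that suggestion follows. `trainer_main` and `args` are the placeholder names from the warning text, not UniAD's actual entrypoint, and the pass-through fallback for environments without PyTorch is an assumption added only so the sketch runs anywhere.

```python
import functools

try:
    # Import path quoted verbatim from the PyTorch warning in the log.
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Assumption for illustration only: without PyTorch installed, fall back
    # to a pass-through decorator so the sketch still runs.
    def record(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)
        return wrapper


@record
def trainer_main(args):
    """Placeholder entrypoint; a real script would build the model and train here."""
    return f"trained with {args}"


if __name__ == "__main__":
    print(trainer_main({"config": "example"}))
```

With the decorator in place, an exception raised inside the entrypoint should be recorded by the launcher instead of showing up only as `exitcode 1`.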
Hello, I have the same problem. Have you solved it?
(uniad) ➜ UniAD git:(dev) ./tools/uniad_dist_train.sh ./projects/configs/stage1_track_map/base_track_map.py 1
projects.mmdet3d_plugin
Traceback (most recent call last):
File "./tools/train.py", line 256, in <module>
main()
File "./tools/train.py", line 173, in main
cfg.dump(osp.join(cfg.work_dir, osp.basename(args.config)))
File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/mmcv/utils/config.py", line 541, in dump
f.write(self.pretty_text)
File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/mmcv/utils/config.py", line 496, in pretty_text
text, _ = FormatCode(text, style_config=yapf_style, verify=True)
TypeError: FormatCode() got an unexpected keyword argument 'verify'
/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 59905) of binary: /usr/miniconda3/envs/uniad/bin/python
Traceback (most recent call last):
File "/usr/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
elastic_launch(
File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
***************************************
./tools/train.py FAILED
=======================================
Root Cause:
[0]:
time: 2023-10-27_14:46:08
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 59905)
error_file: <N/A>
msg: "Process failed with exitcode 1"
=======================================
Other Failures:
<NO_OTHER_FAILURES>
***************************************
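The FutureWarning in this log asks scripts to read the local rank from the `LOCAL_RANK` environment variable rather than from a `--local_rank` argument. A hedged sketch of that pattern is below. The helper name `resolve_local_rank` and the default of 0 are assumptions for illustration, and this is not necessarily how UniAD's `tools/train.py` handles it.

```python
import argparse
import os


def resolve_local_rank(argv=None, env=None):
    """Return the local rank, preferring LOCAL_RANK from the environment.

    Falls back to a legacy --local_rank argument, then to 0 (assumed default).
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--local_rank", type=int, default=0)
    # parse_known_args ignores the script's other arguments.
    args, _unknown = parser.parse_known_args(argv)
    if "LOCAL_RANK" in env:
        return int(env["LOCAL_RANK"])
    return args.local_rank


if __name__ == "__main__":
    print(resolve_local_rank())
```

Preferring the environment variable keeps the script working under `torch.distributed.run`, while the argument fallback still accepts the older launcher's flag.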
Have you solved this?
Hello,
I did solve this problem. May I ask when you are hitting this issue?
If I remember correctly, I was hitting this issue during the validation check, and setting the following environment variable fixed it.
NCCL_P2P_DISABLE=1
Thanks, Venkat
On Mon, Jan 8, 2024 at 7:48 AM xiexu666 @.***> wrote:
Have you solved this ?
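The `NCCL_P2P_DISABLE=1` setting mentioned in this reply is an environment variable, so it has to be in the process environment before NCCL is initialized. One option is to prefix the launch command with it. Another, sketched below, is to set it at the top of the training script. The helper name `disable_nccl_p2p` is hypothetical, and whether the setting helps depends on the hardware.

```python
import os


def disable_nccl_p2p(env=None):
    """Set NCCL_P2P_DISABLE=1 unless the caller already chose a value."""
    env = os.environ if env is None else env
    env.setdefault("NCCL_P2P_DISABLE", "1")
    return env["NCCL_P2P_DISABLE"]


if __name__ == "__main__":
    # Call this before any torch.distributed / NCCL initialization.
    print("NCCL_P2P_DISABLE =", disable_nccl_p2p())
```

Using `setdefault` leaves an explicit value from the shell untouched, so the variable can still be overridden per run.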
@xiexu666 @daxiongpro
Hello,
Run the following commands to resolve this problem:
$ pip uninstall yapf
$ pip install yapf==0.40.1
Reference: https://blog.csdn.net/ZZZZ_Y_/article/details/133902230
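The `TypeError` earlier in the thread arises because mmcv calls `FormatCode(..., verify=True)` while the installed yapf no longer accepts that keyword, which is why pinning yapf to 0.40.1 helps. A small sketch for checking this before launching training is below. `accepts_keyword` is a hypothetical helper, and the yapf import is guarded so the snippet also runs where yapf is not installed.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with keyword argument `name`."""
    params = inspect.signature(func).parameters
    param = params.get(name)
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    if param is not None and param.kind in keyword_kinds:
        return True
    # A **kwargs catch-all also accepts arbitrary keywords.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


if __name__ == "__main__":
    try:
        from yapf.yapflib.yapf_api import FormatCode
    except ImportError:
        print("yapf is not installed")
    else:
        if accepts_keyword(FormatCode, "verify"):
            print("FormatCode accepts 'verify'; cfg.dump should not hit this TypeError")
        else:
            print("FormatCode rejects 'verify'; consider: pip install yapf==0.40.1")
```

Inspecting the signature avoids hard-coding a version cutoff, so the check stays valid whichever yapf release happens to be installed.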