VAD
Training Problem
Thank you very much for your team's valuable work. I have successfully evaluated the pre-trained model on the validation set. However, when I try to train the model with the following command:

python -m torch.distributed.run --nproc_per_node=1 --master_port=2333 tools/train.py projects/configs/VAD/VAD_base_stage_2.py --launcher pytorch --deterministic --work-dir /data03/work2_data/VAD/trained-net

an error occurs. The error output is below. I hope you can help me solve it; I would be very grateful.

`Traceback (most recent call last):
File "tools/train.py", line 266, in
CHILD PROCESS FAILED WITH NO ERROR_FILE
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 119961 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:
from torch.distributed.elastic.multiprocessing.errors import record
@record
def trainer_main(args):
    # do train
warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data03/work2_data/tools/anaconda/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 702, in
tools/train.py FAILED
=======================================
Root Cause:
[0]:
  time: 2024-09-04_11:32:12
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 119961)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
Other Failures:
  <NO_OTHER_FAILURES>
`
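
For reference, the decorator suggested in the warning above could be applied roughly as follows. This is only a minimal sketch: `main` is a hypothetical stand-in for the entry point in tools/train.py (its real name and signature are not shown in the log), and the fallback import is there only so the snippet runs without PyTorch installed.

```python
try:
    # Decorator named in the warning message above.
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Assumption for illustration: a no-op replacement when PyTorch is absent.
    def record(fn):
        return fn


@record
def main(argv):
    """Hypothetical stand-in for the training entry point."""
    if "--fail" in argv:
        # Simulates the kind of exception that made the child exit with code 1.
        raise RuntimeError("simulated training failure")
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main(sys.argv[1:]))
```

With the real decorator, and when launched through torch.distributed.run, the child's traceback should be written to the error file instead of being reported as `<N/A>`, which may make the actual failure in tools/train.py easier to locate.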