OFA
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 255)
Hi,
When I run "evaluate_vqa_beam.sh test", the following error appears:
/scratch/yz7357/anaconda/envs/OFA/lib/python3.7/site-packages/torch/distributed/launch.py:188: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
2022-12-06 13:32:42 | INFO | torch.distributed.nn.jit.instantiator | Created a temporary directory at /state/partition1/job-27704529/tmpbq7m9imn
2022-12-06 13:32:42 | INFO | torch.distributed.nn.jit.instantiator | Writing /state/partition1/job-27704529/tmpbq7m9imn/_remote_module_non_scriptable.py
2022-12-06 13:32:42 | INFO | torch.distributed.nn.jit.instantiator | Created a temporary directory at /state/partition1/job-27704529/tmptminbnjh
2022-12-06 13:32:42 | INFO | torch.distributed.nn.jit.instantiator | Writing /state/partition1/job-27704529/tmptminbnjh/_remote_module_non_scriptable.py
2022-12-06 13:32:42 | INFO | torch.distributed.nn.jit.instantiator | Created a temporary directory at /state/partition1/job-27704529/tmpx44t8ypz
2022-12-06 13:32:42 | INFO | torch.distributed.nn.jit.instantiator | Writing /state/partition1/job-27704529/tmpx44t8ypz/_remote_module_non_scriptable.py
2022-12-06 13:32:42 | INFO | torch.distributed.nn.jit.instantiator | Created a temporary directory at /state/partition1/job-27704529/tmp9sd587r1
2022-12-06 13:32:42 | INFO | torch.distributed.nn.jit.instantiator | Writing /state/partition1/job-27704529/tmp9sd587r1/_remote_module_non_scriptable.py
2022-12-06 13:33:00 | INFO | fairseq.distributed.utils | distributed init (rank 3): env://
2022-12-06 13:33:00 | INFO | fairseq.distributed.utils | Start init
2022-12-06 13:33:00 | INFO | fairseq.distributed.utils | distributed init (rank 2): env://
2022-12-06 13:33:00 | INFO | fairseq.distributed.utils | Start init
2022-12-06 13:33:00 | INFO | fairseq.distributed.utils | distributed init (rank 1): env://
2022-12-06 13:33:00 | INFO | fairseq.distributed.utils | Start init
2022-12-06 13:33:00 | INFO | fairseq.distributed.utils | distributed init (rank 0): env://
2022-12-06 13:33:00 | INFO | fairseq.distributed.utils | Start init
Retry: 1, with value error <class 'RuntimeError'>
Retry: 1, with value error <class 'RuntimeError'>
Retry: 1, with value error <class 'RuntimeError'>
Retry: 1, with value error <class 'RuntimeError'>
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 255) local_rank: 0 (pid: 761601) of binary: /scratch/yz7357/anaconda/envs/OFA/bin/python3
Traceback (most recent call last):
File "/scratch/yz7357/anaconda/envs/OFA/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/scratch/yz7357/anaconda/envs/OFA/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/scratch/yz7357/anaconda/envs/OFA/lib/python3.7/site-packages/torch/distributed/launch.py", line 195, in <module>
main()
File "/scratch/yz7357/anaconda/envs/OFA/lib/python3.7/site-packages/torch/distributed/launch.py", line 191, in main
launch(args)
File "/scratch/yz7357/anaconda/envs/OFA/lib/python3.7/site-packages/torch/distributed/launch.py", line 176, in launch
run(args)
File "/scratch/yz7357/anaconda/envs/OFA/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
)(*cmd_args)
File "/scratch/yz7357/anaconda/envs/OFA/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/scratch/yz7357/anaconda/envs/OFA/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
../../evaluate.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2022-12-06_13:33:07
host : cs016.hpc.nyu.edu
rank : 1 (local_rank: 1)
exitcode : 255 (pid: 761602)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2022-12-06_13:33:07
host : cs016.hpc.nyu.edu
rank : 2 (local_rank: 2)
exitcode : 255 (pid: 761603)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2022-12-06_13:33:07
host : cs016.hpc.nyu.edu
rank : 3 (local_rank: 3)
exitcode : 255 (pid: 761604)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-12-06_13:33:07
host : cs016.hpc.nyu.edu
rank : 0 (local_rank: 0)
exitcode : 255 (pid: 761601)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
And my script:
#!/usr/bin/env bash
# The port for communication. Note that if you want to run multiple tasks on the same machine,
# you need to specify different port numbers.
export MASTER_PORT=8082
user_dir=../../ofa_module
bpe_dir=../../utils/BPE
# val or test
split=$1
data=../../dataset/vqa_data/vqa_${split}.tsv
ans2label_file=../../dataset/vqa_data/trainval_ans2label.pkl
path=../../checkpoints/vqa_large_best.pt
result_path=../../results/vqa_${split}_beam
selected_cols=0,5,2,3,4
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch --nproc_per_node=4 --master_port=${MASTER_PORT} ../../evaluate.py \
${data} \
--path=${path} \
--user-dir=${user_dir} \
--task=vqa_gen \
--batch-size=16 \
--log-format=simple --log-interval=10 \
--seed=7 \
--gen-subset=${split} \
--results-path=${result_path} \
--fp16 \
--ema-eval \
--beam-search-vqa-eval \
--beam=5 \
--unnormalized \
--temperature=1.0 \
--num-workers=0 \
--model-overrides="{\"data\":\"${data}\",\"bpe_dir\":\"${bpe_dir}\",\"selected_cols\":\"${selected_cols}\",\"ans2label_file\":\"${ans2label_file}\"}"
You might be able to get more information about the error by setting an environment variable before running the script: export NCCL_DEBUG=INFO
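For example, a minimal sketch of how that could look in the shell, assuming you launch from the run_scripts/vqa directory. The log file name nccl_debug.log is only a placeholder I picked, not something OFA defines:

```shell
#!/usr/bin/env bash
# Ask NCCL to print its initialization steps and any communicator errors.
export NCCL_DEBUG=INFO

# Re-run the evaluation only if the script is in the current directory,
# and keep a copy of the combined stdout/stderr for inspection.
# nccl_debug.log is a hypothetical file name chosen for this example.
if [ -f evaluate_vqa_beam.sh ]; then
    bash evaluate_vqa_beam.sh test 2>&1 | tee nccl_debug.log
fi
```

The NCCL lines in the output usually show which rank fails first and at which step, which is more than the bare exitcode 255 tells you.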
Have you solved the problem? How?