
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 255)

Open Zhang815 opened this issue 2 years ago • 2 comments

Hi,

When I run "evaluate_vqa_beam.sh test", the following error appears:

/scratch/yz7357/anaconda/envs/OFA/lib/python3.7/site-packages/torch/distributed/launch.py:188: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
2022-12-06 13:32:42 | INFO | torch.distributed.nn.jit.instantiator | Created a temporary directory at /state/partition1/job-27704529/tmpbq7m9imn
2022-12-06 13:32:42 | INFO | torch.distributed.nn.jit.instantiator | Writing /state/partition1/job-27704529/tmpbq7m9imn/_remote_module_non_scriptable.py
2022-12-06 13:32:42 | INFO | torch.distributed.nn.jit.instantiator | Created a temporary directory at /state/partition1/job-27704529/tmptminbnjh
2022-12-06 13:32:42 | INFO | torch.distributed.nn.jit.instantiator | Writing /state/partition1/job-27704529/tmptminbnjh/_remote_module_non_scriptable.py
2022-12-06 13:32:42 | INFO | torch.distributed.nn.jit.instantiator | Created a temporary directory at /state/partition1/job-27704529/tmpx44t8ypz
2022-12-06 13:32:42 | INFO | torch.distributed.nn.jit.instantiator | Writing /state/partition1/job-27704529/tmpx44t8ypz/_remote_module_non_scriptable.py
2022-12-06 13:32:42 | INFO | torch.distributed.nn.jit.instantiator | Created a temporary directory at /state/partition1/job-27704529/tmp9sd587r1
2022-12-06 13:32:42 | INFO | torch.distributed.nn.jit.instantiator | Writing /state/partition1/job-27704529/tmp9sd587r1/_remote_module_non_scriptable.py
2022-12-06 13:33:00 | INFO | fairseq.distributed.utils | distributed init (rank 3): env://
2022-12-06 13:33:00 | INFO | fairseq.distributed.utils | Start init
2022-12-06 13:33:00 | INFO | fairseq.distributed.utils | distributed init (rank 2): env://
2022-12-06 13:33:00 | INFO | fairseq.distributed.utils | Start init
2022-12-06 13:33:00 | INFO | fairseq.distributed.utils | distributed init (rank 1): env://
2022-12-06 13:33:00 | INFO | fairseq.distributed.utils | Start init
2022-12-06 13:33:00 | INFO | fairseq.distributed.utils | distributed init (rank 0): env://
2022-12-06 13:33:00 | INFO | fairseq.distributed.utils | Start init
Retry: 1, with value error <class 'RuntimeError'>
Retry: 1, with value error <class 'RuntimeError'>
Retry: 1, with value error <class 'RuntimeError'>
Retry: 1, with value error <class 'RuntimeError'>
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 255) local_rank: 0 (pid: 761601) of binary: /scratch/yz7357/anaconda/envs/OFA/bin/python3
Traceback (most recent call last):
  File "/scratch/yz7357/anaconda/envs/OFA/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/scratch/yz7357/anaconda/envs/OFA/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/scratch/yz7357/anaconda/envs/OFA/lib/python3.7/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/scratch/yz7357/anaconda/envs/OFA/lib/python3.7/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/scratch/yz7357/anaconda/envs/OFA/lib/python3.7/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/scratch/yz7357/anaconda/envs/OFA/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
    )(*cmd_args)
  File "/scratch/yz7357/anaconda/envs/OFA/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/scratch/yz7357/anaconda/envs/OFA/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
../../evaluate.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-12-06_13:33:07
  host      : cs016.hpc.nyu.edu
  rank      : 1 (local_rank: 1)
  exitcode  : 255 (pid: 761602)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2022-12-06_13:33:07
  host      : cs016.hpc.nyu.edu
  rank      : 2 (local_rank: 2)
  exitcode  : 255 (pid: 761603)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2022-12-06_13:33:07
  host      : cs016.hpc.nyu.edu
  rank      : 3 (local_rank: 3)
  exitcode  : 255 (pid: 761604)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-12-06_13:33:07
  host      : cs016.hpc.nyu.edu
  rank      : 0 (local_rank: 0)
  exitcode  : 255 (pid: 761601)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

And here is my script:

#!/usr/bin/env bash

# The port for communication. Note that if you want to run multiple tasks on the same machine,
# you need to specify different port numbers.
export MASTER_PORT=8082

user_dir=../../ofa_module
bpe_dir=../../utils/BPE

# val or test
split=$1

data=../../dataset/vqa_data/vqa_${split}.tsv
ans2label_file=../../dataset/vqa_data/trainval_ans2label.pkl
path=../../checkpoints/vqa_large_best.pt
result_path=../../results/vqa_${split}_beam
selected_cols=0,5,2,3,4

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch --nproc_per_node=4 --master_port=${MASTER_PORT} ../../evaluate.py \
    ${data} \
    --path=${path} \
    --user-dir=${user_dir} \
    --task=vqa_gen \
    --batch-size=16 \
    --log-format=simple --log-interval=10 \
    --seed=7 \
    --gen-subset=${split} \
    --results-path=${result_path} \
    --fp16 \
    --ema-eval \
    --beam-search-vqa-eval \
    --beam=5 \
    --unnormalized \
    --temperature=1.0 \
    --num-workers=0 \
    --model-overrides="{\"data\":\"${data}\",\"bpe_dir\":\"${bpe_dir}\",\"selected_cols\":\"${selected_cols}\",\"ans2label_file\":\"${ans2label_file}\"}"

Zhang815 avatar Dec 06 '22 18:12 Zhang815

You might be able to get more information about the error by setting the environment variable NCCL_DEBUG, e.g. export NCCL_DEBUG=INFO, before running the script.
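As a rough sketch of that suggestion, the variable could be exported in the same shell before re-running the evaluation script. The optional NCCL_DEBUG_SUBSYS filter, the log file name nccl_debug.log, and the assumption that the command is run from run_scripts/vqa are illustrative choices, not something taken from the original report:

```shell
#!/usr/bin/env bash
# Sketch only: enable verbose NCCL logging, then re-run the failing command.
export NCCL_DEBUG=INFO             # print NCCL initialization and communication details
export NCCL_DEBUG_SUBSYS=INIT,NET  # optional (assumption): limit output to init and network setup

# Assumes the current directory is run_scripts/vqa; the log file name is hypothetical.
if [ -f ./evaluate_vqa_beam.sh ]; then
    bash ./evaluate_vqa_beam.sh test 2>&1 | tee nccl_debug.log
else
    echo "evaluate_vqa_beam.sh not found; run this from run_scripts/vqa" >&2
fi
```

With this in place, each rank should print NCCL lines during startup, which may show whether the RuntimeError behind the "Retry" messages comes from the distributed initialization step.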

JackCai1206 avatar Jan 10 '23 05:01 JackCai1206

Have you solved the problem? How?

SilyRab avatar Mar 25 '23 07:03 SilyRab