
torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 760)

Open · bechellis opened this issue 1 year ago · 5 comments

Hi everybody,

I tried to deploy the llama-2 model in a PyTorch/CUDA environment:

CUDA version: 12.1
ID of current CUDA device: 0
Name of current CUDA device: Quadro P4000

but I ran into the following error. Does anyone have an idea of what's wrong?

```
torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 6
```

```
initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
[2023-10-26 11:56:24,266] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 2283) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/user/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
example_chat_completion.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-26_11:56:22
  host      :
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 2283)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 2283
=====================================================
```
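For what it's worth, exitcode -9 means the worker received SIGKILL rather than raising a Python exception, and on Linux that is most often the kernel OOM killer firing while torch.load stages the full checkpoint in host RAM. A quick pre-flight check I can run (a minimal sketch, assuming a Linux host; the ~13 GiB figure is a rough estimate for a 7B fp16 checkpoint, not an exact requirement):

```python
# Rough pre-flight check: is there enough free host RAM to stage
# a 7B fp16 checkpoint (~7e9 params * 2 bytes ~= 13 GiB)?
def mem_available_gib() -> float:
    # /proc/meminfo is Linux-only; MemAvailable is reported in kB.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / 1024**2
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

NEEDED_GIB = 13.0  # rough estimate; loading can transiently need even more

avail = mem_available_gib()
print(f"MemAvailable: {avail:.1f} GiB (rough need: ~{NEEDED_GIB:.0f} GiB)")
if avail < NEEDED_GIB:
    print("Likely too little host RAM -- expect SIGKILL (-9) from the OOM killer.")
```

If the kernel log is readable, dmesg typically contains a line like "Out of memory: Killed process ..." when the OOM killer was responsible.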

Runtime Environment

  • Model: llama-2-7b-chat
  • Using via huggingface?: no
  • OS: Ubuntu
  • GPU VRAM: 8GB
  • Number of GPUs: 4
  • GPU Make: Nvidia Quadro P4000
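
For context on those numbers: llama-2-7b in fp16 takes about 2 bytes per parameter, so the weights alone are roughly 13 GiB, which already exceeds the 8 GB of VRAM on a single Quadro P4000, and torch.load also needs a comparable amount of free host RAM while staging the checkpoint. Back-of-the-envelope arithmetic (a sketch; the figures are approximations):

```python
# Back-of-the-envelope memory estimate for llama-2-7b in fp16.
params = 7e9          # ~7 billion parameters
bytes_per_param = 2   # fp16
weights_gib = params * bytes_per_param / 1024**3
print(f"fp16 weights alone: ~{weights_gib:.1f} GiB")  # ~13.0 GiB

vram_gib = 8  # Quadro P4000
print("fits in VRAM:", weights_gib <= vram_gib)       # False
```

Note also that with --nproc_per_node 1 only one of the four GPUs is used, so the extra cards do not help here.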

bechellis · Oct 26 '23 10:10

Facing the same issue here.

WieMaKa · Oct 29 '23 03:10

Please share the full stack trace that contains the actual error.

subramen · Nov 01 '23 16:11

When I see this issue, I actually don't see any other stack trace; the full log starts with torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9), which is the same as in this post.
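
That is expected with a SIGKILL: the kernel terminates the process before Python can print a traceback, so the elastic error line is all you get. The negative exitcode is just the signal number, which you can decode (a small sketch):

```python
import signal

exitcode = -9  # as reported by torch.distributed.elastic
print(signal.Signals(-exitcode).name)  # SIGKILL
```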

lmelinda · Feb 27 '24 18:02

Facing the same issue here!

I followed the same steps, and the environment is Google Colab with a T4 GPU.

amew0 · Mar 23 '24 22:03

In my case, the CPU going out of memory (host RAM exhaustion) seems to contribute to it.
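
One mitigation worth trying is memory-mapping the checkpoint so tensors are paged in on demand instead of being read fully into host RAM first (a minimal sketch, assuming PyTorch >= 2.1, where torch.load accepts mmap=True, and a checkpoint saved in the default zipfile format; the path is illustrative):

```python
import torch

# Memory-map the checkpoint instead of reading it all into host RAM.
# Pages are faulted in on demand, which can keep peak memory low enough
# to avoid the OOM killer on machines with little free RAM.
ckpt = torch.load(
    "llama-2-7b-chat/consolidated.00.pth",  # illustrative path
    map_location="cpu",
    mmap=True,
)
print(f"loaded {len(ckpt)} entries via mmap")
```

More free RAM (or swap) is the other lever; mmap only helps if the checkpoint was saved in the zipfile serialization format, which has been the default since PyTorch 1.6.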

lmelinda · Mar 25 '24 14:03