torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 760)
Hi everybody,
I tried to deploy the llama2 model in a PyTorch/CUDA environment:
- CUDA version: 12.1
- ID of current CUDA device: 0
- Name of current CUDA device: Quadro P4000
but I ran into the following issue. Does anyone have an idea of what's wrong?
torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 6
initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
[2023-10-26 11:56:24,266] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 2283) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/user/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
example_chat_completion.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2023-10-26_11:56:22
  host      :
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 2283)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 2283
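For what it's worth, exitcode -9 means the child process was killed with SIGKILL from outside Python, which is why torchrun has no Python traceback to show; on Linux the usual culprit is the kernel OOM killer firing while torch.load reads the fp16 7B checkpoint (roughly 13-14 GB) into host RAM. Here is a minimal sketch for checking free RAM before launching torchrun, assuming the third-party psutil package is installed (psutil and the 13.5 GB figure are my assumptions, not something from this repro):

import psutil

# Approximate size of the fp16 llama-2-7b checkpoint; torch.load materializes
# it in host RAM before anything is moved to the GPU.
CKPT_GB = 13.5

avail_gb = psutil.virtual_memory().available / 1024**3
print(f"Available host RAM: {avail_gb:.1f} GB (checkpoint needs ~{CKPT_GB} GB to load)")
if avail_gb < CKPT_GB:
    print("Likely not enough free RAM: expect the loader to be OOM-killed (exit -9).")

If the printed number is well below the checkpoint size, shrinking --max_seq_len or --max_batch_size is unlikely to help, because the kill happens while the weights are being read, before any batch is processed.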
Runtime Environment
- Model: llama-2-7b-chat
- Using via huggingface?: [yes/no]
- OS: Ubuntu
- GPU VRAM: 8GB
- Number of GPUs: 4
- GPU Make: Nvidia Quadro P4000
Facing the same issue here.
Please share the full stack trace, which contains the actual error.
When I hit this issue, I actually don't see any other stack trace. The full log starts with torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9), which is the same as the post here.
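Since the SIGKILL comes from the kernel rather than from Python, there is genuinely no further stack trace for torchrun to print; the evidence usually lands in the kernel log instead. A small sketch for checking it, assuming a Linux host where dmesg output is readable by the current user (on some systems it requires root):

import subprocess

# Look for OOM-killer entries left behind after an exitcode of -9.
log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
oom_lines = [line for line in log.splitlines()
             if "Out of memory" in line or "oom-kill" in line]
print("\n".join(oom_lines) or "No OOM-killer entries found in dmesg.")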
Facing the same issue here!
Following the same steps; my environment is a Google Colab T4 GPU.
In my case, the CPU (host RAM) running out of memory seems to contribute to it.
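If host RAM (or the 8 GB of VRAM on the P4000/T4) is the limit, one workaround is to load the weights quantized instead of in fp16. Below is a minimal sketch of that idea via the Hugging Face transformers route (meta-llama/Llama-2-7b-chat-hf) rather than the Meta reference code; it assumes the transformers, accelerate and bitsandbytes packages are installed and that you have access to the gated repo, so treat it as a sketch rather than a drop-in replacement for example_chat_completion.py:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated repo: requires access approval

# 4-bit quantization keeps the 7B weights around 4 GB, which fits an 8 GB GPU
# and greatly reduces peak host RAM while loading.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",        # let accelerate place layers on GPU/CPU
    low_cpu_mem_usage=True,   # stream weights instead of building a second full copy in RAM
)

prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))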