llama3
Ran llama3-70b-instruct on 8x A100-40G and got the following error:

[2024-04-22 10:52:15,696] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-04-22 10:52:15,696] torch.distributed.run: [WARNING] *****************************************
initializing model parallel with size 8
initializing ddp with size 1
initializing pipeline with size 1
[2024-04-22 10:53:55,894] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7159 closing signal SIGTERM
[2024-04-22 10:53:55,966] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7160 closing signal SIGTERM
[2024-04-22 10:53:55,966] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7161 closing signal SIGTERM
[2024-04-22 10:53:55,967] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7162 closing signal SIGTERM
[2024-04-22 10:53:55,967] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7163 closing signal SIGTERM
[2024-04-22 10:53:55,967] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7164 closing signal SIGTERM
[2024-04-22 10:53:55,968] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7165 closing signal SIGTERM
[2024-04-22 10:53:58,513] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 7158) of binary: /home/vipuser/anaconda3/envs/llm/bin/python3.10
Traceback (most recent call last):
  File "/home/vipuser/anaconda3/envs/llm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
example_text_completion.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2024-04-22_10:53:55
  host      : pc_0
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 7158)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 7158
python-BaseException
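For what it's worth, exitcode -9 means rank 0 received SIGKILL from outside the process, which on Linux is most often the kernel OOM killer running out of host RAM rather than a GPU error: each of the 8 ranks loads its checkpoint shard into CPU memory before moving it to its A100, and for the bf16 70B weights that is roughly 140 GB in total. A minimal sanity check you can run on the host before launching (plain /proc/meminfo, no extra dependencies; treat the numbers as rough guidance, not an exact requirement):

# Print total/available RAM and swap from /proc/meminfo (Linux only).
def mem_gb(field: str) -> float:
    # /proc/meminfo reports sizes in kB.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1]) / 1e6
    return 0.0

print(f"MemTotal:     {mem_gb('MemTotal'):.1f} GB")
print(f"MemAvailable: {mem_gb('MemAvailable'):.1f} GB")
print(f"SwapTotal:    {mem_gb('SwapTotal'):.1f} GB")

If MemAvailable plus swap is well below ~140 GB, adding swap or moving to a host with more RAM is usually what resolves this kind of SIGKILL during model loading.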
What run command are you using?
torchrun --nproc_per_node 8 example_text_completion.py --ckpt_dir Meta-Llama-3-70B-Instruct/ --tokenizer_path Meta-Llama-3-70B-Instruct/tokenizer.model --max_seq_len 512 --max_batch_size 8
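The command itself looks consistent with the 70B instruct checkpoint as far as I can tell: --nproc_per_node 8 has to match the checkpoint's 8-way model-parallel sharding, and max_seq_len 512 / max_batch_size 8 are modest. As a quick sanity check that all eight A100s are visible to the environment torchrun will use, here is a small sketch using only plain torch calls (nothing specific to this repo); run it in the same conda env before launching:

import torch

# Confirm CUDA is available and report the GPUs the 8 ranks will bind to.
assert torch.cuda.is_available(), "CUDA not visible in this environment"
count = torch.cuda.device_count()
print(f"visible GPUs: {count}")  # should be 8 for --nproc_per_node 8
for i in range(count):
    props = torch.cuda.get_device_properties(i)
    print(f"  cuda:{i}: {props.name}, {props.total_memory / 2**30:.1f} GiB")

If this prints eight A100s but the host-RAM check above comes up short, the SIGKILL points at CPU memory during checkpoint loading rather than at the launch command.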
any update lately?