stanford_alpaca
Signal 7 error while finetuning with deepspeed
I am trying to run the finetuning script on 8x 32GB V100 GPUs, using torchrun with DeepSpeed (both parameter and optimizer offload enabled) and a few minor modifications to the command:
torchrun --nproc_per_node=8 --master_port=3030 train.py \
--model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
--data_path ./alpaca_data.json \
--fp16 True \
--output_dir output \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--deepspeed "./configs/default_opt_param.json"
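For reference, a DeepSpeed config with both offloads enabled is typically a ZeRO-3 config along the lines of the minimal sketch below (field names from the DeepSpeed documentation; the actual contents of ./configs/default_opt_param.json may differ, and the example filename is hypothetical):
# Hypothetical example of a ZeRO-3 config with parameter and optimizer offload
# (illustrative only -- not necessarily identical to ./configs/default_opt_param.json).
cat > ./configs/zero3_offload_example.json <<'EOF'
{
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
EOF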
I am running into the following errors:
Traceback (most recent call last):
File "/root/chat-llm/stanford_alpaca/train.py", line 222, in <module>
train()
File "/root/chat-llm/stanford_alpaca/train.py", line 186, in train
model = transformers.LlamaForCausalLM.from_pretrained(
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2498, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 382, in wrapper
f(module, *args, **kwargs)
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 659, in __init__
self.model = LlamaModel(config)
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 382, in wrapper
f(module, *args, **kwargs)
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 463, in __init__
self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 389, in wrapper
self._post_init_method(module)
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 782, in _post_init_method
dist.broadcast(param, 0, self.ds_process_group)
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 120, in log_wrapper
return func(*args, **kwargs)
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 217, in broadcast
return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 81, in broadcast
return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1436, in wrapper
return func(*args, **kwargs)
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1551, in broadcast
work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Net : Call to recv from 10.233.121.250<45143> failed : Connection reset by peer
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 36604 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 36605 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 36601) of binary: /root/chat-llm/stanford_alpaca/venv/bin/python3.10
Traceback (most recent call last):
File "/root/chat-llm/stanford_alpaca/venv/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
train.py FAILED
-----------------------------------------------------
-----------------------------------------------------
Failures:
[1]:
time : 2023-04-18_15:47:13
host : usethi-fullnode-alpaca-finetune-fml5b
rank : 1 (local_rank: 1)
exitcode : -7 (pid: 36602)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 36602
[2]:
time : 2023-04-18_15:47:13
host : usethi-fullnode-alpaca-finetune-fml5b
rank : 2 (local_rank: 2)
exitcode : -7 (pid: 36603)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 36603
[3]:
time : 2023-04-18_15:47:13
host : usethi-fullnode-alpaca-finetune-fml5b
rank : 5 (local_rank: 5)
exitcode : -7 (pid: 36606)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 36606
[4]:
time : 2023-04-18_15:47:13
host : usethi-fullnode-alpaca-finetune-fml5b
rank : 6 (local_rank: 6)
exitcode : -7 (pid: 36607)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 36607
[5]:
time : 2023-04-18_15:47:13
host : usethi-fullnode-alpaca-finetune-fml5b
rank : 7 (local_rank: 7)
exitcode : -7 (pid: 36608)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 36608
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-18_15:47:13
host : usethi-fullnode-alpaca-finetune-fml5b
rank : 0 (local_rank: 0)
exitcode : -7 (pid: 36601)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 36601
=====================================================
Here is my nvcc version:
$nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
and my NCCL version:
$python -c "import torch;print(torch.cuda.nccl.version())"
(2, 14, 3)
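For what it's worth, the traceback dies in the very first dist.broadcast that ZeRO-3 issues while partitioning the embedding weights, and Signal 7 (SIGBUS) in a containerized multi-GPU job is often a symptom of NCCL's shared-memory transport running out of space in /dev/shm (Docker and Kubernetes default it to 64 MB). A sketch of how one might narrow this down, assuming a Linux host (the script name nccl_smoke_test.py is made up for the test):
# 1. Check how much shared memory the container actually has.
df -h /dev/shm
# 2. Reproduce the failing collective without DeepSpeed or the model.
cat > nccl_smoke_test.py <<'EOF'
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

# Same collective that fails in the traceback above.
t = torch.full((1,), float(dist.get_rank()), device="cuda")
dist.broadcast(t, src=0)
print(f"rank {dist.get_rank()}: broadcast ok, got {t.item()}")
dist.destroy_process_group()
EOF
NCCL_DEBUG=INFO torchrun --nproc_per_node=8 --master_port=3030 nccl_smoke_test.py
# If this also dies with SIGBUS, enlarge /dev/shm (e.g. docker run --shm-size=16g,
# or on Kubernetes an emptyDir volume with medium: Memory mounted at /dev/shm)
# rather than changing the training script.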
Please let me know if I can provide any other information to identify the source of this issue. I would highly appreciate any help or guidance on how to make this work.
Same here, but with ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9).
I also encountered this issue with exitcode: -9. Are there any updates on this?
Same issue here. My setup: 4x 16GB V100, 128GB CPU RAM. How can this be solved?
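On the exitcode: -9 reports above: -9 means the process was killed with SIGKILL, which with ZeRO-3 CPU offload is very often the Linux OOM killer reclaiming host RAM (offloading a 7B model's fp32 optimizer states needs on the order of 100 GB of CPU memory). A quick check, assuming a Linux host where the kernel log is readable:
# Was the training process killed by the kernel OOM killer? Run right after a failure.
dmesg -T | grep -i -E "out of memory|oom-kill" | tail -n 20
# Watch host RAM during the run; with CPU offload the pressure is on RAM, not GPU memory.
free -h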
me too
Try a single GPU first to rule out inter-GPU communication: modify the parameter --nproc_per_node=1.
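For example, the same launch as the original post with only the process count changed (a sketch; batch size and offload settings may still need tuning to fit one card):
torchrun --nproc_per_node=1 --master_port=3030 train.py \
--model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
--data_path ./alpaca_data.json \
--fp16 True \
--output_dir output \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--deepspeed "./configs/default_opt_param.json"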
Any solution? Same error on 2x RTX 3090 :(