Full-parameter fine-tuning of LLaMA on a single machine with multiple GPUs: RuntimeError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
deepspeed --num_gpus=8 --master_port=9901 src/train_bash.py \
    --deepspeed /cpfs01/shared/Group-m6/dongguanting.dgt/LLaMA-Factory-main/config/ds_2.json \
    --stage sft \
    --do_train \
    --model_name_or_path $path_to_llama_model \
    --dataset $dataset \
    --template alpaca \
    --finetuning_type full \
    --output_dir $output_dir \
    --overwrite_cache \
    --overwrite_output_dir \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type constant \
    --logging_steps 1 \
    --save_total_limit 1 \
    --save_strategy epoch \
    --learning_rate 1e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --bf16 True \
    --tf32 True
ds_2.json is as follows:
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": {
    "enabled": "auto"
  },
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": false,
    "contiguous_gradients": true
  }
}
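For reference, the "auto" placeholders above are filled in at launch time by the HF Trainer / DeepSpeed integration from the command-line arguments. A rough sketch of what they resolve to for the command above, assuming all 8 GPUs join the run (train_batch_size = 8 per-device x 4 accumulation steps x 8 GPUs = 256, bf16 taken from --bf16 True):

{
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 4,
  "train_batch_size": 256,
  "bf16": { "enabled": true },
  "fp16": { "enabled": false }
}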
Expected behavior
No response
System Info
Running tokenizer on dataset:   1%|▏ | 1000/69639 [00:15<17:17, 66.15 examples/s]
Traceback (most recent call last):
File "/cpfs01/shared/Group-m6/dongguanting.dgt/LLaMA-Factory-main/src/train_bash.py", line 14, in
Others
No response
So, is there a standard command for full-parameter fine-tuning on a single machine with multiple GPUs?