Error when using only a subset of GPUs for single-node multi-GPU training
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
```shell
export CUDA_VISIBLE_DEVICES=0,1,2
deepspeed --num_gpus 3 src/train_bash.py \
    --deepspeed ds_z3_config.json \
    --stage sft \
    --do_train \
    --model_name_or_path /mnt/data/LLaMA-Factory/models/Qwen-14B-Chat \
    --dataset valDataSet_zn \
    --template qwen \
    --finetuning_type lora \
    --lora_target c_attn \
    --output_dir saves/model/Qwen-14B-Chat-lora-v1 \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 50 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
```
ds_z3_config.json:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```
Expected behavior
I have 4× A100 (80 GB). I want to use three of them for training and one for inference. Inference is already running on GPU 3, but training with LLaMA-Factory on GPUs 0, 1, 2 fails with both accelerate and deepspeed.
If I set all 4 GPUs, the training script runs without error. So is single-node multi-GPU training on only a subset of the cards unsupported?
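One detail worth keeping in mind: `CUDA_VISIBLE_DEVICES` renumbers the visible GPUs, so inside the training processes the three cards appear as local devices 0–2 and physical GPU 3 simply does not exist. A minimal sketch of that remapping (the helper name is mine, not from LLaMA-Factory or DeepSpeed):

```python
import os

def visible_device_map(env_value: str) -> dict:
    """Map local device ordinals (what torch/NCCL see) to physical GPU ids.

    With CUDA_VISIBLE_DEVICES=0,1,2 a process only sees local devices
    0..2; anything that still references physical GPU 3 (e.g. an unmasked
    topology description) will fail to find it.
    """
    physical = [int(x) for x in env_value.split(",") if x.strip()]
    return {local: phys for local, phys in enumerate(physical)}

# Example: the launch above masks GPU 3 away.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
mapping = visible_device_map(os.environ["CUDA_VISIBLE_DEVICES"])
print(mapping)  # {0: 0, 1: 1, 2: 2}; local ordinal 3 is undefined
```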
System Info
```
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error: XML Import Channel : dev 3 not found.
```
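The message itself suggests rerunning with `NCCL_DEBUG=INFO`; that log should show where the "XML Import Channel" step picks up a topology referencing device 3. A sketch of the rerun (these are standard NCCL environment variables; the launch line is the one from the reproduction above):

```shell
# Ask NCCL to log its initialization and topology-graph steps so the
# failing "XML Import Channel" lookup can be traced to its source.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH
# then relaunch, e.g.:
# deepspeed --include localhost:0,1,2 src/train_bash.py --deepspeed ds_z3_config.json ...
```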
Others
No response
You can try this:

```shell
deepspeed --include localhost:0,1,2 src/train_bash.py
```
That doesn't work; I get the same error.
Whichever GPU I leave out is the one the error reports as not found.
Has this been fixed? If so, how?