Error when using only a subset of GPUs for single-node multi-GPU training
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
```shell
export CUDA_VISIBLE_DEVICES=0,1,2
deepspeed --num_gpus 3 src/train_bash.py \
    --deepspeed ds_z3_config.json \
    --stage sft \
    --do_train \
    --model_name_or_path /mnt/data/LLaMA-Factory/models/Qwen-14B-Chat \
    --dataset valDataSet_zn \
    --template qwen \
    --finetuning_type lora \
    --lora_target c_attn \
    --output_dir saves/model/Qwen-14B-Chat-lora-v1 \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 50 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
```
ds_z3_config.json:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```
Expected behavior
I have 4× A100 (80 GB). I want to use three of them for training and one for inference. Inference is already running on GPU 3, but training with LLaMA-Factory on GPUs 0, 1, 2 fails with both accelerate and deepspeed.
If I set all 4 GPUs, the training script runs without error. So is single-node multi-GPU training on only a subset of the cards unsupported?
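One detail worth keeping in mind: `CUDA_VISIBLE_DEVICES` renumbers the visible GPUs, so inside the training processes the three cards appear as local devices 0–2 and physical GPU 3 simply does not exist. A minimal sketch of that remapping (the helper name is mine, not from LLaMA-Factory or DeepSpeed):

```python
import os

def visible_device_map(env_value: str) -> dict:
    """Map local device ordinals (what torch/NCCL see) to physical GPU ids.

    With CUDA_VISIBLE_DEVICES=0,1,2 a process only sees local devices
    0..2; anything that still references physical GPU 3 (e.g. an unmasked
    topology description) will fail to find it.
    """
    physical = [int(x) for x in env_value.split(",") if x.strip()]
    return {local: phys for local, phys in enumerate(physical)}

# Example: the launch above masks GPU 3 away.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
mapping = visible_device_map(os.environ["CUDA_VISIBLE_DEVICES"])
print(mapping)  # {0: 0, 1: 1, 2: 2}; local ordinal 3 is undefined
```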
System Info
```
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error: XML Import Channel : dev 3 not found.
```
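The message itself suggests rerunning with `NCCL_DEBUG=INFO`; that log should show where the "XML Import Channel" step picks up a topology referencing device 3. A sketch of the rerun (these are standard NCCL environment variables; the launch line is the one from the reproduction above):

```shell
# Ask NCCL to log its initialization and topology-graph steps so the
# failing "XML Import Channel" lookup can be traced to its source.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH
# then relaunch, e.g.:
# deepspeed --include localhost:0,1,2 src/train_bash.py --deepspeed ds_z3_config.json ...
```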
Others
No response
You can try this:

```shell
deepspeed --include localhost:0,1,2 src/train_bash.py
```
That doesn't work; I get the same error.
Whichever GPU I leave out is the one the error reports as not found.
Has this been fixed? If so, how?