LLaMA-Factory
During multi-node multi-GPU training, the run fails with a `socketFinalizeAccept: wrong type 3 != 4` error. However, if I set `NCCL_IB_DISABLE=1` when launching, the program runs normally. How should I fix this error?
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
test.yaml
### model
model_name_or_path: /home/data/Meta-Llama-3-8B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
use_dora: true
flash_attn: auto

### dataset
dataset: Survey_Gen
template: llama3
cutoff_len: 1024
max_samples: 1000
val_size: 0
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /home/cx/LLaMA-Factory/saves/LLaMA3-8B-Chat/lora/test
logging_steps: 5
save_steps: 50
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 100
lr_scheduler_type: cosine
warmup_steps: 0.1
fp16: true
lora_rank: 8
lora_alpha: 16
lora_dropout: 0
optim: adamw_torch
ddp_find_unused_parameters: false

### ddp
ddp_timeout: 180000000
multi_config.yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_process_ip: 49.122.1.6
main_process_port: 29555
main_training_function: main
mixed_precision: fp16
num_machines: 2  # the number of nodes
num_processes: 2  # the number of GPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
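For context, both nodes are launched with `accelerate launch` using the config above. A minimal launch sketch, assuming LLaMA-Factory's training entry point is used (the script path `src/train.py` and the `--machine_rank 1` override on the second node are assumptions, not taken from this report):

```bash
# Node 0 (main process, 49.122.1.6); machine_rank: 0 comes from multi_config.yaml
accelerate launch --config_file multi_config.yaml src/train.py test.yaml

# Node 1: same command, overriding the rank for the second machine
accelerate launch --config_file multi_config.yaml --machine_rank 1 src/train.py test.yaml
```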
Expected behavior
Multi-node multi-GPU training should run without error. Currently it fails with `socketFinalizeAccept: wrong type 3 != 4`, and only works when I set `NCCL_IB_DISABLE=1` at launch. How should I fix this error?
System Info
[rank0]:   File "/home/cx/anaconda3/envs/RAG/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3683, in barrier
[rank0]:     work = default_pg.barrier(opts=opts)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, internal error - please report this issue to the NCCL developers, NCCL version 2.20.5
[rank0]: ncclInternalError: Internal check failed.
[rank0]: Last error:
[rank0]: socketFinalizeAccept: wrong type 3 != 4
E0508 21:35:50.822000 135813143611200 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 3699464) of binary: /home/cx/anaconda3/envs/RAG/bin/python
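Since the bootstrap succeeds once InfiniBand is disabled, the failure appears to be in the IB/socket handshake between the two nodes. A hedged diagnostic sketch using standard NCCL environment variables (`NCCL_DEBUG`, `NCCL_SOCKET_IFNAME`, and `NCCL_IB_HCA` are documented NCCL variables; the interface names `eth0` and `mlx5_0` are placeholders, not values from this cluster):

```bash
# Current workaround: fall back from InfiniBand to plain TCP sockets
export NCCL_IB_DISABLE=1

# To debug the IB path instead of disabling it (placeholder interface names):
export NCCL_DEBUG=INFO            # log which transports/interfaces NCCL selects
export NCCL_SOCKET_IFNAME=eth0    # pin the interface used for NCCL bootstrap
export NCCL_IB_HCA=mlx5_0         # pin the IB HCA once the correct device is known
```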
Others
No response