
When running multi-node, multi-GPU training, the program fails with the error socketFinalizeAccept: wrong type 3 != 4. However, if I set NCCL_IB_DISABLE=1 when launching, the program runs normally. How should I fix this error?

Open mumu029 opened this issue 9 months ago • 0 comments

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

test.yaml

model

model_name_or_path: /home/data/Meta-Llama-3-8B-Instruct

method

stage: sft
do_train: true
finetuning_type: lora
lora_target: all
use_dora: true
flash_attn: auto

dataset

dataset: Survey_Gen
template: llama3
cutoff_len: 1024
max_samples: 1000
val_size: 0
overwrite_cache: true
preprocessing_num_workers: 16

output

output_dir: /home/cx/LLaMA-Factory/saves/LLaMA3-8B-Chat/lora/test
logging_steps: 5
save_steps: 50
plot_loss: true
overwrite_output_dir: true

train

per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 100
lr_scheduler_type: cosine
warmup_steps: 0.1
fp16: true
lora_rank: 8
lora_alpha: 16
lora_dropout: 0
optim: adamw_torch
ddp_find_unused_parameters: false

ddp

ddp_timeout: 180000000

multi_config.yaml

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_process_ip: 49.122.1.6
main_process_port: 29555
main_training_function: main
mixed_precision: fp16
num_machines: 2 # the number of nodes
num_processes: 2 # the number of GPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
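For reference, a minimal sketch of how a config like this is typically used to start the run on each node. The script path and CLI flags below are assumptions (they are not given in the issue); adjust them to your actual entry point:

```bash
# Node 0 (main process, IP 49.122.1.6 from the config above)
accelerate launch --config_file multi_config.yaml --machine_rank 0 src/train.py test.yaml

# Node 1 (override the machine rank from the shared config)
accelerate launch --config_file multi_config.yaml --machine_rank 1 src/train.py test.yaml
```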

Expected behavior

When running multi-node, multi-GPU training, the program fails with the error socketFinalizeAccept: wrong type 3 != 4. However, if I set NCCL_IB_DISABLE=1 when launching, the program runs normally. How should I fix this error?
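For context, a sketch of the workaround mentioned above: exporting NCCL_IB_DISABLE before launching tells NCCL not to use the InfiniBand/RoCE transport and to fall back to plain TCP sockets, which avoids the error at the cost of IB bandwidth. The launch command is an assumption; substitute your own:

```bash
# Workaround currently in use: disable NCCL's IB transport on every node.
export NCCL_IB_DISABLE=1

# Assumed launch command; replace with your actual entry point and script.
accelerate launch --config_file multi_config.yaml src/train.py test.yaml
```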

System Info

[rank0]:   File "/home/cx/anaconda3/envs/RAG/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3683, in barrier
[rank0]:     work = default_pg.barrier(opts=opts)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, internal error - please report this issue to the NCCL developers, NCCL version 2.20.5
[rank0]: ncclInternalError: Internal check failed.
[rank0]: Last error:
[rank0]: socketFinalizeAccept: wrong type 3 != 4
E0508 21:35:50.822000 135813143611200 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 3699464) of binary: /home/cx/anaconda3/envs/RAG/bin/python
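Not part of the original report, but a common way to narrow down this kind of NCCL transport failure is to rerun with NCCL's debug logging enabled and, if necessary, pin the interface NCCL should use. A sketch (interface name is an example only):

```bash
# Print NCCL's transport selection and connection-setup details to stderr.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# Optionally restrict NCCL to a specific network interface.
# export NCCL_SOCKET_IFNAME=eth0

# Assumed launch command; replace with your actual one.
accelerate launch --config_file multi_config.yaml src/train.py test.yaml
```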

Others

No response

mumu029 · May 08 '24 14:05