[BUG] Multi-node fine-tuning over Thunderbolt hangs with no error
Describe the bug
I want to perform multi-node fine-tuning with DeepSpeed and LLaMA Factory, using ZeRO stage 3 with both optimizer and parameters offloaded to CPU. The two nodes are connected over both LAN and Thunderbolt, and SSH works over either link. Fine-tuning runs normally when the nodes communicate over LAN, but when I switch to Thunderbolt it gets stuck, and the program prints no error messages. Has anyone encountered the same issue?
To Reproduce
package versions:
- torch=2.2.2+cu121
- torchaudio=2.2.2+cu121
- torchvision=0.17.2+cu121
- deepspeed=0.13.4
- transformers=4.39.3
system info:
- system 1:
  - MB: Gigabyte Z790 XTREME
  - GPU: RTX 4090
- system 2:
  - MB: Gigabyte Z790 XTREME
  - GPU: RTX 4080
training script:
gpus=1
train_batch=1
seq_len=2048   # cutoff length, matching --cutoff_len 2048 in the launch log below
deepspeed --hostfile=../../hostfile.txt \
--num_gpus $gpus ../../src/train_bash.py \
--deepspeed ds_z3_offload_cpu_cpu_config.json \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Llama-2-7b-chat-hf \
--dataset alpaca_gpt4_en \
--dataset_dir ../../data \
--template default \
--finetuning_type lora \
--output_dir ../../saves/llama3_lorafull/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len $seq_len \
--lr_scheduler_type cosine \
--logging_steps 10 \
--per_device_train_batch_size $train_batch \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--eval_steps 100 \
--evaluation_strategy steps \
--learning_rate 5e-5 \
--num_train_epochs 1.0 \
--lora_target q_proj,v_proj \
--lora_r 16 \
--lora_alpha 32 \
--lora_dropout 0.05 \
--max_samples 3000 \
--val_size 0.1 \
--plot_loss \
--bf16 2>&1 | tee $filename
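The NCCL-related environment variables that appear in the pdsh command in the log further down were exported in the shell before running the script above; the DeepSpeed launcher then forwards them to both workers. A minimal sketch of that preamble (the values mirror the log; thunderbolt0 is the Thunderbolt network interface name on these machines):

```bash
# Exported before launching deepspeed; the launcher forwards these to both nodes
# (they show up in the pdsh command in the log below).
export NCCL_SOCKET_IFNAME=thunderbolt0   # pin NCCL traffic to the Thunderbolt interface
export NCCL_P2P_DISABLE=1                # disable CUDA peer-to-peer transfers
export NCCL_IB_DISABLE=1                 # disable the InfiniBand transport (no IB hardware here)
export NCCL_DEBUG=DEBUG                  # note: NCCL's documented levels are WARN/INFO/TRACE,
                                         # so INFO would likely give more useful output for a hang
```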
hostfile:
192.168.1.1 slots=1
192.168.1.2 slots=1
offload config (ds_z3_offload_cpu_cpu_config.json):
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "nvme_path": "/media/user/nvme0",
      "pin_memory": true,
      "ratio": 1,
      "buffer_count": 5,
      "fast_init": false
    },
    "offload_param": {
      "device": "cpu",
      "nvme_path": "/media/user/nvme0",
      "pin_memory": true,
      "buffer_count": 5,
      "buffer_size": 1e8,
      "max_in_cpu": 1e9
    },
    "round_robin_gradients": true,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 0,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
LOG (the program hangs on the last line below):
[2024-07-11 11:16:24,229] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-11 11:16:24,855] [INFO] [runner.py:463:main] Using IP address of 172.17.0.1 for node 192.168.1.1
[2024-07-11 11:16:24,856] [INFO] [multinode_runner.py:80:get_cmd] Running on the following workers: 192.168.1.1,192.168.1.2
[2024-07-11 11:16:24,856] [INFO] [runner.py:568:main] cmd = pdsh -S -f 1024 -w 192.168.1.1,192.168.1.2 export NCCL_P2P_DISABLE=1; export NCCL_SOCKET_IFNAME=thunderbolt0,thunderbolt0; export NCCL_DEBUG=DEBUG; export NCCL_IB_DISABLE=1; export PYTHONPATH=/home/trx50/Project/factory/LLaMA-Factory/examples/full_multi_gpu; cd /home/trx50/Project/factory/LLaMA-Factory/examples/full_multi_gpu; /home/trx50/.virtualenvs/factory/bin/python -u -m deepspeed.launcher.launch --world_info=eyIxOTIuMTY4LjEuMSI6IFswXSwgIjE5Mi4xNjguMS4yIjogWzBdfQ== --node_rank=%n --master_addr=172.17.0.1 --master_port=29500 ../../src/train_bash.py --deepspeed ds_z3_offload_cpu_cpu_config.json --stage sft --do_train --model_name_or_path meta-llama/Llama-2-7b-chat-hf --dataset alpaca_gpt4_en --dataset_dir ../../data --template default --finetuning_type lora --output_dir ../../saves/llama3_lorafull/sft --overwrite_cache --overwrite_output_dir --cutoff_len 2048 --lr_scheduler_type cosine --logging_steps 10 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 --eval_steps 100 --evaluation_strategy steps --learning_rate 5e-5 --num_train_epochs 1.0 --lora_target q_proj,v_proj --lora_r 16 --lora_alpha 32 --lora_dropout 0.05 --max_samples 3000 --val_size 0.1 --plot_loss --bf16
192.168.1.1: [2024-07-11 11:16:25,980] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
192.168.1.2: [2024-05-18 06:57:44,090] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
192.168.1.1: [2024-07-11 11:16:26,210] [INFO] [launch.py:138:main] 0 NCCL_P2P_DISABLE=1
192.168.1.1: [2024-07-11 11:16:26,210] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=thunderbolt0,thunderbolt0
192.168.1.1: [2024-07-11 11:16:26,210] [INFO] [launch.py:138:main] 0 NCCL_DEBUG=DEBUG
192.168.1.1: [2024-07-11 11:16:26,210] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=1
192.168.1.1: [2024-07-11 11:16:26,210] [INFO] [launch.py:145:main] WORLD INFO DICT: {'192.168.1.1': [0], '192.168.1.2': [0]}
192.168.1.1: [2024-07-11 11:16:26,210] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=0
192.168.1.1: [2024-07-11 11:16:26,210] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'192.168.1.1': [0], '192.168.1.2': [1]})
192.168.1.1: [2024-07-11 11:16:26,210] [INFO] [launch.py:163:main] dist_world_size=2
192.168.1.1: [2024-07-11 11:16:26,210] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
192.168.1.1: [2024-07-11 11:16:26,210] [INFO] [launch.py:253:main] process 22275 spawned with command: ['/home/trx50/.virtualenvs/factory/bin/python', '-u', '../../src/train_bash.py', '--local_rank=0', '--deepspeed', 'ds_z3_offload_cpu_cpu_config.json', '--stage', 'sft', '--do_train', '--model_name_or_path', 'meta-llama/Llama-2-7b-chat-hf', '--dataset', 'alpaca_gpt4_en', '--dataset_dir', '../../data', '--template', 'default', '--finetuning_type', 'lora', '--output_dir', '../../saves/llama3_lorafull/sft', '--overwrite_cache', '--overwrite_output_dir', '--cutoff_len', '2048', '--lr_scheduler_type', 'cosine', '--logging_steps', '10', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--eval_steps', '100', '--evaluation_strategy', 'steps', '--learning_rate', '5e-5', '--num_train_epochs', '1.0', '--lora_target', 'q_proj,v_proj', '--lora_r', '16', '--lora_alpha', '32', '--lora_dropout', '0.05', '--max_samples', '3000', '--val_size', '0.1', '--plot_loss', '--bf16']
192.168.1.2: [2024-05-18 06:57:44,366] [INFO] [launch.py:138:main] 1 NCCL_P2P_DISABLE=1
192.168.1.2: [2024-05-18 06:57:44,366] [INFO] [launch.py:138:main] 1 NCCL_SOCKET_IFNAME=thunderbolt0,thunderbolt0
192.168.1.2: [2024-05-18 06:57:44,366] [INFO] [launch.py:138:main] 1 NCCL_DEBUG=DEBUG
192.168.1.2: [2024-05-18 06:57:44,366] [INFO] [launch.py:138:main] 1 NCCL_IB_DISABLE=1
192.168.1.2: [2024-05-18 06:57:44,366] [INFO] [launch.py:145:main] WORLD INFO DICT: {'192.168.1.1': [0], '192.168.1.2': [0]}
192.168.1.2: [2024-05-18 06:57:44,366] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=1
192.168.1.2: [2024-05-18 06:57:44,366] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'192.168.1.1': [0], '192.168.1.2': [1]})
192.168.1.2: [2024-05-18 06:57:44,366] [INFO] [launch.py:163:main] dist_world_size=2
192.168.1.2: [2024-05-18 06:57:44,366] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
192.168.1.2: [2024-05-18 06:57:44,366] [INFO] [launch.py:253:main] process 11393 spawned with command: ['/home/trx50/.virtualenvs/factory/bin/python', '-u', '../../src/train_bash.py', '--local_rank=0', '--deepspeed', 'ds_z3_offload_cpu_cpu_config.json', '--stage', 'sft', '--do_train', '--model_name_or_path', 'meta-llama/Llama-2-7b-chat-hf', '--dataset', 'alpaca_gpt4_en', '--dataset_dir', '../../data', '--template', 'default', '--finetuning_type', 'lora', '--output_dir', '../../saves/llama3_lorafull/sft', '--overwrite_cache', '--overwrite_output_dir', '--cutoff_len', '2048', '--lr_scheduler_type', 'cosine', '--logging_steps', '10', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--eval_steps', '100', '--evaluation_strategy', 'steps', '--learning_rate', '5e-5', '--num_train_epochs', '1.0', '--lora_target', 'q_proj,v_proj', '--lora_r', '16', '--lora_alpha', '32', '--lora_dropout', '0.05', '--max_samples', '3000', '--val_size', '0.1', '--plot_loss', '--bf16']
192.168.1.2: [2024-05-18 06:57:45,536] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
192.168.1.1: [2024-07-11 11:16:27,773] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
192.168.1.1: [2024-07-11 11:16:28,240] [INFO] [comm.py:637:init_distributed] cdb=None
192.168.1.1: [2024-07-11 11:16:28,241] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
192.168.1.2: [2024-05-18 06:57:46,373] [INFO] [comm.py:637:init_distributed] cdb=None
Expected behavior
The training executes normally and produces results.
Hi @Raywang0211 - that's very interesting, but the lack of error messages will make this hard to debug. Are you able to confirm that the SSH connections work fine on their own, without DeepSpeed?
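For example, something along the lines of the sketch below (the IPs are the ones from your hostfile and stand in for whichever link you are testing) would exercise SSH and pdsh the same way the launcher does, without DeepSpeed in the loop:

```bash
# Check reachability and password-less SSH for each host in the hostfile,
# run from the node that launches deepspeed.
for host in 192.168.1.1 192.168.1.2; do
    ping -c 1 -W 2 "$host" > /dev/null && echo "$host: ping ok" || echo "$host: ping FAILED"
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" hostname \
        && echo "$host: ssh ok" || echo "$host: ssh FAILED"
done

# DeepSpeed drives the workers through pdsh, so this should also succeed:
pdsh -w 192.168.1.1,192.168.1.2 hostname
```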
Hi @loadams: I just found out that the root cause of this issue was the motherboard. After swapping in a different motherboard, the problem was solved. But I found another interesting issue: the results of multi-node training over 10G LAN versus 40G Thunderbolt look strange. I ran two experiments. First, I measured transfer speed over the 10G LAN and the 40G Thunderbolt link with iperf (the rough invocation is sketched after the table). The results are below:
| connection | spec transfer speed (Gbps) | iperf3 measured (Gbps) |
|---|---|---|
| Thunderbolt | 40 | 16.2 |
| LAN | 10 | 9.37 |
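The iperf numbers above came from a plain client/server run between the two nodes; the exact flags are not recorded here, but it was roughly the following (the address is the peer's IP on whichever link is being tested):

```bash
# On the receiving node, start an iperf3 server:
iperf3 -s

# On the sending node, run the client against the peer's address on the link
# under test (Thunderbolt-side or LAN-side IP), e.g. for a 30-second run:
iperf3 -c 192.168.1.2 -t 30
```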
Second, I tested multi-node training over LAN and over Thunderbolt, with the following results:
| connection | batch_size | total_time | tokens/sec |
|---|---|---|---|
| 10G LAN | 18 | 148.37 | 3533.6 |
| 40G Thunderbolt | 18 | 168 | 3124.48 |
I have two questions:
- Why do the results look normal over 10G LAN but strange over 40G Thunderbolt?
- Why is Thunderbolt's measured transfer speed higher, yet the final tokens/sec lower than over LAN? During training, tx/rx on both the 10G LAN and the 40G Thunderbolt link sit at about 1.01G (measured roughly as in the sketch below).
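For reference, the per-interface throughput quoted above can be watched during training with something like the sketch below; the interface name is an assumption (thunderbolt0 as in the launch log, or the 10G NIC's name), and the byte counters come straight from sysfs:

```bash
# Print approximate received throughput on one interface once per second.
IFACE=thunderbolt0
prev=$(cat /sys/class/net/$IFACE/statistics/rx_bytes)
while sleep 1; do
    cur=$(cat /sys/class/net/$IFACE/statistics/rx_bytes)
    echo "rx: $(( (cur - prev) * 8 / 1000000 )) Mbit/s"
    prev=$cur
done
```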
Hi all 👋 I came to this thread while also trying to set up a small cluster for distributed training, and I wanted to use Thunderbolt instead of Ethernet because of the theoretical speed gains.
From what I learned, the speed is capped by the driver: thunderbolt_net currently exposes the Thunderbolt link as a virtual 10 GbE interface, not as raw PCIe. That means the maximum TCP bandwidth is roughly what 10 GbE can do, and ~12-16 Gbps is normal here. So even though USB4 is physically capable of 40 Gbps, the Linux Ethernet driver cannot fully saturate it.
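If you want to double-check this on your own machines, a quick sanity check could look like the following; the interface name thunderbolt0 is taken from the launch log above and may differ on other systems:

```bash
# Which kernel driver backs the interface, and what link speed it reports:
ethtool -i thunderbolt0    # should name the thunderbolt-net driver
ethtool thunderbolt0       # the reported "Speed:" is the virtual link, not the raw 40 Gbps
lsmod | grep thunderbolt   # confirms the thunderbolt / thunderbolt_net modules are loaded
```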
There is, however, a theoretical way to improve this: use vfio to expose the GPU of host B directly to host A, so that from NCCL's perspective it is reachable over P2P/PCIe rather than TCP. Of course, this means host A gains a second GPU that NCCL sees as a local device, so you lose the distributed aspect of the setup (you won't be able to offload to host B's RAM or NVMe drives, or use its compute resources, etc.).
I haven't tried it out yet and was looking around to see whether someone has done it successfully.
Another theoretical, experimental option is a hybrid: set up the USB4/Thunderbolt port as a PCIe tunnel so that host B's GPU shows up on host A (giving you one node with two GPUs instead of one), and keep a separate TCP link between the nodes (in your case the 10 Gbps Ethernet, which is already good by the way) for the multi-node setup and the associated gains in memory (for offloading) and compute.
Unfortunately I can't confirm or try this on my setup, because I'm limited by the 2.5 Gbps ports on both my nodes and my switch.
I believe there's also a way to aggregate both NICs, so that traffic falls back to Ethernet if the Thunderbolt link is busy, or even to use multipath TCP when multiple IPs are reachable, improving throughput and redundancy. This is more applicable to your case, since you have two roughly 10 Gbps-capable links; a rough sketch of the failover variant follows.
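For the failover flavour, a bond in active-backup mode is the simplest thing to try. A minimal sketch with NetworkManager, where the interface names (thunderbolt0, enp6s0), connection names, and address are placeholders; note that active-backup only gives redundancy, and for combined throughput you would look at the balance modes or MPTCP instead:

```bash
# Hypothetical active-backup bond over the Thunderbolt and Ethernet links.
nmcli con add type bond ifname bond0 con-name bond0 \
      bond.options "mode=active-backup,primary=thunderbolt0,miimon=100"
nmcli con add type bond-slave ifname thunderbolt0 master bond0
nmcli con add type bond-slave ifname enp6s0 master bond0
nmcli con mod bond0 ipv4.method manual ipv4.addresses 192.168.1.1/24
nmcli con up bond0
```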
I hope this helps. But if you have already solved this, please do tell 🙏.