[BUG] Multi-node fine-tuning over Thunderbolt hangs with no error
Describe the bug
I want to perform multi-node fine-tuning with DeepSpeed and LLaMA Factory, using ZeRO stage 3 with both optimizer and parameters offloaded to CPU. The two nodes are connected over both LAN and Thunderbolt, and SSH works over either link. Fine-tuning runs normally when the nodes communicate over LAN, but when I switch to Thunderbolt it gets stuck, and the program prints no error messages. Has anyone encountered the same issue?
To Reproduce
package versions:
- torch=2.2.2+cu121
- torchaudio=2.2.2+cu121
- torchvision=0.17.2+cu121
- deepspeed=0.13.4
- transformers=4.39.3
system info:
- system 1:
  - MB: Gigabyte Z790 XTREME
  - GPU: RTX 4090
- system 2:
  - MB: Gigabyte Z790 XTREME
  - GPU: RTX 4080
training script:
gpus=1
train_batch=1
seq_len=2048   # cutoff length, matching --cutoff_len 2048 in the launch log below
deepspeed --hostfile=../../hostfile.txt \
--num_gpus $gpus ../../src/train_bash.py \
--deepspeed ds_z3_offload_cpu_cpu_config.json \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Llama-2-7b-chat-hf \
--dataset alpaca_gpt4_en \
--dataset_dir ../../data \
--template default \
--finetuning_type lora \
--output_dir ../../saves/llama3_lorafull/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len $seq_len \
--lr_scheduler_type cosine \
--logging_steps 10 \
--per_device_train_batch_size $train_batch \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--eval_steps 100 \
--evaluation_strategy steps \
--learning_rate 5e-5 \
--num_train_epochs 1.0 \
--lora_target q_proj,v_proj \
--lora_r 16 \
--lora_alpha 32 \
--lora_dropout 0.05 \
--max_samples 3000 \
--val_size 0.1 \
--plot_loss \
--bf16 2>&1 | tee $filename
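The NCCL-related environment variables that appear in the pdsh command in the log further down were exported in the shell before running the script above; the DeepSpeed launcher then forwards them to both workers. A minimal sketch of that preamble (the values mirror the log; thunderbolt0 is the Thunderbolt network interface name on these machines):

```bash
# Exported before launching deepspeed; the launcher forwards these to both nodes
# (they show up in the pdsh command in the log below).
export NCCL_SOCKET_IFNAME=thunderbolt0   # pin NCCL traffic to the Thunderbolt interface
export NCCL_P2P_DISABLE=1                # disable CUDA peer-to-peer transfers
export NCCL_IB_DISABLE=1                 # disable the InfiniBand transport (no IB hardware here)
export NCCL_DEBUG=DEBUG                  # note: NCCL's documented levels are WARN/INFO/TRACE,
                                         # so INFO would likely give more useful output for a hang
```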
hostfile:
192.168.1.1 slots=1
192.168.1.2 slots=1
offload config (ds_z3_offload_cpu_cpu_config.json):
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "nvme_path": "/media/user/nvme0",
      "pin_memory": true,
      "ratio": 1,
      "buffer_count": 5,
      "fast_init": false
    },
    "offload_param": {
      "device": "cpu",
      "nvme_path": "/media/user/nvme0",
      "pin_memory": true,
      "buffer_count": 5,
      "buffer_size": 1e8,
      "max_in_cpu": 1e9
    },
    "round_robin_gradients": true,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 0,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
LOG (the program hangs on the last line below):
[2024-07-11 11:16:24,229] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-11 11:16:24,855] [INFO] [runner.py:463:main] Using IP address of 172.17.0.1 for node 192.168.1.1
[2024-07-11 11:16:24,856] [INFO] [multinode_runner.py:80:get_cmd] Running on the following workers: 192.168.1.1,192.168.1.2
[2024-07-11 11:16:24,856] [INFO] [runner.py:568:main] cmd = pdsh -S -f 1024 -w 192.168.1.1,192.168.1.2 export NCCL_P2P_DISABLE=1; export NCCL_SOCKET_IFNAME=thunderbolt0,thunderbolt0; export NCCL_DEBUG=DEBUG; export NCCL_IB_DISABLE=1; export PYTHONPATH=/home/trx50/Project/factory/LLaMA-Factory/examples/full_multi_gpu; cd /home/trx50/Project/factory/LLaMA-Factory/examples/full_multi_gpu; /home/trx50/.virtualenvs/factory/bin/python -u -m deepspeed.launcher.launch --world_info=eyIxOTIuMTY4LjEuMSI6IFswXSwgIjE5Mi4xNjguMS4yIjogWzBdfQ== --node_rank=%n --master_addr=172.17.0.1 --master_port=29500 ../../src/train_bash.py --deepspeed ds_z3_offload_cpu_cpu_config.json --stage sft --do_train --model_name_or_path meta-llama/Llama-2-7b-chat-hf --dataset alpaca_gpt4_en --dataset_dir ../../data --template default --finetuning_type lora --output_dir ../../saves/llama3_lorafull/sft --overwrite_cache --overwrite_output_dir --cutoff_len 2048 --lr_scheduler_type cosine --logging_steps 10 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 --eval_steps 100 --evaluation_strategy steps --learning_rate 5e-5 --num_train_epochs 1.0 --lora_target q_proj,v_proj --lora_r 16 --lora_alpha 32 --lora_dropout 0.05 --max_samples 3000 --val_size 0.1 --plot_loss --bf16
192.168.1.1: [2024-07-11 11:16:25,980] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
192.168.1.2: [2024-05-18 06:57:44,090] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
192.168.1.1: [2024-07-11 11:16:26,210] [INFO] [launch.py:138:main] 0 NCCL_P2P_DISABLE=1
192.168.1.1: [2024-07-11 11:16:26,210] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=thunderbolt0,thunderbolt0
192.168.1.1: [2024-07-11 11:16:26,210] [INFO] [launch.py:138:main] 0 NCCL_DEBUG=DEBUG
192.168.1.1: [2024-07-11 11:16:26,210] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=1
192.168.1.1: [2024-07-11 11:16:26,210] [INFO] [launch.py:145:main] WORLD INFO DICT: {'192.168.1.1': [0], '192.168.1.2': [0]}
192.168.1.1: [2024-07-11 11:16:26,210] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=0
192.168.1.1: [2024-07-11 11:16:26,210] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'192.168.1.1': [0], '192.168.1.2': [1]})
192.168.1.1: [2024-07-11 11:16:26,210] [INFO] [launch.py:163:main] dist_world_size=2
192.168.1.1: [2024-07-11 11:16:26,210] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
192.168.1.1: [2024-07-11 11:16:26,210] [INFO] [launch.py:253:main] process 22275 spawned with command: ['/home/trx50/.virtualenvs/factory/bin/python', '-u', '../../src/train_bash.py', '--local_rank=0', '--deepspeed', 'ds_z3_offload_cpu_cpu_config.json', '--stage', 'sft', '--do_train', '--model_name_or_path', 'meta-llama/Llama-2-7b-chat-hf', '--dataset', 'alpaca_gpt4_en', '--dataset_dir', '../../data', '--template', 'default', '--finetuning_type', 'lora', '--output_dir', '../../saves/llama3_lorafull/sft', '--overwrite_cache', '--overwrite_output_dir', '--cutoff_len', '2048', '--lr_scheduler_type', 'cosine', '--logging_steps', '10', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--eval_steps', '100', '--evaluation_strategy', 'steps', '--learning_rate', '5e-5', '--num_train_epochs', '1.0', '--lora_target', 'q_proj,v_proj', '--lora_r', '16', '--lora_alpha', '32', '--lora_dropout', '0.05', '--max_samples', '3000', '--val_size', '0.1', '--plot_loss', '--bf16']
192.168.1.2: [2024-05-18 06:57:44,366] [INFO] [launch.py:138:main] 1 NCCL_P2P_DISABLE=1
192.168.1.2: [2024-05-18 06:57:44,366] [INFO] [launch.py:138:main] 1 NCCL_SOCKET_IFNAME=thunderbolt0,thunderbolt0
192.168.1.2: [2024-05-18 06:57:44,366] [INFO] [launch.py:138:main] 1 NCCL_DEBUG=DEBUG
192.168.1.2: [2024-05-18 06:57:44,366] [INFO] [launch.py:138:main] 1 NCCL_IB_DISABLE=1
192.168.1.2: [2024-05-18 06:57:44,366] [INFO] [launch.py:145:main] WORLD INFO DICT: {'192.168.1.1': [0], '192.168.1.2': [0]}
192.168.1.2: [2024-05-18 06:57:44,366] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=1
192.168.1.2: [2024-05-18 06:57:44,366] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'192.168.1.1': [0], '192.168.1.2': [1]})
192.168.1.2: [2024-05-18 06:57:44,366] [INFO] [launch.py:163:main] dist_world_size=2
192.168.1.2: [2024-05-18 06:57:44,366] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
192.168.1.2: [2024-05-18 06:57:44,366] [INFO] [launch.py:253:main] process 11393 spawned with command: ['/home/trx50/.virtualenvs/factory/bin/python', '-u', '../../src/train_bash.py', '--local_rank=0', '--deepspeed', 'ds_z3_offload_cpu_cpu_config.json', '--stage', 'sft', '--do_train', '--model_name_or_path', 'meta-llama/Llama-2-7b-chat-hf', '--dataset', 'alpaca_gpt4_en', '--dataset_dir', '../../data', '--template', 'default', '--finetuning_type', 'lora', '--output_dir', '../../saves/llama3_lorafull/sft', '--overwrite_cache', '--overwrite_output_dir', '--cutoff_len', '2048', '--lr_scheduler_type', 'cosine', '--logging_steps', '10', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--eval_steps', '100', '--evaluation_strategy', 'steps', '--learning_rate', '5e-5', '--num_train_epochs', '1.0', '--lora_target', 'q_proj,v_proj', '--lora_r', '16', '--lora_alpha', '32', '--lora_dropout', '0.05', '--max_samples', '3000', '--val_size', '0.1', '--plot_loss', '--bf16']
192.168.1.2: [2024-05-18 06:57:45,536] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
192.168.1.1: [2024-07-11 11:16:27,773] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
192.168.1.1: [2024-07-11 11:16:28,240] [INFO] [comm.py:637:init_distributed] cdb=None
192.168.1.1: [2024-07-11 11:16:28,241] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
192.168.1.2: [2024-05-18 06:57:46,373] [INFO] [comm.py:637:init_distributed] cdb=None
Expected behavior
The training executes normally and produces results.
Hi @Raywang0211 - that's very interesting, but the lack of error messages will make this hard to debug. Are you able to confirm that the SSH connections work fine on their own, without DeepSpeed?
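For example, something along the lines of the sketch below (the IPs are the ones from your hostfile and stand in for whichever link you are testing) would exercise SSH and pdsh the same way the launcher does, without DeepSpeed in the loop:

```bash
# Check reachability and password-less SSH for each host in the hostfile,
# run from the node that launches deepspeed.
for host in 192.168.1.1 192.168.1.2; do
    ping -c 1 -W 2 "$host" > /dev/null && echo "$host: ping ok" || echo "$host: ping FAILED"
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" hostname \
        && echo "$host: ssh ok" || echo "$host: ssh FAILED"
done

# DeepSpeed drives the workers through pdsh, so this should also succeed:
pdsh -w 192.168.1.1,192.168.1.2 hostname
```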
Hi @loadams: I just found out that the root cause of this issue was the motherboard. After swapping in a different motherboard, the problem was solved. But I found another interesting issue: the results of multi-node training over 10G LAN versus 40G Thunderbolt look strange. I ran two experiments. First, I measured transfer speed over the 10G LAN and the 40G Thunderbolt link with iperf (the rough invocation is sketched after the table). The results are below:
| connection | spec transfer speed (Gbps) | iperf3 measured (Gbps) |
|---|---|---|
| Thunderbolt | 40 | 16.2 |
| LAN | 10 | 9.37 |
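The iperf numbers above came from a plain client/server run between the two nodes; the exact flags are not recorded here, but it was roughly the following (the address is the peer's IP on whichever link is being tested):

```bash
# On the receiving node, start an iperf3 server:
iperf3 -s

# On the sending node, run the client against the peer's address on the link
# under test (Thunderbolt-side or LAN-side IP), e.g. for a 30-second run:
iperf3 -c 192.168.1.2 -t 30
```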
Second, I tested multi-node training over LAN and over Thunderbolt, with the following results:
| connection | batch_size | total_time | tokens/sec |
|---|---|---|---|
| 10G LAN | 18 | 148.37 | 3533.6 |
| 40G Thunderbolt | 18 | 168 | 3124.48 |
I have two questions:
- Why do the results look normal over 10G LAN but strange over 40G Thunderbolt?
- Why is Thunderbolt's measured transfer speed higher, yet the final tokens/sec lower than over LAN? During training, tx/rx on both the 10G LAN and the 40G Thunderbolt link sit at about 1.01G (measured roughly as in the sketch below).
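For reference, the per-interface throughput quoted above can be watched during training with something like the sketch below; the interface name is an assumption (thunderbolt0 as in the launch log, or the 10G NIC's name), and the byte counters come straight from sysfs:

```bash
# Print approximate received throughput on one interface once per second.
IFACE=thunderbolt0
prev=$(cat /sys/class/net/$IFACE/statistics/rx_bytes)
while sleep 1; do
    cur=$(cat /sys/class/net/$IFACE/statistics/rx_bytes)
    echo "rx: $(( (cur - prev) * 8 / 1000000 )) Mbit/s"
    prev=$cur
done
```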
Hi all 👋 I came to this thread while also trying to set up a small cluster for distributed training, and I wanted to use Thunderbolt instead of Ethernet because of the theoretical speed gains.
From what I learned, the speed is capped by the driver: thunderbolt_net currently exposes the Thunderbolt link as a virtual 10 GbE interface, not as raw PCIe. That means the maximum TCP bandwidth is roughly what 10 GbE can do, and ~12-16 Gbps is normal here. So even though USB4 is physically capable of 40 Gbps, the Linux Ethernet driver cannot fully saturate it.
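If you want to double-check this on your own machines, a quick sanity check could look like the following; the interface name thunderbolt0 is taken from the launch log above and may differ on other systems:

```bash
# Which kernel driver backs the interface, and what link speed it reports:
ethtool -i thunderbolt0    # should name the thunderbolt-net driver
ethtool thunderbolt0       # the reported "Speed:" is the virtual link, not the raw 40 Gbps
lsmod | grep thunderbolt   # confirms the thunderbolt / thunderbolt_net modules are loaded
```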
There is, however, a theoretical way to improve this: use vfio to expose the GPU of host B directly to host A, so that from NCCL's perspective it is reachable over P2P/PCIe rather than TCP. Of course, this means host A gains a second GPU that NCCL sees as a local device, so you lose the distributed aspect of the setup (you won't be able to offload to host B's RAM or NVMe drives, or use its compute resources, etc.).
I haven't tried it out yet and was looking around to see whether someone has done it successfully.
Another theoretical, experimental option is a hybrid: set up the USB4/Thunderbolt port as a PCIe tunnel so that host B's GPU shows up on host A (giving you one node with two GPUs instead of one), and keep a separate TCP link between the nodes (in your case the 10 Gbps Ethernet, which is already good by the way) for the multi-node setup and the associated gains in memory (for offloading) and compute.
Unfortunately I can't confirm or try this on my setup, because I'm limited by the 2.5 Gbps ports on both my nodes and my switch.
I believe there's also a way to aggregate both NICs, so that traffic falls back to Ethernet if the Thunderbolt link is busy, or even to use multipath TCP when multiple IPs are reachable, improving throughput and redundancy. This is more applicable to your case, since you have two roughly 10 Gbps-capable links; a rough sketch of the failover variant follows.
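For the failover flavour, a bond in active-backup mode is the simplest thing to try. A minimal sketch with NetworkManager, where the interface names (thunderbolt0, enp6s0), connection names, and address are placeholders; note that active-backup only gives redundancy, and for combined throughput you would look at the balance modes or MPTCP instead:

```bash
# Hypothetical active-backup bond over the Thunderbolt and Ethernet links.
nmcli con add type bond ifname bond0 con-name bond0 \
      bond.options "mode=active-backup,primary=thunderbolt0,miimon=100"
nmcli con add type bond-slave ifname thunderbolt0 master bond0
nmcli con add type bond-slave ifname enp6s0 master bond0
nmcli con mod bond0 ipv4.method manual ipv4.addresses 192.168.1.1/24
nmcli con up bond0
```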
I hope this helps. But if you have already solved this, please do tell 🙏.