LLaMA-Factory
Issue with LLaMA3 SFT on multiple nodes
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
When executing the following script with Meta-Llama-3-8B, it seems that DeepSpeed cannot be imported correctly on the child nodes.
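# Launch script, executed once per node under SLURM; $1 sets NPROC_PER_NODE (typically the number of GPUs per node)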
MASTER_PORT=25001
NPROC_PER_NODE=$1
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
echo "Configuration for distributed training:"
echo "MASTER_ADDR: $MASTER_ADDR"
echo "MASTER_PORT: $MASTER_PORT"
echo "NPROC_PER_NODE: $NPROC_PER_NODE"
echo "SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST"
echo "Number of nodes: $SLURM_JOB_NUM_NODES"
echo "Node rank: $SLURM_PROCID"
python -m torch.distributed.run \
--nproc_per_node $NPROC_PER_NODE \
--nnodes $SLURM_JOB_NUM_NODES \
--node_rank $SLURM_PROCID \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT \
src/train_bash.py \
--deepspeed deepspeed3.json \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-8B \
--dataset Temp_ST1_sub \
--template default \
--streaming \
--finetuning_type full \
--output_dir saves/Temp111_ST1_lm38b_mn \
--overwrite_cache \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 200 \
--save_steps 400 \
--learning_rate 2.0e-5 \
--num_train_epochs 4.0 \
--max_steps 60000 \
--ddp_timeout 1800000 \
--plot_loss \
--bf16 \
--dispatch_batches False \
--ignore_data_skip
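For reference, a minimal sketch of how such a script might be launched so that it runs once per node (the script name run_sft.sh, the node count, and the GPU count below are assumptions, not the exact command used here):

# One launcher task per node, so SLURM_PROCID gives the node rank passed to --node_rank
srun --nodes=2 --ntasks-per-node=1 --gres=gpu:8 bash run_sft.sh 8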
Error message (excerpt of the traceback):
line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 62, in <module>
import deepspeed
File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/__init__.py", line 25, in <module>
from . import ops
File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/ops/__init__.py", line 15, in <module>
from ..git_version_info import compatible_ops as __compatible_ops__
File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/git_version_info.py", line 29, in <module>
op_compatible = builder.is_compatible()
File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/ops/op_builder/fp_quantizer.py", line 29, in is_compatible
sys_cuda_major, _ = installed_cuda_version()
File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 50, in installed_cuda_version
raise MissingCUDAException("CUDA_HOME does not exist, unable to compile CUDA op(s)")
deepspeed.ops.op_builder.builder.MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)
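The traceback shows DeepSpeed's op builder aborting because CUDA_HOME is not set in the environment seen by the processes on the child nodes. A quick check to confirm what each allocated node actually sees (a sketch; adjust the srun options to the allocation):

# Print CUDA_HOME and the nvcc location once per allocated node
srun --ntasks-per-node=1 bash -c 'echo "$(hostname): CUDA_HOME=$CUDA_HOME nvcc=$(which nvcc)"'
# What PyTorch resolves as the CUDA root on the local node
python -c "from torch.utils.cpp_extension import CUDA_HOME; print(CUDA_HOME)"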
Expected behavior
- The script works fine on a single node, but yields the above error when nnodes >= 2 (a tentative workaround sketch is given below).
- The same script works fine with other models such as LLaMA2 for any number of nodes (nnodes >= 1).
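A tentative workaround, assuming the CUDA toolkit is installed at the same location on every node (the path below is an assumption and must be adjusted to the actual installation), is to export CUDA_HOME near the top of the launch script so the processes spawned on the child nodes inherit it:

# Assumed CUDA toolkit location; replace with the real path on the compute nodes
export CUDA_HOME=/usr/local/cuda
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"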
System Info
- transformers version: 4.40.0
- Platform: Linux-5.15.0-1032-oracle-x86_64-with-glibc2.35
- Python version: 3.10.14
- Huggingface_hub version: 0.22.2
- Safetensors version: 0.4.3
- Accelerate version: 0.29.3
- PyTorch version (GPU?): 2.2.2+cu121 (True)
- DeepSpeed version: 0.14.1
Others
No response