
Issues of LLaMA3 SFT on multi-nodes

Open · Liusifei opened this issue 10 months ago · 0 comments

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

When executing the following with Meta-Llama-3-8B, it seems that DeepSpeed cannot be imported correctly on the child nodes.

MASTER_PORT=25001
NPROC_PER_NODE=$1   # GPUs per node, passed as the first script argument
# Use the first hostname in the SLURM allocation as the rendezvous master
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr

echo "Configuration for distributed training:"
echo "MASTER_ADDR: $MASTER_ADDR"
echo "MASTER_PORT: $MASTER_PORT"
echo "NPROC_PER_NODE: $NPROC_PER_NODE"
echo "SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST"
echo "Number of nodes: $SLURM_JOB_NUM_NODES"
echo "Node rank: $SLURM_PROCID"

python -m torch.distributed.run \
    --nproc_per_node $NPROC_PER_NODE \
    --nnodes $SLURM_JOB_NUM_NODES \
    --node_rank $SLURM_PROCID \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    src/train_bash.py \
    --deepspeed deepspeed3.json \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Meta-Llama-3-8B \
    --dataset Temp_ST1_sub \
    --template default \
    --streaming \
    --finetuning_type full \
    --output_dir saves/Temp111_ST1_lm38b_mn \
    --overwrite_cache \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 200 \
    --save_steps 400 \
    --learning_rate 2.0e-5 \
    --num_train_epochs 4.0 \
    --max_steps 60000 \
    --ddp_timeout 1800000 \
    --plot_loss \
    --bf16 \
    --dispatch_batches False \
    --ignore_data_skip
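
For reference, the per-node environment can be checked before launching (a diagnostic sketch, assuming srun is usable inside the allocation):

srun --nodes="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 \
    bash -c 'echo "$(hostname): CUDA_HOME=${CUDA_HOME:-<unset>}, nvcc=$(command -v nvcc || echo not-found)"'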

Error message snapshot:

line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 62, in <module>
    import deepspeed
  File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/__init__.py", line 25, in <module>
    from . import ops
  File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/ops/__init__.py", line 15, in <module>
    from ..git_version_info import compatible_ops as __compatible_ops__
  File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/git_version_info.py", line 29, in <module>
    op_compatible = builder.is_compatible()
  File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/ops/op_builder/fp_quantizer.py", line 29, in is_compatible
    sys_cuda_major, _ = installed_cuda_version()
  File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 50, in installed_cuda_version
    raise MissingCUDAException("CUDA_HOME does not exist, unable to compile CUDA op(s)")
deepspeed.ops.op_builder.builder.MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)
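
The exception is raised while DeepSpeed probes its CUDA op builders at import time, which fails when no CUDA toolkit (CUDA_HOME) is visible on the node. A sketch of a possible workaround (the /usr/local/cuda path is an assumption; substitute this cluster's actual CUDA install or module) is to export the toolkit location on every node before the launch command:

export CUDA_HOME=/usr/local/cuda   # assumed path; adjust to the cluster's CUDA toolkit
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"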

Expected behavior

  1. This script works fine on a single node, but yields the above error when nnodes >= 2.
  2. The same script works fine with other models such as LLaMA-2, for any nnodes >= 1.

System Info

  • transformers version: 4.40.0
  • Platform: Linux-5.15.0-1032-oracle-x86_64-with-glibc2.35
  • Python version: 3.10.14
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.3
  • Accelerate version: 0.29.3
  • PyTorch version (GPU?): 2.2.2+cu121 (True)

  • deepspeed version: 0.14.1

Others

No response

Liusifei · Apr 22 '24 19:04