preprocessing_num_workers cannot be greater than 1 when preprocessing the dataset for multi-node, multi-GPU SFT on Slurm

Open bin123apple opened this issue 1 year ago • 0 comments

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

Could someone please take a look at this issue:

Problem description: When running multi-node, multi-GPU SFT on Slurm, as soon as preprocessing_num_workers > 1 the program hangs at this step in /LLaMA-Factory/src/llmtuner/train/sft/workflow.py: dataset = get_dataset(tokenizer, model_args, data_args, training_args, stage="sft").
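
As far as I can tell, get_dataset ends up calling the datasets library's Dataset.map with num_proc set to preprocessing_num_workers, so a minimal standalone sketch of the kind of call that hangs looks roughly like the following (the data and the fake_tokenize function are placeholders of mine, not LLaMA-Factory code):

# minimal_map_repro.py -- standalone sketch, not part of LLaMA-Factory
from datasets import Dataset

def fake_tokenize(example):
    # Stand-in for the real tokenization; as far as I can tell the hang does
    # not depend on what the preprocessing function actually does.
    return {"length": len(example["text"])}

if __name__ == "__main__":
    ds = Dataset.from_dict({"text": ["hello world"] * 1000})
    # num_proc > 1 here corresponds to --preprocessing_num_workers > 1.
    ds = ds.map(fake_tokenize, num_proc=4)
    print(ds)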

Hardware: 8 nodes, each with 4x 80GB A100 GPUs.

Training task: full-parameter SFT of a 33B model with a custom, expanded vocabulary. The dataset contains roughly 100K samples.

Training scripts. Slurm script:

#!/bin/bash
#SBATCH -N 8
#SBATCH -C gpu&hbm80g
#SBATCH -G 32
#SBATCH -q regular
#SBATCH -J model_training
#SBATCH --mail-type=ALL
#SBATCH -t 24:00:00

# Load conda  
echo "loading conda..."
module load conda 
conda activate llama_factory

# Huggingface Setting 
echo "Setting Huggingface..."
export HF_HOME=$SCRATCH/huggingface 
export HF_TOKEN=<HF/Token>

# OpenMP settings:
echo "Setting OMP..."
export OMP_NUM_THREADS=1
export OMP_PLACES=threads
export OMP_PROC_BIND=spread

# Set CFLAGS and LDFLAGS and CUTLASS 
export CFLAGS="-I/global/homes/b/bin123/.conda/envs/llama_factory/include $CFLAGS"
export LDFLAGS="-L/global/homes/b/bin123/.conda/envs/llama_factory/lib $LDFLAGS"
export CUTLASS_PATH=/global/homes/b/bin123/cutlass

# run the application:
# applications may perform better with --gpu-bind=none instead of --gpu-bind=single:1 
echo "Start to run the experiment..."
chmod +x /global/homes/b/bin123/LLaMA-Factory/
chmod +x /pscratch/sd/b/bin123/Trained_Models/
srun -n 8 -c 64 --cpu_bind=cores -G 32 --gpu-bind=none  /global/homes/b/bin123/LLaMA-Factory/examples/full_multi_gpu/multi_node.sh > /global/homes/b/bin123/model_training.log 2>&1 
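
Since I suspect the srun CPU binding may be related (see the analysis below), here is a small diagnostic sketch I run on a compute node with the same srun/--cpu_bind options (the file name check_affinity.py and its contents are mine, not part of LLaMA-Factory); it reports how many CPUs the binding actually exposes to the Python process and whether a plain fork-based worker pool completes at all:

# check_affinity.py -- hypothetical diagnostic script
import multiprocessing as mp
import os

def square(x):
    return x * x

if __name__ == "__main__":
    # CPUs this process is actually allowed to run on under the srun binding.
    print("visible CPUs:", len(os.sched_getaffinity(0)), flush=True)
    # Check whether a simple fork-based pool completes or hangs like .map does.
    with mp.get_context("fork").Pool(processes=4) as pool:
        print("pool result:", pool.map(square, range(8)), flush=True)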

Full-parameter SFT script:

#!/bin/bash
OUTPUT=$1
ZERO_STAGE=$2
if [ "$OUTPUT" == "" ]; then
    OUTPUT=$SCRATCH/deepseeker_33b/full/sft/  # Try to keep all the output models out of the Github folder.
fi
if [ "$ZERO_STAGE" == "" ]; then
    ZERO_STAGE=3
fi
mkdir -p $OUTPUT

export APP_TCP_PORT_RANGE=60000,60064
export GLOBUS_TCP_PORT_RANGE=60000,60064
MASTER_PORT="60006"

# Set master address, GPU_per_node, nodes
MASTER_ADDR=$(scontrol show hostname $SLURM_NODELIST | head -n 1)
NPROC_PER_NODE=4
NNODES=8
RANK=$SLURM_NODEID


echo "Running training script..."
echo "Output will be saved in: $OUTPUT"
echo "Using ZERO_STAGE: $ZERO_STAGE"
echo "Master address $MASTER_ADDR"
echo "Master port $MASTER_PORT"
echo "nproc_per_node $NPROC_PER_NODE"
echo "nnodes $NNODES"
echo "node_rank $RANK"

python -m torch.distributed.run \
    --nproc_per_node $NPROC_PER_NODE \
    --nnodes $NNODES \
    --node_rank $RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    /global/homes/b/bin123/LLaMA-Factory/src/train_bash.py \
    --deepspeed /global/homes/b/bin123/LLaMA-Factory/examples/full_multi_gpu/ds_z3_config.json \
    --stage sft \
    --do_train \
    --model_name_or_path /pscratch/sd/b/bin123/Trained_Models/deepseeker_interpreter_33b_with_st \
    --dataset Cust_dataset \
    --dataset_dir /global/homes/b/bin123/LLaMA-Factory/data \
    --template default \
    --finetuning_type full \
    --output_dir $OUTPUT \
    --overwrite_cache \
    --overwrite_output_dir \
    --use_fast_tokenizer true \
    --cutoff_len 5120 \
    --preprocessing_num_workers 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 500 \
    --evaluation_strategy steps \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --val_size 0.1 \
    --ddp_timeout 1800000 \
    --plot_loss \
    --fp16 \
    --resize_vocab true \
    > /global/homes/b/bin123/LLaMA-Factory/saves/deepseeker_33b/full/sft/training.log 2>&1

Problem analysis

  1. I have already tried many different values for preprocessing_num_workers; whenever --preprocessing_num_workers > 1, the run hangs right after loading completes.

  2. I added a number of print statements to /LLaMA-Factory/src/llmtuner/data/loader.py for debugging. In theory every print should be printed 4 times, but in practice the place where execution stops changes with the value of preprocessing_num_workers (judging by how many times the prints at different locations appear; some prints were only printed 3 times). It seems that the larger preprocessing_num_workers is, the earlier execution stops. A sketch of the rank-tagged print helper I have been inserting is shown after this list.

  3. The same run works fine on a non-cluster server, with no issues at all.

  4. My main suspicion now is a deadlock between the different workers, or that something in my Slurm launch settings is not configured correctly.
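
The rank-tagged print helper mentioned in item 2 is just the following (the insertion points inside loader.py are mine; LOCAL_RANK is set by torch.distributed.run and SLURM_NODEID by Slurm):

# rank_print.py -- debug helper sketch, not part of LLaMA-Factory
import os

def rank_print(msg: str) -> None:
    # Tag each line with the node and local rank so processes can be told
    # apart; flush so a hang does not hide buffered output.
    print(f"[node {os.environ.get('SLURM_NODEID', '?')} "
          f"local_rank {os.environ.get('LOCAL_RANK', '?')}] {msg}", flush=True)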

Thanks for any help!

Expected behavior

No response

System Info

No response

Others

No response

bin123apple · May 02 '24 03:05