preprocessing_num_workers cannot be set greater than 1 when preprocessing the dataset for multi-node multi-GPU SFT on Slurm
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
Could someone please help me take a look at this problem?
Problem description:
When running multi-node multi-GPU SFT on Slurm, as soon as preprocessing_num_workers > 1 the program hangs at this step in /LLaMA-Factory/src/llmtuner/train/sft/workflow.py: dataset = get_dataset(tokenizer, model_args, data_args, training_args, stage="sft").
Hardware: 8 nodes, 4 x 80GB A100 per node.
Training task: full-parameter SFT of a custom 33B model with an extended vocabulary. The dataset has roughly 100K samples.
Training scripts. Slurm script:
#!/bin/bash
#SBATCH -N 8
#SBATCH -C gpu&hbm80g
#SBATCH -G 32
#SBATCH -q regular
#SBATCH -J model_training
#SBATCH --mail-type=ALL
#SBATCH -t 24:00:00
# Load conda
echo "loading conda..."
module load conda
conda activate llama_factory
# Huggingface Setting
echo "Setting Huggingface..."
export HF_HOME=$SCRATCH/huggingface
export HF_TOKEN=<HF/Token>
# OpenMP settings:
echo "Setting OMP..."
export OMP_NUM_THREADS=1
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
# Set CFLAGS and LDFLAGS and CUTLASS
export CFLAGS="-I/global/homes/b/bin123/.conda/envs/llama_factory/include $CFLAGS"
export LDFLAGS="-L/global/homes/b/bin123/.conda/envs/llama_factory/lib $LDFLAGS"
export CUTLASS_PATH=/global/homes/b/bin123/cutlass
# run the application:
# applications may perform better with --gpu-bind=none instead of --gpu-bind=single:1
echo "Start to run the experiment..."
chmod +x /global/homes/b/bin123/LLaMA-Factory/
chmod +x /pscratch/sd/b/bin123/Trained_Models/
srun -n 8 -c 64 --cpu_bind=cores -G 32 --gpu-bind=none /global/homes/b/bin123/LLaMA-Factory/examples/full_multi_gpu/multi_node.sh > /global/homes/b/bin123/model_training.log 2>&1
Full-parameter SFT script:
#!/bin/bash
OUTPUT=$1
ZERO_STAGE=$2
if [ "$OUTPUT" == "" ]; then
OUTPUT=$SCRATCH/deepseeker_33b/full/sft/ # Try to keep all the output models out of the Github folder.
fi
if [ "$ZERO_STAGE" == "" ]; then
ZERO_STAGE=3
fi
mkdir -p $OUTPUT
export APP_TCP_PORT_RANGE=60000,60064
export GLOBUS_TCP_PORT_RANGE=60000,60064
MASTER_PORT="60006"
# Set master address, GPU_per_node, nodes
MASTER_ADDR=$(scontrol show hostname $SLURM_NODELIST | head -n 1)
NPROC_PER_NODE=4
NNODES=8
RANK=$SLURM_NODEID
echo "Running training script..."
echo "Output will be saved in: $OUTPUT"
echo "Using ZERO_STAGE: $ZERO_STAGE"
echo "Master address $MASTER_ADDR"
echo "Master port $MASTER_PORT"
echo "nproc_per_node $NPROC_PER_NODE"
echo "nnodes $NNODES"
echo "node_rank $RANK"
python -m torch.distributed.run \
--nproc_per_node $NPROC_PER_NODE \
--nnodes $NNODES \
--node_rank $RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT \
/global/homes/b/bin123/LLaMA-Factory/src/train_bash.py \
--deepspeed /global/homes/b/bin123/LLaMA-Factory/examples/full_multi_gpu/ds_z3_config.json \
--stage sft \
--do_train \
--model_name_or_path /pscratch/sd/b/bin123/Trained_Models/deepseeker_interpreter_33b_with_st \
--dataset Cust_dataset \
--dataset_dir /global/homes/b/bin123/LLaMA-Factory/data \
--template default \
--finetuning_type full \
--output_dir $OUTPUT \
--overwrite_cache \
--overwrite_output_dir \
--use_fast_tokenizer true \
--cutoff_len 5120 \
--preprocessing_num_workers 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 500 \
--evaluation_strategy steps \
--learning_rate 5e-5 \
--num_train_epochs 1.0 \
--val_size 0.1 \
--ddp_timeout 1800000 \
--plot_loss \
--fp16 \
--resize_vocab true \
> /global/homes/b/bin123/LLaMA-Factory/saves/deepseeker_33b/full/sft/training.log 2>&1
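To help isolate where the hang comes from, a minimal single-process check of just the preprocessing step, run on one compute node without srun/torchrun, might tell whether the problem is in the multiprocessing used by `datasets` or in the distributed launch. This is only a sketch under my assumptions: I assume the preprocessing inside get_dataset() ultimately calls datasets.Dataset.map(..., num_proc=preprocessing_num_workers), and the data file name and column name below are placeholders for my actual dataset:

```python
# Standalone check: reproduce only the tokenization/map step with num_proc > 1.
# Assumptions: the JSON file name and the "instruction" column are placeholders
# for my real Cust_dataset; the model path is the same one used in the SFT script.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/pscratch/sd/b/bin123/Trained_Models/deepseeker_interpreter_33b_with_st",
    use_fast=True,
)

dataset = load_dataset(
    "json",
    data_files="/global/homes/b/bin123/LLaMA-Factory/data/cust_dataset.json",  # placeholder file name
    split="train",
)

def tokenize_fn(examples):
    # Roughly the same kind of work the SFT preprocessing performs.
    return tokenizer(examples["instruction"], truncation=True, max_length=5120)

# If this already hangs with num_proc > 1 on a compute node, the problem is in
# the worker processes spawned by `datasets` itself rather than in
# torch.distributed or the Slurm launch.
tokenized = dataset.map(
    tokenize_fn,
    batched=True,
    num_proc=4,
    remove_columns=dataset.column_names,
)
print(tokenized)
```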
Problem analysis
- I have tried many different values for preprocessing_num_workers; whenever --preprocessing_num_workers > 1, the run hangs right after loading finishes.
- I added many print statements to /LLaMA-Factory/src/llmtuner/data/loader.py for debugging. In theory every print should appear 4 times, but in practice the place where the run gets stuck changes as preprocessing_num_workers changes (judging by how often prints at different locations appear; some prints only showed up 3 times). The larger preprocessing_num_workers is, the earlier the run seems to stop. A stack-dump alternative to the print statements is sketched after this list.
- On a non-cluster server the same setup runs without any problem.
- My main suspicion now is a deadlock between the different workers, or some Slurm launch setting that I have not configured correctly.
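Instead of scattering print statements, a stack dump after a timeout would show exactly which call each surviving process is blocked in. A minimal sketch using only the standard-library faulthandler module (where to place it in the LLaMA-Factory sources is my guess, e.g. near the top of src/train_bash.py):

```python
# Sketch: enable periodic stack dumps so a hang shows where each process is stuck.
# faulthandler is in the Python standard library; after the timeout every process
# that is still alive writes the tracebacks of all of its threads to stderr,
# which ends up in the log files redirected by the scripts above.
import faulthandler
import sys

# Dump every 300 seconds until cancelled; harmless if training proceeds normally.
faulthandler.dump_traceback_later(timeout=300, repeat=True, file=sys.stderr)
```

If py-spy happens to be installed on the compute nodes, attaching it to a stuck rank (py-spy dump --pid <PID>) would give the same information without modifying the code.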
Thanks for the help!
Expected behavior
No response
System Info
No response
Others
No response