sd-scripts
Fix multi-node environment training and accelerator-related code + skip file check option
The Accelerator setup and related code looped with an explicit local process index check instead of a global process index check, which made multi-node training hang forever.
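For reference, here is a minimal sketch of the index distinction involved, using Hugging Face Accelerate's state fields and a placeholder load_model(); it illustrates the semantics rather than the exact patch. The local process index repeats on every node (0 .. GPUs-per-node - 1), while the global process index is unique across the whole job, so per-rank gating across nodes has to use the global index.

# Minimal sketch (not the exact sd-scripts code): gate per-rank work on the global
# process index; the local index repeats on every node and only identifies the GPU
# within one machine.
from accelerate import Accelerator

accelerator = Accelerator()
state = accelerator.state

print(
    f"process index: {state.process_index}/{state.num_processes}, "
    f"local process index: {state.local_process_index}"
)

def load_model():
    pass  # placeholder for the heavy per-rank checkpoint load

# Staggered loading, one rank at a time across the entire job. Checking
# state.local_process_index here instead would match several ranks per iteration
# on a multi-node run, so the ranks no longer proceed one at a time as intended.
for pi in range(state.num_processes):
    if pi == state.process_index:
        load_model()
    accelerator.wait_for_everyone()  # keep all ranks in lockstep between loads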
After struggling with the code for weeks, the following Slurm batch script works for multi-node training, at least for sdxl_train_network and sdxl_train (fine-tuning).
#!/bin/bash
#SBATCH --job-name=multinode
#SBATCH --output=O-%x.%j
#SBATCH --error=E-%x.%j
#SBATCH --partition=<PARTITION>
#SBATCH --nodes=3 # number of nodes
#SBATCH --gres=gpu:4 # number of GPUs per node
#SBATCH --time=72:00:00 # maximum execution time (HH:MM:SS)
#SBATCH --cpus-per-gpu=16
#SBATCH --qos=<QOS_NAME>
######################
### Set environment ###
######################
# Activate your Python environment
conda init
conda activate kohya
unset LD_LIBRARY_PATH
# Change to the directory containing your script
cd ~/large_train/sd-scripts
gpu_count=$(scontrol show job $SLURM_JOB_ID | grep -oP 'TRES=.*?gpu=\K(\d+)' | head -1)
######################
#### Set network #####
######################
head_node_ip=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
PORT=29508 # set this to unused port
######################
export TORCH_DISTRIBUTED_DEBUG=INFO
export NCCL_DEBUG=INFO
export NCCL_P2P_DISABLE=1 # Disable this for general multi-node setup
# export NCCL_DEBUG_SUBSYS=ALL
#######################
export NCCL_ASYNC_ERROR_HANDLING=0
# export NCCL_DEBUG=INFO
# export NCCL_DEBUG_SUBSYS=COLL
# export NCCL_SOCKET_NTHREADS=1
# export NCCL_NSOCKS_PERTHREAD=1
# export CUDA_LAUNCH_BLOCKING=1
#######################
echo "SLURM_JOB_NODELIST is $SLURM_JOB_NODELIST"
node_name=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_ADDR=$(getent ahosts $node_name | head -n 1 | awk '{print $1}')
export SCRIPT="~/large_train/sd-scripts/sdxl_train.py "
export SCRIPT_ARGS=" \
--config_file ~/train_config_a6000_multinode.toml"
# for each node, set its machine_rank and launch accelerate
for node in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
export RANK=0
export LOCAL_RANK=0
export WORLD_SIZE=$gpu_count
export MASTER_ADDR=$head_node_ip
export MASTER_PORT=$PORT
export NODE_RANK=$(( $(scontrol show hostnames $SLURM_JOB_NODELIST | grep -nx "$node" | cut -d: -f1) - 1 )) # 0-based machine rank (position of this node in the node list)
export LAUNCHER="/home/usr/miniconda3/envs/kohya/bin/accelerate launch \
--num_processes $gpu_count \
--num_machines $SLURM_NNODES \
--rdzv_backend c10d \
--main_process_ip $head_node_ip \
--main_process_port $PORT \
--machine_rank $NODE_RANK"
echo "node: $node, rank: $RANK, local_rank: $LOCAL_RANK, world_size: $WORLD_SIZE, master_addr: $MASTER_ADDR, master_port: $MASTER_PORT, node_rank: $NODE_RANK"
CMD="$LAUNCHER $SCRIPT $SCRIPT_ARGS"
# pin this launcher to $node so the machine_rank above matches the node it runs on
srun --nodes=1 --ntasks=1 --ntasks-per-node=1 --nodelist="$node" $CMD &
done
wait
Success log:
2024-04-08 16:48:47 INFO Accelerator prepared at cuda:1 / process index : 4, local process index : 1  sdxl_train.py:203
                    INFO Waiting for everyone / 他のプロセスを待機中  sdxl_train.py:204
2024-04-08 16:48:48 INFO Accelerator prepared at cuda:0 / process index : 4, local process index : 0  sdxl_train.py:203
                    INFO Waiting for everyone / 他のプロセスを待機中  sdxl_train.py:204
...
2024-04-08 16:48:50 INFO All processes are ready / すべてのプロセスが準備完了  sdxl_train.py:206
                    INFO loading model for process 1 3/4  sdxl_train_util.py:28
...
2024-04-08 16:49:01 INFO model loaded for all processes 0 2/4  sdxl_train_util.py:56
steps: 0%| | 1/643744 [02:04<22266:39:32, 124.52s/it]
steps: 0%| | 1/643744 [02:04<22266:50:43, 124.52s/it, avr_loss=0.0703]
steps: 0%| | 2/643744 [02:15<12092:54:30, 67.63s/it, avr_loss=0.0703]
steps: 0%| | 2/643744 [02:15<12092:58:25, 67.63s/it, avr_loss=0.0918]
steps: 0%| | 3/643744 [02:25<8695:59:50, 48.63s/it, avr_loss=0.0918]
steps: 0%| | 3/643744 [02:25<8696:01:52, 48.63s/it, avr_loss=0.0919]
....
Also, a skip_file_existence_check = true option is added to skip the file verification step at the start of training.
It should only be enabled when all files are known to be usable, since it bypasses the os.path.exists() check for every file.
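As a rough illustration of what the option controls (the function and argument names here are hypothetical, not the exact sd-scripts code), the per-file existence filter is simply bypassed when the flag is set:

import os
from typing import Iterable, List

def collect_image_paths(paths: Iterable[str], skip_file_existence_check: bool = False) -> List[str]:
    """Return the usable image paths, optionally trusting that every file exists."""
    if skip_file_existence_check:
        # Trust the caller: no per-file os.path.exists() call. This saves a long
        # filesystem scan on very large datasets or slow network storage, but a
        # missing file will only surface later as a load error during training.
        return list(paths)
    # Default behavior: keep only the paths that actually exist on disk.
    return [p for p in paths if os.path.exists(p)]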
Thank you for this! I haven't used multi-node training, but this looks good.
Hi @kohya-ss, it's @GrigoryEvko here. I used this PR on 3 A100*8 nodes two months ago; it works fine and can be merged. I feel that it's even more relevant for FLUX model training than it was before.
My dev branch (a bit outdated) with these updates is here: https://github.com/kohya-ss/sd-scripts/compare/dev...evkogs:sd-scripts:dev
I didn't try saving the training state with this PR; maybe #1340 is required as well.
I can test and create a new PR against the latest dev branch to merge into, but it would be most useful to merge flux, sd3, and this PR into dev first; I can help a bit with those too.