sd-scripts
Fix multi-node environment training and accelerator-related code + skip file check option
The Accelerator setup and related code looped with an explicit local process index check instead of a global process index check, which made multi-node training hang forever.
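For reference, here is a minimal sketch of the index distinction involved, using Hugging Face Accelerate's state fields and a placeholder load_model(); it illustrates the semantics rather than the exact patch. The local process index repeats on every node (0 .. GPUs-per-node - 1), while the global process index is unique across the whole job, so per-rank gating across nodes has to use the global index.

# Minimal sketch (not the exact sd-scripts code): gate per-rank work on the global
# process index; the local index repeats on every node and only identifies the GPU
# within one machine.
from accelerate import Accelerator

accelerator = Accelerator()
state = accelerator.state

print(
    f"process index: {state.process_index}/{state.num_processes}, "
    f"local process index: {state.local_process_index}"
)

def load_model():
    pass  # placeholder for the heavy per-rank checkpoint load

# Staggered loading, one rank at a time across the entire job. Checking
# state.local_process_index here instead would match several ranks per iteration
# on a multi-node run, so the ranks no longer proceed one at a time as intended.
for pi in range(state.num_processes):
    if pi == state.process_index:
        load_model()
    accelerator.wait_for_everyone()  # keep all ranks in lockstep between loads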
After struggling with the code for weeks, the following Slurm batch script works for multi-node training, at least for sdxl_train_network and sdxl_train (fine-tuning).
#!/bin/bash
#SBATCH --job-name=multinode
#SBATCH --output=O-%x.%j
#SBATCH --error=E-%x.%j
#SBATCH --partition=<PARTITION>
#SBATCH --nodes=3 # number of nodes
#SBATCH --gres=gpu:4 # number of GPUs per node
#SBATCH --time=72:00:00 # maximum execution time (HH:MM:SS)
#SBATCH --cpus-per-gpu=16
#SBATCH --qos=<QOS_NAME>
######################
### Set environment ###
######################
# Activate your Python environment
conda init
conda activate kohya
unset LD_LIBRARY_PATH
# Change to the directory containing your script
cd ~/large_train/sd-scripts
gpu_count=$(scontrol show job $SLURM_JOB_ID | grep -oP 'TRES=.*?gpu=\K(\d+)' | head -1)
######################
#### Set network #####
######################
head_node_ip=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
PORT=29508 # set this to unused port
######################
export TORCH_DISTRIBUTED_DEBUG=INFO
export NCCL_DEBUG=INFO
export NCCL_P2P_DISABLE=1 # Disable this for general multi-node setup
# export NCCL_DEBUG_SUBSYS=ALL
#######################
export NCCL_ASYNC_ERROR_HANDLING=0
# export NCCL_DEBUG=INFO
# export NCCL_DEBUG_SUBSYS=COLL
# export NCCL_SOCKET_NTHREADS=1
# export NCCL_NSOCKS_PERTHREAD=1
# export CUDA_LAUNCH_BLOCKING=1
#######################
echo "SLURM_JOB_NODELIST is $SLURM_JOB_NODELIST"
node_name=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_ADDR=$(getent ahosts $node_name | head -n 1 | awk '{print $1}')
export SCRIPT="~/large_train/sd-scripts/sdxl_train.py "
export SCRIPT_ARGS=" \
--config_file ~/train_config_a6000_multinode.toml"
# for each node, set its machine_rank and launch accelerate
for node in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
export RANK=0
export LOCAL_RANK=0
export WORLD_SIZE=$gpu_count
export MASTER_ADDR=$head_node_ip
export MASTER_PORT=$PORT
export NODE_RANK=$(( $(scontrol show hostnames $SLURM_JOB_NODELIST | grep -nx "$node" | cut -d: -f1) - 1 )) # 0-based machine rank (position of this node in the node list)
export LAUNCHER="/home/usr/miniconda3/envs/kohya/bin/accelerate launch \
--num_processes $gpu_count \
--num_machines $SLURM_NNODES \
--rdzv_backend c10d \
--main_process_ip $head_node_ip \
--main_process_port $PORT \
--machine_rank $NODE_RANK"
echo "node: $node, rank: $RANK, local_rank: $LOCAL_RANK, world_size: $WORLD_SIZE, master_addr: $MASTER_ADDR, master_port: $MASTER_PORT, node_rank: $NODE_RANK"
CMD="$LAUNCHER $SCRIPT $SCRIPT_ARGS"
# pin this launcher to $node so the machine_rank above matches the node it runs on
srun --nodes=1 --ntasks=1 --ntasks-per-node=1 --nodelist="$node" $CMD &
done
wait
Success log:
2024-04-08 16:48:47 INFO Accelerator prepared at cuda:1 / process index : 4, local process index : 1  sdxl_train.py:203
                    INFO Waiting for everyone / 他のプロセスを待機中  sdxl_train.py:204
2024-04-08 16:48:48 INFO Accelerator prepared at cuda:0 / process index : 4, local process index : 0  sdxl_train.py:203
                    INFO Waiting for everyone / 他のプロセスを待機中  sdxl_train.py:204
...
2024-04-08 16:48:50 INFO All processes are ready / すべてのプロセスが準備完了  sdxl_train.py:206
                    INFO loading model for process 1 3/4  sdxl_train_util.py:28
...
2024-04-08 16:49:01 INFO model loaded for all processes 0 2/4  sdxl_train_util.py:56
steps: 0%| | 1/643744 [02:04<22266:39:32, 124.52s/it]
steps: 0%| | 1/643744 [02:04<22266:50:43, 124.52s/it, avr_loss=0.0703]
steps: 0%| | 2/643744 [02:15<12092:54:30, 67.63s/it, avr_loss=0.0703]
steps: 0%| | 2/643744 [02:15<12092:58:25, 67.63s/it, avr_loss=0.0918]
steps: 0%| | 3/643744 [02:25<8695:59:50, 48.63s/it, avr_loss=0.0918]
steps: 0%| | 3/643744 [02:25<8696:01:52, 48.63s/it, avr_loss=0.0919]
....
Also, a skip_file_existence_check = true option is added to skip the file verification step at the start of training.
It should only be enabled when all files are known to be usable, since it bypasses the os.path.exists() check for every file.
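As a rough illustration of what the option controls (the function and argument names here are hypothetical, not the exact sd-scripts code), the per-file existence filter is simply bypassed when the flag is set:

import os
from typing import Iterable, List

def collect_image_paths(paths: Iterable[str], skip_file_existence_check: bool = False) -> List[str]:
    """Return the usable image paths, optionally trusting that every file exists."""
    if skip_file_existence_check:
        # Trust the caller: no per-file os.path.exists() call. This saves a long
        # filesystem scan on very large datasets or slow network storage, but a
        # missing file will only surface later as a load error during training.
        return list(paths)
    # Default behavior: keep only the paths that actually exist on disk.
    return [p for p in paths if os.path.exists(p)]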
Thank you for this! I haven't used multi-node training, but this looks good.
Hi @kohya-ss, it's @GrigoryEvko here. I used this PR on 3 A100*8 nodes two months ago; it works fine and can be merged. I feel that it's even more relevant for FLUX model training than it was before.
My dev branch (a bit outdated) with these updates is here: https://github.com/kohya-ss/sd-scripts/compare/dev...evkogs:sd-scripts:dev
I didn't try saving the training state with this PR; maybe #1340 is required as well.
I can test and create a new PR against the latest dev branch to merge into, but it would be most useful to merge flux, sd3, and this PR into dev first; I can help a bit with those too.