
Multi-node training

drhicks opened this issue 1 year ago · 3 comments

I was able to successfully train multimer on a single node with multiple GPUs, but I have been having trouble modifying the training example to train on multiple nodes. Would it be possible to provide an example for multi-node training?

I'm not sure how to properly modify the torchrun command or unicore arguments.

drhicks · Aug 18, 2023

Can you please provide more info on the code you are using and the failure message? The code is expected to handle multi-node training with torch.distributed, so to me it seems like a configuration problem in your distributed setup.
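
For reference, the usual torchrun pattern for multi-node jobs is that every node runs the same launch command with identical --nnodes, --master_addr, and --master_port, and only --node_rank differs per node. A minimal sketch (NNODES, GPUS_PER_NODE, NODE_RANK, MASTER_ADDR, MASTER_PORT, and the entrypoint are placeholders, not Uni-Fold-specific names):

```bash
# Run this once on every node in the job; only NODE_RANK changes between nodes.
torchrun --nnodes=$NNODES \
    --nproc_per_node=$GPUS_PER_NODE \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    your_train_entrypoint.py --your-args ...
```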

ZiyaoLi · Aug 25, 2023

Thanks for the help. I am sure I am just doing something stupid here.

I submit to slurm like this (everything works with single node):

sbatch -p gpu-train --gres=gpu:l40:8 -N 2 -c4 -n8 -t 99:00:00 --wrap='source activate unifold; cd /home/drhicks1/Uni-Fold; bash train_multimer.sh /databases/unifold/ multimer_unifold_ft params/multimer.unifold.pt multimer'

The training script is the same as the examples, except for these changes: `MASTER_IP=$(hostname -I | awk '{print $1}')`, `OMPI_COMM_WORLD_SIZE=$SLURM_NNODES`, and `OMPI_COMM_WORLD_RANK=$SLURM_NODEID`.

code:

```bash
[ -z "${MASTER_PORT}" ] && MASTER_PORT=10087
[ -z "${MASTER_IP}" ] && MASTER_IP=$(hostname -I | awk '{print $1}')
[ -z "${n_gpu}" ] && n_gpu=$(nvidia-smi -L | wc -l)
[ -z "${update_freq}" ] && update_freq=1
[ -z "${total_step}" ] && total_step=10000
[ -z "${warmup_step}" ] && warmup_step=500
[ -z "${decay_step}" ] && decay_step=10000
[ -z "${decay_ratio}" ] && decay_ratio=1.0
[ -z "${sd_prob}" ] && sd_prob=0.5
[ -z "${lr}" ] && lr=5e-4
[ -z "${seed}" ] && seed=31
[ -z "${OMPI_COMM_WORLD_SIZE}" ] && OMPI_COMM_WORLD_SIZE=$SLURM_NNODES
[ -z "${OMPI_COMM_WORLD_RANK}" ] && OMPI_COMM_WORLD_RANK=$SLURM_NODEID

export NCCL_ASYNC_ERROR_HANDLING=1
export OMP_NUM_THREADS=1
echo "n_gpu per node" $n_gpu
echo "OMPI_COMM_WORLD_SIZE" $OMPI_COMM_WORLD_SIZE
echo "OMPI_COMM_WORLD_RANK" $OMPI_COMM_WORLD_RANK
echo "MASTER_IP" $MASTER_IP
echo "MASTER_PORT" $MASTER_PORT
echo "data" $1
echo "save_dir" $2
echo "decay_step" $decay_step
echo "warmup_step" $warmup_step
echo "decay_ratio" $decay_ratio
echo "lr" $lr
echo "total_step" $total_step
echo "update_freq" $update_freq
echo "seed" $seed
echo "data_folder:"
ls $1
echo "create folder for save"
mkdir -p $2
echo "start training"

OPTION=""
if [ -f "$2/checkpoint_last.pt" ]; then
  echo "ckp exists."
else
  echo "finetuning from inital training..."
  OPTION=" --finetune-from-model $3 --load-from-ema "
fi
model_name=$4

tmp_dir=$(mktemp -d)

torchrun --nproc_per_node=$n_gpu --master_port $MASTER_PORT --nnodes=$OMPI_COMM_WORLD_SIZE --node_rank=$OMPI_COMM_WORLD_RANK --master_addr=$MASTER_IP \
    $(which unicore-train) $1 --user-dir unifold \
    --num-workers 4 --ddp-backend=no_c10d \
    --task af2 --loss afm --arch af2 --sd-prob $sd_prob \
    --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-6 --clip-norm 0.0 --per-sample-clip-norm 0.1 --allreduce-fp32-grad \
    --lr-scheduler exponential_decay --lr $lr --warmup-updates $warmup_step --decay-ratio $decay_ratio --decay-steps $decay_step --stair-decay --batch-size 1 \
    --update-freq $update_freq --seed $seed --tensorboard-logdir $2/tsb/ \
    --max-update $total_step --max-epoch 1 --log-interval 10 --log-format simple \
    --save-interval-updates 500 --validate-interval-updates 500 --keep-interval-updates 40 --no-epoch-checkpoints \
    --save-dir $2 --tmp-save-dir $tmp_dir --required-batch-size-multiple 1 --bf16 --ema-decay 0.999 --data-buffer-size 32 --bf16-sr --model-name $model_name $OPTION

rm -rf $tmp_dir
```

Below is the log output. It just hangs forever after this:

n_gpu per node 8
OMPI_COMM_WORLD_SIZE 2
OMPI_COMM_WORLD_RANK 0
MASTER_IP 172.16.130.196
MASTER_PORT 10087
data /databases/openfold/unifold/
save_dir multimer_unifold_ft3
decay_step 10000
warmup_step 500
decay_ratio 1.0
lr 5e-4
total_step 10000
update_freq 1
seed 31
data_folder:
eval_multi_label.json eval_sample_weight.json pdb_assembly.json pdb_features pdb_labels pdb_uniprots sd_features sd_labels sd_train_sample_weight.json train_multi_label.json train_sample_weight.json
create folder for save
start training
finetuning from inital training...

drhicks · Aug 26, 2023

I think the master addr/IP is not set properly; each node sets itself as the master. See the sketch below the link for one way to fix this.

see https://discuss.pytorch.org/t/distributed-training-on-slurm-cluster/150417/8
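
A minimal sketch of one way to set this under SLURM, assuming the training script is launched once per node (e.g. via `srun --ntasks-per-node=1` inside the allocation) so that `SLURM_NODEID` is defined on each node; `scontrol` and the `SLURM_*` variables are standard SLURM, not anything Uni-Fold-specific:

```bash
# Take the first hostname in the job's node list as the master, so every
# node computes the same address instead of using its own `hostname -I`.
MASTER_IP=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
[ -z "${MASTER_PORT}" ] && MASTER_PORT=10087

# Node count and per-node rank taken straight from SLURM.
OMPI_COMM_WORLD_SIZE=$SLURM_NNODES
OMPI_COMM_WORLD_RANK=$SLURM_NODEID
```

With this, the `--master_addr`, `--nnodes`, and `--node_rank` that torchrun receives agree across nodes, which is what the rendezvous needs.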

jozhang97 · Mar 19, 2024