Can’t continue training from a checkpoint in distributed mode

Open 666wodeyy opened this issue 1 year ago • 1 comments

My dataset consists of 8,000 grayscale images of size 256 × 256. This is my training script:

MODEL_FLAGS="--image_size 256 --num_channels 128 --num_res_blocks 3"

DIFFUSION_FLAGS="--diffusion_steps 1000 \
                --noise_schedule cosine \
                --use_kl True"

TRAIN_FLAGS="--lr 1e-4 --batch_size 8"
export OPENAI_LOGDIR=XXXX

export NCCL_DEBUG=INFO
export NCCL_SOCKET_NTHREADS=8

MASTER_PORT=$(python -c "import socket; s=socket.socket(socket.AF_INET, socket.SOCK_STREAM); s.bind(('',0)); print(s.getsockname()[1]); s.close()")
export MASTER_ADDR=localhost
export MASTER_PORT=$MASTER_PORT  

NUM_GPUS="2"
mpiexec -n $NUM_GPUS python image_train.py --data_dir ./data/XXX $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS --resume_checkpoint ./training_log/CREMI/model039000.pt

Strangely, when I do not specify a checkpoint (i.e., without the --resume_checkpoint flag), training runs normally on two V100s, but when I try to resume from a checkpoint, training fails with an error:

[Image: screenshot of the error]
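
Before digging into the distributed side, it may be worth ruling out a bad checkpoint path on the machine that runs mpiexec. The following is only a minimal sketch: `check_checkpoint` is a hypothetical helper, not part of improved-diffusion, and it assumes nothing about how the repo loads checkpoints. It only checks that the file exists, is readable, and is not empty.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper (NOT part of improved-diffusion).
# Returns 0 if the checkpoint file looks usable, non-zero otherwise:
#   1 = file not found, 2 = not readable, 3 = empty file.
check_checkpoint() {
    local ckpt="$1"
    if [ ! -f "$ckpt" ]; then
        echo "checkpoint not found: $ckpt" >&2
        return 1
    fi
    if [ ! -r "$ckpt" ]; then
        echo "checkpoint not readable: $ckpt" >&2
        return 2
    fi
    if [ ! -s "$ckpt" ]; then
        echo "checkpoint is empty: $ckpt" >&2
        return 3
    fi
    echo "checkpoint looks usable: $ckpt"
}
```

It could be called just before the launch line, for example `check_checkpoint ./training_log/CREMI/model039000.pt && mpiexec -n $NUM_GPUS ...`, so that a wrong path fails fast instead of surfacing as an error inside the MPI job. If the check passes, the problem is more likely in how the checkpoint is loaded across ranks.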

666wodeyy, Sep 30 '24 07:09

I have the same issue. Has there been a solution yet?

muworld, Dec 10 '24 13:12

I have the same issue, and I'm confused.

WenHe822, May 16 '25 04:05

Have you solved this?

bye111, Jun 19 '25 15:06