improved-diffusion
Can't continue training from a checkpoint in distributed mode!
My dataset consists of 8,000 grayscale images of size 256×256. The following is my training script:
MODEL_FLAGS="--image_size 256 --num_channels 128 --num_res_blocks 3"
DIFFUSION_FLAGS="--diffusion_steps 1000 \
--noise_schedule cosine \
--use_kl True"
TRAIN_FLAGS="--lr 1e-4 --batch_size 8"
export OPENAI_LOGDIR=XXXX
export NCCL_DEBUG=INFO
export NCCL_SOCKET_NTHREADS=8
MASTER_PORT=$(python -c "import socket; s=socket.socket(socket.AF_INET, socket.SOCK_STREAM); s.bind(('',0)); print(s.getsockname()[1]); s.close()")
export MASTER_ADDR=localhost
export MASTER_PORT=$MASTER_PORT
NUM_GPUS="2"
mpiexec -n $NUM_GPUS python image_train.py --data_dir ./data/XXX $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS --resume_checkpoint ./training_log/CREMI/model039000.pt
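Before digging into the distributed path, it may help to confirm that the checkpoint itself loads cleanly in a single process. Below is a minimal sanity check; the checkpoint path is the one from the script above, and the rest is a generic PyTorch sketch rather than improved-diffusion's own loading code:

import torch

# Load the checkpoint on CPU in a single process to verify it is readable
# and to inspect which parameter tensors it contains.
ckpt_path = "./training_log/CREMI/model039000.pt"
state_dict = torch.load(ckpt_path, map_location="cpu")

print(f"{len(state_dict)} tensors in checkpoint")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))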
Strangely, when I do not specify a checkpoint (i.e., without the --resume_checkpoint flag), the model trains normally on two V100s, but as soon as I pass a checkpoint to continue training, it errors out.
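For context, when resuming under MPI the training loop typically loads the checkpoint on one rank and then synchronizes the weights to the other ranks. A rough sketch of that pattern follows; it is a generic illustration using mpi4py and PyTorch, not the repository's exact code, and the function and variable names are assumptions:

import io
import torch
from mpi4py import MPI

comm = MPI.COMM_WORLD

def load_and_sync(ckpt_path, model):
    # Rank 0 reads the checkpoint bytes from disk; the other ranks receive
    # them via an MPI broadcast, so every process gets identical weights.
    data = None
    if comm.rank == 0:
        with open(ckpt_path, "rb") as f:
            data = f.read()
    data = comm.bcast(data, root=0)
    state_dict = torch.load(io.BytesIO(data), map_location="cpu")
    model.load_state_dict(state_dict)

If the single-process load succeeds but the mpiexec run still fails, comparing what each rank does around this broadcast step (for example, whether the checkpoint file is visible to rank 0) is a reasonable place to start debugging.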
I have the same issue. Has there been a solution yet?
I have the same issue, and I'm confused.
Have you solved it?