ControlNet
Stuck during multi-machine multi-GPU training
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/16
Can you help analyze what the reason might be? Thanks, @lllyasviel
+1. It repeatedly outputs this for me and never starts training:
Loaded model config from [./models/cldm_v15.yaml]
Loaded state_dict from [./models/epoch=15-step=42351.ckpt]
Hi, did you solve the multi-GPU training problem? I can train on several GPUs, but the synthesized image doesn't follow the condition. It does work if I set the GPU number to 1 with a similar number of training epochs. Have you encountered this before?
Very weird.. sadly my problem persists.
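One thing worth checking when results degrade only in the multi-GPU case: with DDP each process sees its own batch, so the effective batch size grows with the number of GPUs, and a learning rate tuned for a single GPU may no longer match. A minimal sketch of the common linear-scaling heuristic (the names `base_lr`, `per_gpu_batch_size`, and `num_gpus` are illustrative, not taken from tutorial_train.py):

```python
# Sketch of the linear learning-rate scaling heuristic for DDP runs
# (not ControlNet's actual code): the effective batch size is the
# per-GPU batch size times the number of processes, and the learning
# rate is often scaled by the same factor to keep updates comparable.

def scaled_hyperparams(base_lr: float, per_gpu_batch_size: int, num_gpus: int):
    """Return (effective_batch_size, scaled_lr) for a DDP run."""
    effective_batch_size = per_gpu_batch_size * num_gpus
    scaled_lr = base_lr * num_gpus  # linear scaling rule
    return effective_batch_size, scaled_lr

# Example: a single-GPU recipe with batch size 1 and lr 1e-5,
# moved to 8 GPUs, gives an effective batch size of 8.
print(scaled_hyperparams(1e-5, 1, 8))
```

Whether linear scaling is the right adjustment depends on the model and schedule; it is only a starting point when single-GPU and multi-GPU runs diverge at similar epoch counts.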
Same here: tutorial_train.py works with batch size = 1 and GPU = 1, but not in the multi-GPU setting.
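For reference, tutorial_train.py builds a `pytorch_lightning.Trainer` configured for a single GPU; a multi-machine run also needs the node count and a DDP strategy set explicitly. A hedged sketch (argument names follow PyTorch Lightning 1.x, the exact keyword depends on your installed version, and this is not a verified fix for the hang above):

```python
import pytorch_lightning as pl

# Sketch only: in PyTorch Lightning 1.5+ the distributed backend is
# selected with `strategy`; some older 1.x versions used
# `accelerator="ddp"` instead.
trainer = pl.Trainer(
    gpus=8,          # GPUs per node
    num_nodes=2,     # machines participating in training
    strategy="ddp",  # DistributedDataParallel across all processes
    precision=32,
)
# trainer.fit(model, dataloader)  # as in tutorial_train.py
```

With multiple nodes, every machine must also see consistent `MASTER_ADDR`/`MASTER_PORT`/`NODE_RANK` environment variables, otherwise rank 0 prints the "initializing distributed" line and then waits forever for the missing peers.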
Curious about the training time of your experiment: how many epochs, and how much time did it take per epoch?
The HuggingFace Diffusers ControlNet training script (https://huggingface.co/docs/diffusers/training/controlnet) has several performance optimizations built in.
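If you switch to the Diffusers script, multi-GPU launching is handled by Accelerate rather than PyTorch Lightning. A hedged sketch of a launch command in the style of the Diffusers docs (the model path, dataset name, and hyperparameter values below are illustrative placeholders, not tested settings):

```shell
# Configure Accelerate once per machine (multi-GPU / multi-node
# topology is chosen interactively here):
accelerate config

# Then launch the Diffusers ControlNet training script; flag names
# follow the diffusers examples, values are placeholders.
accelerate launch train_controlnet.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --output_dir="controlnet-out" \
  --dataset_name="fusing/fill50k" \
  --resolution=512 \
  --train_batch_size=4 \
  --learning_rate=1e-5
```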
All duplicates concerning "multi GPU": https://github.com/lllyasviel/ControlNet/issues/148 https://github.com/lllyasviel/ControlNet/issues/314 https://github.com/lllyasviel/ControlNet/issues/319 https://github.com/lllyasviel/ControlNet/issues/507