Stuck during multi-machine multi-GPU training

Open qingfengmingyue opened this issue 2 years ago • 4 comments

Stuck during multi-machine multi-GPU training

initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/16

Can you help analyze what might be the reason? Thanks, @lllyasviel.
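A hang at this log line usually means rank 0 is blocked waiting for the other processes to join the process group. One quick thing to rule out is a missing rendezvous address or a master node that the other machines cannot reach. Below is a minimal diagnostic sketch, not part of ControlNet. It assumes the job uses the standard `MASTER_ADDR` and `MASTER_PORT` environment variables; the helper names `missing_rendezvous_vars` and `can_reach_master` are hypothetical, and which rank variable your launcher expects is not checked here.

```python
import os
import socket

# Assumption: the launcher reads the rendezvous address from these variables.
RENDEZVOUS_VARS = ("MASTER_ADDR", "MASTER_PORT")


def missing_rendezvous_vars(env=None):
    """Return the rendezvous variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in RENDEZVOUS_VARS if not env.get(name)]


def can_reach_master(addr, port, timeout=3.0):
    """Return True if a TCP connection to addr:port succeeds within timeout."""
    try:
        with socket.create_connection((addr, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    missing = missing_rendezvous_vars()
    if missing:
        print("Unset variables:", ", ".join(missing))
    else:
        ok = can_reach_master(os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"])
        print("Master reachable:", ok)
```

Run it on every non-master node after the rank 0 process has started. If the master is not reachable, a firewall or a wrong network interface is a more likely cause than the training code.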

qingfengmingyue avatar Mar 28 '23 13:03 qingfengmingyue

+1. It repeatedly outputs this for me and never starts training:

Loaded model config from [./models/cldm_v15.yaml]
Loaded state_dict from [./models/epoch=15-step=42351.ckpt]

andreemic avatar Apr 04 '23 11:04 andreemic

Hi, did you solve the multi-GPU training problem? I can train on several GPUs, but the synthesized images don't follow the condition. It works if I set the number of GPUs to 1 and train for a similar number of epochs. Have you run into this before?

ariannaliu avatar Apr 14 '23 16:04 ariannaliu

Very weird.. sadly my problem persists.

andreemic avatar Apr 15 '23 12:04 andreemic

Hi, did you solve the multi-GPU training problem? I can train on several GPUs, but the synthesized images don't follow the condition. It works if I set the number of GPUs to 1 and train for a similar number of epochs. Have you run into this before?

Same here: tutorial_train.py works with batch size = 1 and 1 GPU, but not in a multi-GPU setting.

I'm curious about the training time of your experiment: how many epochs did you run, and how long did each epoch take?
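One thing worth checking when comparing single-GPU and multi-GPU runs at "similar training epochs": with data-parallel training each GPU sees a shard of the dataset, so the same number of epochs yields far fewer optimizer steps. Whether this explains the results reported in this thread is an assumption, not something confirmed here. The arithmetic can be sketched as follows; `optimizer_steps` is a hypothetical helper, and the dataset size and batch sizes in the example are made-up numbers.

```python
import math


def optimizer_steps(dataset_size, per_gpu_batch, num_gpus, epochs, accumulate=1):
    """Approximate number of optimizer updates under data-parallel training.

    Each update consumes per_gpu_batch * num_gpus * accumulate samples,
    with the last partial batch of an epoch still counted as one update.
    """
    global_batch = per_gpu_batch * num_gpus * accumulate
    steps_per_epoch = math.ceil(dataset_size / global_batch)
    return steps_per_epoch * epochs


# Hypothetical example: 50,000 images, batch size 4 per GPU, 10 epochs.
single_gpu = optimizer_steps(50_000, per_gpu_batch=4, num_gpus=1, epochs=10)
eight_gpus = optimizer_steps(50_000, per_gpu_batch=4, num_gpus=8, epochs=10)
print(single_gpu, eight_gpus)  # 125000 15630
```

Under these assumptions the 8-GPU run performs about one eighth of the updates, so comparing runs by optimizer step count rather than by epoch count gives a fairer picture.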

shravankumar147 avatar May 03 '23 16:05 shravankumar147

The Hugging Face Diffusers ControlNet training script (https://huggingface.co/docs/diffusers/training/controlnet) has several performance optimizations built in.

geroldmeisinger avatar Sep 17 '23 08:09 geroldmeisinger

All duplicates concerning "Multi GPU": https://github.com/lllyasviel/ControlNet/issues/148 https://github.com/lllyasviel/ControlNet/issues/314 https://github.com/lllyasviel/ControlNet/issues/319 https://github.com/lllyasviel/ControlNet/issues/507

geroldmeisinger avatar Sep 17 '23 10:09 geroldmeisinger