ControlNet
Why can't `pl.Trainer` handle the multi-GPU case?
I can run the original tutorial_train.py on a single 3090 Ti GPU (24 GB) with batch_size 3.
However, when upgrading to 2 or more GPUs, it keeps warning OOM.
trainer = pl.Trainer(gpus=2, precision=32, callbacks=[logger])
I am curious why. Why can a single GPU handle batch 3 while multiple GPUs can only handle 1? The GPUs hold their own batches in parallel, am I right?
Because one GPU needs to compute `gradient = (gradient_from_gpu_1 + gradient_from_gpu_2) / 2`. This computation takes a lot of VRAM.
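For intuition, here is a minimal sketch of that DP-style aggregation, assuming two replicas whose gradients are gathered onto `cuda:0`; the function and setup are hypothetical, not taken from the ControlNet code:

```python
import torch

def average_gradients(replicas):
    # Gather each parameter's gradient from every replica onto GPU 0
    # and average them there. GPU 0 temporarily holds one extra
    # gradient tensor per replica, which is the extra VRAM cost.
    for params in zip(*(r.parameters() for r in replicas)):
        grads = [p.grad.to("cuda:0") for p in params]
        params[0].grad = torch.stack(grads).mean(dim=0)
```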
Thanks @lllyasviel!
So basically the bottleneck is the GPU holding the gradient averaging, while the rest should work fine: e.g., GPU0 requires 24G+24G, while GPU1, GPU2, and GPU3 each require 24G.
Thus, we should leave GPU0 some headroom, e.g., run with GPU0 at 12G+12G and GPU1, GPU2, GPU3 at 12G each.
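As a back-of-the-envelope check (illustrative numbers from this thread, assuming GPU0 holds one extra copy of the gathered gradients, not measurements):

```python
# Memory budget under the aggregation hypothesis above.
per_replica_gb = 12      # weights + activations + own gradients per GPU
gathered_extra_gb = 12   # gradients gathered onto GPU 0 for averaging
print(f"GPU0 needs ~{per_replica_gb + gathered_extra_gb} GB, "
      f"each other GPU needs ~{per_replica_gb} GB")
```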
Sorry, I have another question.
I come from the recognition community. In recognition, multi-GPU training normally doesn't produce significantly different memory usage across GPUs. Does this "1-big-GPU" behavior only happen in Stable Diffusion / ControlNet?
Use the FSDP or DeepSpeed training strategy.
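A minimal sketch of wiring that into the trainer, assuming a PyTorch Lightning version where the `strategy` argument and the `deepspeed_stage_2` alias are available (strategy names vary across versions, so check your install):

```python
import pytorch_lightning as pl

# Shard optimizer state and gradients across GPUs instead of
# aggregating them on one device. `logger` is the callback from
# the original tutorial_train.py.
trainer = pl.Trainer(
    gpus=2,
    precision=16,                  # DeepSpeed is typically run in fp16
    strategy="deepspeed_stage_2",  # or an FSDP strategy, e.g. "fsdp"
    callbacks=[logger],
)
```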
The HuggingFace Diffusers ControlNet training script (https://huggingface.co/docs/diffusers/training/controlnet) has different optimizations built in.
All duplicates concerning "Multi GPU":
- https://github.com/lllyasviel/ControlNet/issues/148
- https://github.com/lllyasviel/ControlNet/issues/314
- https://github.com/lllyasviel/ControlNet/issues/319
- https://github.com/lllyasviel/ControlNet/issues/507