OOM on 24gb GPU (4090) when running training tutorial

Open whydna opened this issue 2 years ago • 9 comments

RuntimeError: CUDA out of memory. Tried to allocate 4.00 GiB (GPU 0; 23.64 GiB total capacity; 15.74 GiB already allocated; 1.41 GiB free; 19.97 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Trying to run the training tutorial out of the box - how much VRAM is needed?
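The error message itself hints at one mitigation: when reserved memory is much larger than allocated memory, capping the allocator's split size through PYTORCH_CUDA_ALLOC_CONF may reduce fragmentation. A minimal sketch follows, assuming the variable is set before PyTorch makes its first CUDA allocation. The value 128 is purely illustrative and is not a recommendation from this thread, and this setting does not lower the model's actual memory requirement.

```python
import os

# Set this before the first CUDA allocation; the simplest way is to do it
# before `import torch` at the top of the training script.
# 128 MB is an illustrative value only (assumption, not from the thread).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

The same variable can be exported in the shell before launching the training script instead.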

whydna avatar Mar 21 '23 14:03 whydna

Based on my experiments, I think you need at least 29 GB of VRAM to run the tutorial.

jingyangcarl avatar Mar 25 '23 01:03 jingyangcarl

Setting save_memory = True will support training on 16 GB of VRAM.

lllyasviel avatar Mar 25 '23 05:03 lllyasviel

Could you tell me where to set it?

rcc-cubAC avatar Mar 26 '23 05:03 rcc-cubAC

In config.py, line 1.
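As a sketch of the change being described, the first line of config.py would look like this after editing. It assumes the file ships with the flag set to False, which is not stated in this thread.

```python
# config.py, line 1
# Assumed default is False; the thread reports that True allows training
# on a GPU with 16 GB of VRAM.
save_memory = True
```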

dzhcool avatar Mar 27 '23 09:03 dzhcool

Based on my experiments, I think you need at least 29 GB of VRAM to run the tutorial.

Thanks, this is what I was looking for.

Is there a way to compute GPU requirements for a given dataset and experiment?
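Nobody in the thread gives a formula, and activation memory depends on batch size, resolution, and implementation details, so it is usually measured rather than computed. What can be computed is a rough lower bound for the fixed training state. The sketch below assumes 32-bit values and an Adam-style optimizer that keeps two moment buffers per trainable weight. The function name and the parameter counts are hypothetical and chosen only for illustration.

```python
def training_state_gib(trainable_params: int, frozen_params: int = 0,
                       bytes_per_value: int = 4) -> float:
    """Lower bound (GiB) for weights, gradients and Adam moments; no activations."""
    # Each trainable weight needs: the weight, its gradient, and two Adam moments.
    trainable_bytes = trainable_params * bytes_per_value * 4
    # Frozen weights only need to be stored.
    frozen_bytes = frozen_params * bytes_per_value
    return (trainable_bytes + frozen_bytes) / 2**30


# Hypothetical parameter counts, for illustration only.
estimate = training_state_gib(trainable_params=400_000_000,
                              frozen_params=1_000_000_000)
print(f"{estimate:.2f} GiB before activations")
```

Whatever this returns has to be added to the activation memory, which grows with batch size and image resolution.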

shravankumar147 avatar May 01 '23 14:05 shravankumar147

Could you tell me where to set it?

https://github.com/lllyasviel/ControlNet/blob/d3284fcd0972c510635a4f5abe2eeb71dc0de524/config.py#L1

shravankumar147 avatar May 01 '23 14:05 shravankumar147

Setting save_memory = True will support training on 16 GB of VRAM.

Is it possible to train ControlNet with 11 GB of VRAM? @lllyasviel

universewill avatar May 03 '23 08:05 universewill

Setting save_memory = True will support training on 16 GB of VRAM.

Is it possible to train ControlNet with 11 GB of VRAM? @lllyasviel

I also have 11 GB of VRAM. I set the size of the training data to (128, 128), but it still does not work. Should this help, or did I just not set the size correctly?

In tutorial_dataset.py:

def __getitem__(self, idx):
    item = self.data[idx]

    source_filename = item['source']
    target_filename = item['target']
    prompt = item['prompt']

    source = cv2.imread('./training/fill50k/' + source_filename)
    target = cv2.imread('./training/fill50k/' + target_filename)

    source = cv2.resize(source, (128, 128))   # !!!!!!!!!
    target = cv2.resize(target, (128, 128))     # !!!!!!!!!

    # Do not forget that OpenCV reads images in BGR order.
    source = cv2.cvtColor(source, cv2.COLOR_BGR2RGB)
    target = cv2.cvtColor(target, cv2.COLOR_BGR2RGB)

    # Normalize source images to [0, 1].
    source = source.astype(np.float32) / 255.0

    # Normalize target images to [-1, 1].
    target = (target.astype(np.float32) / 127.5) - 1.0

    return dict(jpg=target, txt=prompt, hint=source)

YuanSnowing avatar May 22 '23 12:05 YuanSnowing

Setting save_memory = True will support training on 16 GB of VRAM.

Is it possible to train ControlNet with 11 GB of VRAM? @lllyasviel

I also have 11 GB of VRAM. I set the size of the training data to (128, 128), but it still does not work. Should this help, or did I just not set the size correctly?

In tutorial_dataset.py:

When I lowered the size further, I got an error:

Traceback (most recent call last):
  ........
  File "/kitti_gen/ControlNet/cldm/cldm.py", line 39, in forward
    h = torch.cat([h, hs.pop()], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 2 but got size 1 for tensor number 1 in the list.

I wonder why this happens. Maybe some sampling inconsistency makes the shapes differ?
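One plausible explanation is rounding in the downsampling path. If a feature map shrinks to a single pixel on the way down, doubling it on the way up no longer reproduces the size stored for the skip connection, and torch.cat fails. The sketch below simulates one spatial dimension under stated assumptions that are not confirmed in this thread: the autoencoder halves the size three times (rounding down), the UNet downsamples three times with a 3x3 convolution of stride 2 and padding 1, and upsampling doubles the size. The function names are hypothetical.

```python
def unet_sizes(image_size: int, vae_downsamples: int = 3, unet_downsamples: int = 3):
    """Return (sizes going down, sizes going up) for one spatial dimension."""
    size = image_size
    for _ in range(vae_downsamples):   # assumed: each autoencoder stage halves, rounding down
        size //= 2
    down = [size]
    for _ in range(unet_downsamples):  # assumed: 3x3 conv, stride 2, padding 1
        size = (size - 1) // 2 + 1
        down.append(size)
    up = [size]
    for _ in range(unet_downsamples):  # assumed: upsampling doubles the size
        size *= 2
        up.append(size)
    return down, up


def skip_connections_match(image_size: int) -> bool:
    """True if every upsampled size equals the stored skip-connection size."""
    down, up = unet_sizes(image_size)
    return down == up[::-1]


for resolution in (32, 64, 96, 128):
    print(resolution, unet_sizes(resolution), skip_connections_match(resolution))
```

Under these assumptions, a 32-pixel input produces an upsampled size of 2 against a stored size of 1, which matches the error above, while sizes that are multiples of 64 stay consistent.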

Even if I set the resolution to 64, set save_memory=True, and only use the middle layer, the model still runs out of memory. Is there anything else I can do?

YuanSnowing avatar May 22 '23 13:05 YuanSnowing

All duplicates concerning "RAM and out of memory exceptions (OOM)":
https://github.com/lllyasviel/ControlNet/issues/21
https://github.com/lllyasviel/ControlNet/issues/33
https://github.com/lllyasviel/ControlNet/issues/191
https://github.com/lllyasviel/ControlNet/issues/236
https://github.com/lllyasviel/ControlNet/issues/241
https://github.com/lllyasviel/ControlNet/issues/247
https://github.com/lllyasviel/ControlNet/issues/294
https://github.com/lllyasviel/ControlNet/issues/301

geroldmeisinger avatar Sep 17 '23 10:09 geroldmeisinger

@geroldmeisinger have you found a solution to this? As I mentioned elsewhere, commenting that there are duplicates just ends up ending the thread.

codeundercoverdev avatar Nov 07 '23 20:11 codeundercoverdev

My hope in pointing to the duplicates was to help others find all the information available on one topic and, at the same time, to focus everything on one "main" thread. On the other hand, this is an issue section, not a discussion forum, and there should only be one thread per issue.

have you found a solution to this?

You can try the diffusers training script, which claims to run on 8 GB using Linux and DeepSpeed (scroll all the way down). Someone also asked for ControlNet-XS support, and I asked for ControlNet Würstchen support, which may reduce training requirements, but so far neither has been implemented. If you know of any other "small" ControlNets, please let us know!

geroldmeisinger avatar Nov 08 '23 07:11 geroldmeisinger