ControlNet
Training on multiple GPUs always OOMs while it works well on a single GPU
I used 4 Tesla T10 16GB GPUs to train my model, but it's weird that it can only run with gpus=1. My batch_size=4 and I can run it on 1 GPU, but as soon as I set gpus>1 it OOMs, no matter whether I use DDP mode or DP mode. I'm very confused and want to know whether the cause comes from the PyTorch Lightning framework or from the model construction.
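For context, the trainer setup is roughly like this (a sketch based on the standard tutorial_train.py; I'm assuming PyTorch Lightning 1.x, where the distributed mode is chosen via strategy):

```python
import pytorch_lightning as pl

# Single-GPU run (works with batch_size=4 in the DataLoader):
trainer = pl.Trainer(gpus=1, precision=32)

# Multi-GPU run (OOMs as described, whether DDP or DP is used):
# trainer = pl.Trainer(gpus=4, precision=32, strategy="ddp")  # or strategy="dp"
```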
I have the same problem, and I can't run on multiple GPUs because the images run on different GPUs but the prompt runs on a single GPU. Do you have any idea?
If you want to train it on only a single GPU, maybe you can try pinning the process to one device (see the sketch below) to make sure of your device. But I still don't know how to train it on multiple GPUs. Can you tell me how you made your images run on different GPUs?
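Roughly something like this (a minimal sketch, assuming the usual tutorial_train.py setup; the environment variable has to be set before anything initializes CUDA):

```python
import os

# Hide all but one GPU from this process; must happen before torch/CUDA is initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import pytorch_lightning as pl

# With only one device visible, gpus=1 pins training to it.
trainer = pl.Trainer(gpus=1, precision=32)
```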
You can try setting strategy='dp', then you can run on multiple GPUs, but you may get the same issue as mine.
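For reference, this is the kind of change I mean in the trainer (a sketch assuming PyTorch Lightning 1.5+, where strategy replaced the older distributed_backend argument):

```python
import pytorch_lightning as pl

# DP keeps a single process and splits each batch across the visible GPUs,
# so with batch_size=4 and 4 GPUs each device only sees one sample per step.
trainer = pl.Trainer(gpus=4, precision=32, strategy="dp")
```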
I have tried this strategy, but it still OOMs.
try fsdp or deepspeed
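For example, by switching the Lightning strategy (a sketch; this assumes PyTorch Lightning 1.5+ with the deepspeed package installed, and the exact strategy names may vary by version):

```python
import pytorch_lightning as pl

# DeepSpeed ZeRO stage 2 shards optimizer states and gradients across GPUs,
# which lowers per-GPU memory compared with plain DDP/DP.
trainer = pl.Trainer(gpus=4, precision=16, strategy="deepspeed_stage_2")

# Alternatively, fully sharded data parallel:
# trainer = pl.Trainer(gpus=4, precision=16, strategy="fsdp")
```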
I tried it and got
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
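From what I can tell, that error usually means an index tensor (most likely the tokenized prompt, since embedding lookups use index_select) is still on cuda:0 while the model replica lives on cuda:1. The general fix is to move the indices onto the embedding's device before the lookup; a hedged, generic sketch (the exact place to apply this in the ControlNet/LDM code may differ):

```python
import torch
import torch.nn as nn

def embed_tokens(embedding: nn.Embedding, token_ids: torch.Tensor) -> torch.Tensor:
    # index_select requires the indices and the weight tensor to be on the same
    # device, so move the token ids to wherever the embedding weights live.
    token_ids = token_ids.to(embedding.weight.device)
    return embedding(token_ids)
```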
16 GB is still quite limited, as far as I'm concerned.
Thanks, everyone. I've resolved the problem by using 4 P40 24GB GPUs instead. 16GB is not enough for multi-GPU training.