ControlNet
Training on multiple GPUs always OOMs while it works well on a single GPU
I used 4 Tesla T10 16GB GPUs to train my model, but it's weird that it can only run with gpus=1. My batch_size=4 and I can run it on 1 GPU, but as soon as I set gpus>1 it OOMs, no matter whether I use DDP mode or DP mode. I'm very confused and want to know whether the cause comes from the PyTorch Lightning framework or from the model construction.
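For context, the trainer setup is roughly like this (a sketch based on the standard tutorial_train.py; I'm assuming PyTorch Lightning 1.x, where the distributed mode is chosen via strategy):

```python
import pytorch_lightning as pl

# Single-GPU run (works with batch_size=4 in the DataLoader):
trainer = pl.Trainer(gpus=1, precision=32)

# Multi-GPU run (OOMs as described, whether DDP or DP is used):
# trainer = pl.Trainer(gpus=4, precision=32, strategy="ddp")  # or strategy="dp"
```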
I have the same problem, and I can't run on multiple GPUs because the images run on different GPUs but the prompt runs on a single GPU. Do you have any idea?
If you want to train it on only a single GPU, maybe you can try pinning the process to one device (see the sketch below) to make sure of your device. But I still don't know how to train it on multiple GPUs. Can you tell me how you made your images run on different GPUs?
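Roughly something like this (a minimal sketch, assuming the usual tutorial_train.py setup; the environment variable has to be set before anything initializes CUDA):

```python
import os

# Hide all but one GPU from this process; must happen before torch/CUDA is initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import pytorch_lightning as pl

# With only one device visible, gpus=1 pins training to it.
trainer = pl.Trainer(gpus=1, precision=32)
```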
You can try setting strategy='dp', then you can run on multiple GPUs, but you may get the same issue as mine.
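For reference, this is the kind of change I mean in the trainer (a sketch assuming PyTorch Lightning 1.5+, where strategy replaced the older distributed_backend argument):

```python
import pytorch_lightning as pl

# DP keeps a single process and splits each batch across the visible GPUs,
# so with batch_size=4 and 4 GPUs each device only sees one sample per step.
trainer = pl.Trainer(gpus=4, precision=32, strategy="dp")
```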
I have tried this strategy, but it still OOMs.
try fsdp or deepspeed
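For example, by switching the Lightning strategy (a sketch; this assumes PyTorch Lightning 1.5+ with the deepspeed package installed, and the exact strategy names may vary by version):

```python
import pytorch_lightning as pl

# DeepSpeed ZeRO stage 2 shards optimizer states and gradients across GPUs,
# which lowers per-GPU memory compared with plain DDP/DP.
trainer = pl.Trainer(gpus=4, precision=16, strategy="deepspeed_stage_2")

# Alternatively, fully sharded data parallel:
# trainer = pl.Trainer(gpus=4, precision=16, strategy="fsdp")
```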
I tried it and got
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
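From what I can tell, that error usually means an index tensor (most likely the tokenized prompt, since embedding lookups use index_select) is still on cuda:0 while the model replica lives on cuda:1. The general fix is to move the indices onto the embedding's device before the lookup; a hedged, generic sketch (the exact place to apply this in the ControlNet/LDM code may differ):

```python
import torch
import torch.nn as nn

def embed_tokens(embedding: nn.Embedding, token_ids: torch.Tensor) -> torch.Tensor:
    # index_select requires the indices and the weight tensor to be on the same
    # device, so move the token ids to wherever the embedding weights live.
    token_ids = token_ids.to(embedding.weight.device)
    return embedding(token_ids)
```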
16 GB is still quite limited, as far as I'm concerned.
Thanks, everyone. I've resolved the problem by using 4 P40 24GB GPUs instead. 16GB is not enough for multi-GPU training.