How does app.py perform parallel computing across multiple GPUs?
I set CUDA_VISIBLE_DEVICES=0,1,2,3, but it only computes on a single GPU.
https://huggingface.co/docs/diffusers/training/distributed_inference
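For reference, the data-parallel pattern from that tutorial looks roughly like this: each process gets its own GPU and a slice of the prompts. A minimal sketch; the model id and prompts are placeholders, not what app.py actually loads:

```python
import torch
from accelerate import PartialState
from diffusers import DiffusionPipeline

# Placeholder checkpoint -- substitute the model app.py actually uses.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
)

# Under `accelerate launch`, PartialState assigns each process its own GPU.
state = PartialState()
pipe.to(state.device)

prompts = ["a cat", "a dog", "a frog", "a bird"]  # example inputs
# With 4 processes, each one receives one prompt from this list.
with state.split_between_processes(prompts) as my_prompts:
    for prompt in my_prompts:
        image = pipe(prompt).images[0]
        image.save(f"out_{state.process_index}.png")
```

Launched with e.g. `accelerate launch --num_processes=4 script.py`. Note that setting CUDA_VISIBLE_DEVICES only controls which GPUs are visible; on its own it does not split the work across them.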
The tutorial above may help with distributed inference, but if I want to run this program on four 12 GB 2080 Ti cards, I will still hit an out-of-memory error.
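If a single copy of the model does not fit on one 12 GB card, diffusers can also shard one pipeline across all visible GPUs via `device_map`, instead of replicating it per GPU. A minimal sketch, again with a placeholder checkpoint:

```python
import torch
from diffusers import DiffusionPipeline

# "balanced" spreads the pipeline's components across all visible GPUs,
# so no single 12 GB card has to hold the whole model. Do not call
# pipe.to("cuda") afterwards; device placement is already handled.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",  # placeholder model id
    torch_dtype=torch.float16,
    device_map="balanced",
)
image = pipe("a photo of an astronaut").images[0]
```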
Try switching the dtype to torch.bfloat16. On a 2080 Ti it seems to run in CPU mode, which lowers speed.
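Concretely, the dtype switch is just the `torch_dtype` argument at load time (the checkpoint here is a placeholder):

```python
import torch
from diffusers import DiffusionPipeline

# bfloat16 halves weight memory relative to float32; placeholder checkpoint.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.bfloat16
)
```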
Besides, you could refer to the official documentation on reducing memory usage: https://huggingface.co/docs/diffusers/main/en/optimization/memory
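The main knobs from that page, as a sketch; these methods exist on standard Stable Diffusion pipelines, and the checkpoint is a placeholder:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
)

# Move each sub-model to the GPU only while it runs; large memory win
# for a modest slowdown.
pipe.enable_model_cpu_offload()
# Compute attention in slices instead of one large batch of matmuls.
pipe.enable_attention_slicing()
# Decode the VAE output one slice at a time.
pipe.enable_vae_slicing()
```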
Using this link as the solution.