How does app.py perform parallel computing across multiple GPUs?
I set CUDA_VISIBLE_DEVICES=0,1,2,3, but it only computes on a single GPU.
https://huggingface.co/docs/diffusers/training/distributed_inference
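For reference, the data-parallel pattern from that tutorial looks roughly like this: each process gets its own GPU and a slice of the prompts. A minimal sketch; the model id and prompts are placeholders, not what app.py actually loads:

```python
import torch
from accelerate import PartialState
from diffusers import DiffusionPipeline

# Placeholder checkpoint -- substitute the model app.py actually uses.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
)

# Under `accelerate launch`, PartialState assigns each process its own GPU.
state = PartialState()
pipe.to(state.device)

prompts = ["a cat", "a dog", "a frog", "a bird"]  # example inputs
# With 4 processes, each one receives one prompt from this list.
with state.split_between_processes(prompts) as my_prompts:
    for prompt in my_prompts:
        image = pipe(prompt).images[0]
        image.save(f"out_{state.process_index}.png")
```

Launched with e.g. `accelerate launch --num_processes=4 script.py`. Note that setting CUDA_VISIBLE_DEVICES only controls which GPUs are visible; on its own it does not split the work across them.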
The tutorial above may help with distributed inference, but if I want to run this program on four 12 GB 2080 Ti cards, I will still hit an out-of-memory error.
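If a single copy of the model does not fit on one 12 GB card, diffusers can also shard one pipeline across all visible GPUs via `device_map`, instead of replicating it per GPU. A minimal sketch, again with a placeholder checkpoint:

```python
import torch
from diffusers import DiffusionPipeline

# "balanced" spreads the pipeline's components across all visible GPUs,
# so no single 12 GB card has to hold the whole model. Do not call
# pipe.to("cuda") afterwards; device placement is already handled.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",  # placeholder model id
    torch_dtype=torch.float16,
    device_map="balanced",
)
image = pipe("a photo of an astronaut").images[0]
```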
Try switching the dtype to torch.bfloat16. On a 2080 Ti it seems to run in CPU mode, which lowers speed.
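Concretely, the dtype switch is just the `torch_dtype` argument at load time (the checkpoint here is a placeholder):

```python
import torch
from diffusers import DiffusionPipeline

# bfloat16 halves weight memory relative to float32; placeholder checkpoint.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.bfloat16
)
```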
Besides, you could refer to the official documentation on reducing memory usage: https://huggingface.co/docs/diffusers/main/en/optimization/memory
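The main knobs from that page, as a sketch; these methods exist on standard Stable Diffusion pipelines, and the checkpoint is a placeholder:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
)

# Move each sub-model to the GPU only while it runs; large memory win
# for a modest slowdown.
pipe.enable_model_cpu_offload()
# Compute attention in slices instead of one large batch of matmuls.
pipe.enable_attention_slicing()
# Decode the VAE output one slice at a time.
pipe.enable_vae_slicing()
```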
Using this link as the solution.