Dreambooth-Stable-Diffusion
Any tips to make it run on an A100? (I can only run it on an A6000 so far)
Hi, thanks for providing the code; it's been helpful so far. One thing I have in mind: are there any tips to make it work on an A100? I know there is already a discussion about memory usage somewhere, which says the training pipeline uses 35+ GB. I tried it on an 8x A100 instance last night and it still gives out-of-memory errors. Running on a single A6000 works, though, as the A6000 has slightly more GPU memory than the A100. It's only 48 GB vs. 40 GB...
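For reference, here is a small standalone snippet (not part of the repo) to confirm how much memory each visible GPU actually has. Note that if training replicates the full model on every GPU, as standard data-parallel training does, adding more GPUs does not lower the per-GPU memory requirement.

```python
# Hypothetical quick check, not from the repo: list each visible GPU and its
# total memory, since the A6000 vs. A100 gap above comes down to 48 GB vs. 40 GB.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```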
I was able to run it on a V100 with 32 GB of memory, so there might be some other issue.
Please post your training config.
Nothing special in the training config; I used the same arguments as in the README.
I also could not run the project on a 32 GB V100 when I created the environment from environment.yaml, which installs PyTorch 1.10.2. However, after updating PyTorch to 1.12.1, it runs successfully.
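A minimal sanity check after recreating the environment; the version number is the one mentioned above, not read from the repo:

```python
# Confirm which PyTorch build the environment actually resolved to and that
# CUDA is visible before rerunning training.
import torch

print(torch.__version__)       # expect 1.12.1 per the comment above
print(torch.version.cuda)      # CUDA toolkit the wheel was built against
print(torch.cuda.is_available())
```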
Another trick to reduce memory: this code is based on Textual Inversion, and TI disables gradient checkpointing in a hard-coded way here (https://github.com/rinongal/textual_inversion/blob/main/ldm/modules/diffusionmodules/util.py#L112), because in TI the UNet is not optimized. Here, however, we do optimize the UNet, so we can turn gradient checkpointing back on, as in the original SD repo (https://github.com/CompVis/stable-diffusion/blob/main/ldm/modules/diffusionmodules/util.py#L112). Gradient checkpointing already defaults to True in the config (https://github.com/XavierXiao/Dreambooth-Stable-Diffusion/blob/main/configs/stable-diffusion/v1-finetune_unfrozen.yaml#L47); see the sketch below for what re-enabling it in util.py looks like.
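A minimal sketch of what re-enabling checkpointing in `ldm/modules/diffusionmodules/util.py` could look like. The original SD repo routes through its own `CheckpointFunction` when `flag` is True; to keep this snippet self-contained it uses `torch.utils.checkpoint` instead, which serves the same purpose. `flag` is what the `use_checkpoint` option in the model config feeds in.

```python
import torch
from torch.utils.checkpoint import checkpoint as torch_checkpoint


def checkpoint(func, inputs, params, flag):
    """Evaluate `func` without caching intermediate activations when `flag`
    is True, trading extra compute in the backward pass for lower memory.
    `params` is kept for signature compatibility with the repo's version;
    gradients still reach those parameters when the forward is recomputed."""
    if flag:
        # Recompute activations during backward instead of storing them.
        return torch_checkpoint(func, *inputs)
    # Plain forward pass: this branch is effectively what the TI fork hard-codes.
    return func(*inputs)
```

Since `use_checkpoint: True` is already set in v1-finetune_unfrozen.yaml, no config change is needed once the hard-coded bypass is removed.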
Nice trick! It reduces memory usage from 31 GB to 27 GB for me.