stable-diffusion.cpp
                                
Much higher RAM usage (2-3 times) compared to FastSDCPU when using the exact same models/settings
Currently, stable-diffusion.cpp seems to use far more RAM than FastSDCPU (https://github.com/rupeshs/fastsdcpu, written in Python) for the same result.
I compared the Dreamshaper LCM model + TAESD at 5 steps and a resolution of 512x512 on stable-diffusion.cpp vs FastSDCPU, running on the CPU.
The speed is identical: I get ~4.4 s/it with both projects.
But stable-diffusion.cpp uses a peak of 2 GB RAM, or 1.6 GB with flash attention enabled, while FastSDCPU only uses a peak of 700 MB RAM. So stable-diffusion.cpp needs roughly 2-3x more RAM for the same result.
It looks like some significant optimizations are possible in stable-diffusion.cpp that would make it much more memory efficient.
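For anyone who wants to reproduce the peak RAM figures above, here is a minimal sketch of one way to measure them on Linux or macOS. It is not how the numbers in this post were obtained; the helper name `peak_rss_mb` is made up for illustration, and the command to run is whatever invocation you normally use for either project.

```python
import resource
import subprocess
import sys


def peak_rss_mb(cmd):
    """Run cmd to completion and return the peak resident set size of child processes, in MB.

    Note: RUSAGE_CHILDREN reports the largest value over all children this
    process has waited for, so use a fresh interpreter per measurement.
    """
    subprocess.run(cmd, check=True)
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python peak_rss.py <command> [args...]
    print(f"peak RSS: {peak_rss_mb(sys.argv[1:]):.0f} MB")
```

Running both projects through the same wrapper with the same model, step count, and resolution keeps the comparison like for like.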
Currently, im2col is being used for convolutions, which consumes a very high amount of RAM during the VAE phase.
I have been working on a kernel that merges im2col and matrix multiplications to avoid materializing a lot of data in memory, although that entails a 40% performance reduction. So far, I am only doing this for CUDA; for CPU it will be more difficult and will likely have a negative impact on performance.
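To illustrate why materializing the im2col buffer is expensive, here is a rough back-of-the-envelope sketch. The layer shape (128 channels at 512x512, 3x3 kernel) and the 4-byte element size are assumptions chosen for illustration, not values taken from stable-diffusion.cpp, and the function names are hypothetical.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def input_bytes(c_in, h, w, bytes_per_elem=4):
    """Size of the convolution input tensor."""
    return c_in * h * w * bytes_per_elem


def im2col_bytes(c_in, h, w, kernel, stride=1, padding=0, bytes_per_elem=4):
    """Size of the im2col matrix: one column of c_in*k*k values per output pixel."""
    h_out = conv_out_size(h, kernel, stride, padding)
    w_out = conv_out_size(w, kernel, stride, padding)
    return c_in * kernel * kernel * h_out * w_out * bytes_per_elem


if __name__ == "__main__":
    # Assumed example layer: 128 input channels, 512x512, 3x3 kernel, padding 1.
    c_in, h, w, k = 128, 512, 512, 3
    mib = 2 ** 20
    print(f"input:  {input_bytes(c_in, h, w) / mib:.0f} MiB")
    print(f"im2col: {im2col_bytes(c_in, h, w, k, padding=1) / mib:.0f} MiB")
```

For a stride-1 convolution with "same" padding, the im2col buffer is k*k times the size of the input, so a 3x3 kernel costs 9x. A kernel that fuses im2col with the matrix multiplication never has to hold that buffer in full.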
> Currently, im2col is being used for convolutions, which consumes a very high amount of RAM during the VAE phase.
But I did my comparison with TAESD instead of the VAE, so I think that means the VAE isn't used at all? TAESD is super lightweight already.