
cuDNN benchmark for minor speed boost?

Open emoose opened this issue 2 years ago • 12 comments

(E: made a discussion page about this, hopefully can help more people see it: https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/3353)

For my GTX 1080, enabling cuDNN benchmarking reliably gives a small speed boost: training went from ~1.27s/it to ~1.05s/it (saving about 2 hours on the training estimate), while txt2img went from 5.30s/it to 5.07s/it.

Not a huge difference, but I wonder if RTX cards with tensor cores etc might have better increases?

To enable it I just edited `modules/sd_models.py` and, underneath `def setup_model():`, added:

    torch.backends.cudnn.benchmark = True

A caveat is that the first txt2img run takes a little while to start up, and also stays stuck at 100% for a little longer, nullifying the speed boost from the reduced s/it. That only seems to happen on the first txt2img run though; any runs after that don't have that issue (and "time taken" becomes ~30 seconds faster than without cuDNN benchmarking). I guess this is because cuDNN benchmarks each new operation the first time it runs, but I'm not sure.

Would be interested in hearing from any RTX users whether this gives a bigger speed increase for them (might need to try multiple runs though).

emoose avatar Oct 16 '22 07:10 emoose

cuDNN is enabled by default in Torch.

Your gain is likely from the benchmark option, which is highly variable in its benefit on SD - so that's why we never enabled it by default.

Nice find, but probably belongs more to the wiki or a guide.

C43H66N12O12S2 avatar Oct 16 '22 08:10 C43H66N12O12S2

Ah right, I did notice that the benchmark line by itself gives the same boost; I only included the enable line to be safe. I've updated the OP to remove it.

Still wonder how it might affect RTX cards though (if it's even noticeable compared to xformers etc 😄)

emoose avatar Oct 16 '22 09:10 emoose

Hey there, I'd like to know how you modified `def setup_model():`. Should it look like this?

def setup_model():
    torch.backends.cudnn.benchmark = True
    if not os.path.exists(model_path):
        os.makedirs(model_path)
    list_models()

Small-tailqwq avatar Oct 16 '22 12:10 Small-tailqwq

@Small-tailqwq yep that should enable it fine

emoose avatar Oct 16 '22 12:10 emoose

I'm using a 2060 graphics card in my laptop and I didn't get a significant performance boost after modifying the file; it also used more memory when I launched webui. Before the modification it was about 3.78it/s, after the modification 3.74it/s. I've also been using --force-enable-xformers --opt-split-attention; I don't know if that's related, but at least it doesn't give me much of a boost.
I will continue to test.

Small-tailqwq avatar Oct 16 '22 12:10 Small-tailqwq

@Small-tailqwq if you didn't already maybe try doing multiple runs with it, for me it only started showing a good improvement from my second txt2img generation onward for some reason.

I'd guess the boost from xformers probably outweighs any improvement this could give though.

emoose avatar Oct 16 '22 12:10 emoose

After many tests, here are the results on my computer (RTX 2060, 90 W):

| settings | 1 img | 6 imgs |
| --- | --- | --- |
| default | 3.73 it/s | 3.57 it/s |
| xformers | 4.53 it/s | 4.32 it/s |
| xformers + split-attention | 4.57 it/s | 4.32 it/s |
| torch.backends.cudnn.benchmark | 2.92 it/s | 3.48 it/s |
| xformers + cudnn.benchmark | 3.39 it/s | 4.16 it/s |

Maybe this modification helps GTX-series graphics cards more? For me, the improvement from xformers is more noticeable, but it tops out around 20%.

Small-tailqwq avatar Oct 16 '22 13:10 Small-tailqwq

The first iteration will be slower as it runs benchmarks on different variations of calculation methods. After the first iteration it will generally be faster. It's noted in the documentation that if the input sizes change a lot then it can actually be slower as it has to keep re-running the benchmarks.

I found that it sometimes seems to use more vram so isn't something I use all the time, with it enabled I get memory errors when generating huge images that I don't normally get.

zwishenzug avatar Oct 16 '22 14:10 zwishenzug

Brother, I tested again after installing cuDNN. It works together with xformers on an RTX 3080: before, 8 rounds of 8 images ran at about 1.1it/s; after modifying the file, the same parameters ran at 1.3it/s, and the pause after each round of image generation was noticeably shorter. I apologize for not testing it properly before, and I'm happy to report that this finding is also useful for RTX cards.

Small-tailqwq avatar Oct 21 '22 13:10 Small-tailqwq

GTX 1660 Ti user here. The speed boost is very significant! With Euler at 512x512, my speed went from 1.95-1.99 it/s to 2.13-2.15 it/s. Very nice tip!

congdm avatar Oct 21 '22 15:10 congdm

GTX 1070: Euler a went from 1.45 to 1.35 it/s, a small downgrade, but Euler went from 1.50 to 2.04 it/s, a good upgrade. Edit: never mind, past the first run Euler a is faster too (>1.90 it/s)! Edit 2: it seems to take a little more RAM; without it I can do 1024x1024, with it that always crashes. Tried 4 times on and off with the same prompt and settings.

Koumbaya avatar Oct 21 '22 17:10 Koumbaya

Here are my results on a 3080 Ti, image size 512x512, Euler a, 30 steps, batch size 8:

| settings | it/s |
| --- | --- |
| default | 2.10 |
| xformers | 2.57 |
| benchmark | 2.00 |
| benchmark + xformers | 2.50 |

So cudnn.benchmark actually degraded performance a bit for me. But as long as some users see an improvement, I think it is worth making it an option rather than having users edit the code.
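Exposing it as an option could be as simple as gating the setting behind a launch flag. A minimal sketch (the flag name `--cudnn-benchmark` is hypothetical, not an existing webui option):

```python
import argparse

# Hypothetical: gate the cuDNN benchmark setting behind a CLI flag so
# users opt in at launch instead of editing modules/sd_models.py.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--cudnn-benchmark",
    action="store_true",
    help="let cuDNN benchmark conv algorithms (may use more VRAM; "
         "first generation per image size is slower)",
)
args = parser.parse_args([])  # webui would pass the real argv here

if args.cudnn_benchmark:
    import torch
    torch.backends.cudnn.benchmark = True
```

Defaulting the flag to off matches the current behavior, since the benefit varies so much between cards.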

sgsdxzy avatar Oct 22 '22 08:10 sgsdxzy

Might want to close this @ClashSAN

aliencaocao avatar Dec 24 '22 07:12 aliencaocao