stable-diffusion-webui
Fix issue with 16xx cards
16xx cards don't handle FP16 properly by default, but with this simple workaround they work without --precision full and --no-half.
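For reference, the change being discussed appears to boil down to two lines like these (a sketch; the exact file and placement in the webui code are my assumption):

```python
import torch

# Force-enable cuDNN and let it benchmark convolution algorithms at runtime;
# the benchmark flag is what the thread below identifies as the effective fix.
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True
```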
This is a fix I've seen floated around in threads for a while now, and it's a curious one. Enabling cuDNN shouldn't have any effect, as cuDNN is enabled by default whenever it's available.
So, logically, only benchmark should be fixing this issue (and that seems more like a bug with PyTorch, to be honest). Could anybody with a 16xx card test enabling only benchmarking?
Anyhow, you should gate benchmark enablement on more than just an SM check for 16xx cards, as enabling benchmarking has highly variable results. It degraded performance on my 3080, for example.
Just added the 2 lines on a GTX 1660 Super (6 GB).
And indeed, I can start without command-line parameters and the image I get is fine (not black), but performance absolutely collapses: from 1.5 iterations/sec down to 2.5 sec/iteration.
With the same 2 lines active, but started with --no-half and --precision full, performance is back to normal.
I had a 1650 until recently and it worked fine with just "--medvram"; I didn't need --no-half or the like. (I'm on Linux.)
Maybe you could check which GPU is in use, if that's even possible, to filter which cards should get this.
I am a 1660 user and I use this fix in order to run it; and yeah, if there is a way to check whether the GPU is a 16xx card, I'll try to implement it. I haven't found one yet.
This might add a bit of startup time, since it loops over every card and checks its name against a list of Turing cards, but it's better for long-run performance.
torch.cuda.get_device_capability(device) == (7, 5)
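To illustrate the gating being discussed, a compute-capability check could look roughly like this (a sketch only; the function name and placement are mine, not the PR's):

```python
import torch

def any_turing_gpu() -> bool:
    # Compute capability 7.5 covers every Turing GPU, i.e. both the 16xx
    # and the 20xx series, so this check alone cannot tell them apart.
    return any(
        torch.cuda.get_device_capability(devid) == (7, 5)
        for devid in range(torch.cuda.device_count())
    )

if torch.cuda.is_available() and any_turing_gpu():
    torch.backends.cudnn.enabled = True
    torch.backends.cudnn.benchmark = True
```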
Why are the 20xx cards in the list though? They work fine now, and judging by other replies this change would just tank performance for no reason.
Some 20xx cards are also Turing, which is why they matched; as mentioned by C43H66N12O12S2, I'll implement the better solution.
I can confirm this fix works for me on a 1660 SUPER. Until now I've had to use the args "--precision full" and "--no-half", otherwise I get black images. With this change made I no longer see black images even without those args. (In both cases I am also using "--medvram" and "--xformers".)
It looks like @C43H66N12O12S2 was correct that it is the benchmarking change that is fixing this. I commented out "torch.backends.cudnn.enabled = True" and still saw this fix work. I guess that line can be removed from this change unless it has some other effect.
benchmark=True is the only thing that has an effect, yes. And as far as I know it improves performance if anything, at least from the second generation onwards, once the benchmarking has already been done.
By the way, calculations with 16-bit floats are extremely slow on 16xx cards, so even with this fix you should always be using --no-half anyway unless you're truly desperate for vram. Might be worth updating the documentation accordingly. (Although I don't know exactly which set of cards has fast 16-bit and which set doesn't.)
@yoinked-h @C43H66N12O12S2 Maybe I need torch.backends.cudnn.benchmark_limit = 0, because the number of convolution algorithms benchmarked by default is small, which can still leave the issue occurring on my 1650 card.
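If that setting helps, it would sit alongside the benchmark flag roughly like this (note that benchmark_limit only affects convolutions dispatched through the cuDNN v8 API and needs a reasonably recent PyTorch):

```python
import torch

torch.backends.cudnn.benchmark = True
# 0 removes the cap on how many candidate convolution algorithms cuDNN
# benchmarks, at the cost of a slower first iteration.
torch.backends.cudnn.benchmark_limit = 0
```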
I'll try it out with torch 2.
According to some tutorial sites, it seems that only the 16 series has this problem.
if any(["GeForce GTX 16" in torch.cuda.get_device_name(devid) for devid in range(0, torch.cuda.device_count())]):
may be better than
if any([torch.cuda.get_device_capability(devid) == (7, 5) for devid in range(0, torch.cuda.device_count())]):
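Put in context, the name-based gate might look something like this (again just a sketch; the surrounding structure is assumed):

```python
import torch

# Name-based check: only 16xx cards get the workaround, so 20xx cards
# (which also report compute capability 7.5) keep their current behaviour.
needs_fix = any(
    "GeForce GTX 16" in torch.cuda.get_device_name(devid)
    for devid in range(torch.cuda.device_count())
)

if needs_fix:
    torch.backends.cudnn.benchmark = True
```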
Yep; tensor cores are the main reason the 20xx series handles FP16 normally, the 16xx cards don't get that comfort.