stable-diffusion-webui
Fix issue with 16xx cards
16xx cards don't handle FP16 properly by default, but with this simple workaround they work without --precision full and --no-half.
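For reference, the change being discussed appears to boil down to two lines like these (a sketch; the exact file and placement in the webui code are my assumption):

```python
import torch

# Force-enable cuDNN and let it benchmark convolution algorithms at runtime;
# the benchmark flag is what the thread below identifies as the effective fix.
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True
```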
This is a fix I've seen floated around in threads for a while now, and it's a curious one. Enabling cuDNN shouldn't have any effect, as cuDNN is enabled by default whenever it's available.
So, logically, only benchmark should be fixing this issue (and that seems more like a bug with PyTorch, to be honest). Could anybody with a 16xx card test enabling only benchmarking?
Anyhow, you should gate benchmark enablement on more than just an SM check for 16xx cards, as enabling benchmarking has highly variable results. It degraded performance on my 3080, for example.
Just added the 2 lines on a GTX 1660 Super (6 GB).
And indeed, I can start without command-line parameters and the image I get is fine (not black), but performance absolutely collapses: from 1.5 iterations/sec down to 2.5 sec/iteration.
With the same 2 lines active, but started with --no-half and --precision full, performance is back to normal.
I had a 1650 until recently and it worked fine with just "--medvram"; I didn't need --no-half or the like. (I'm on Linux.)
Maybe you could check which GPU is in use, if that's even possible, to filter which cards should get this.
I am a 1660 user and I use this fix in order to run it; and yeah, if there is a way to check whether the GPU is a 16xx card, I'll try to implement it. I haven't found one yet.
This might add a bit of startup time, since it loops over every card and checks its name against a list of Turing cards, but it's better for long-run performance.
torch.cuda.get_device_capability(device) == (7, 5)
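To illustrate the gating being discussed, a compute-capability check could look roughly like this (a sketch only; the function name and placement are mine, not the PR's):

```python
import torch

def any_turing_gpu() -> bool:
    # Compute capability 7.5 covers every Turing GPU, i.e. both the 16xx
    # and the 20xx series, so this check alone cannot tell them apart.
    return any(
        torch.cuda.get_device_capability(devid) == (7, 5)
        for devid in range(torch.cuda.device_count())
    )

if torch.cuda.is_available() and any_turing_gpu():
    torch.backends.cudnn.enabled = True
    torch.backends.cudnn.benchmark = True
```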
Why are the 20xx cards in the list though? They work fine now, and judging by other replies this change would just tank performance for no reason.
Some 20xx cards are also Turing, which is why they matched; as mentioned by C43H66N12O12S2, I'll implement the better solution.
I can confirm this fix works for me on a 1660 SUPER. Until now I've had to use the args "--precision full" and "--no-half", otherwise I get black images. With this change made I no longer see black images even without those args. (In both cases I am also using "--medvram" and "--xformers".)
It looks like @C43H66N12O12S2 was correct that it is the benchmarking change that is fixing this. I commented out "torch.backends.cudnn.enabled = True" and still saw this fix work. I guess that line can be removed from this change unless it has some other effect.
benchmark=True is the only thing that has an effect, yes. And as far as I know it improves performance if anything, at least from the second generation onwards, once the benchmarking has already been done.
By the way, calculations with 16-bit floats are extremely slow on 16xx cards, so even with this fix you should always be using --no-half anyway unless you're truly desperate for vram. Might be worth updating the documentation accordingly. (Although I don't know exactly which set of cards has fast 16-bit and which set doesn't.)
@yoinked-h @C43H66N12O12S2 Maybe I need torch.backends.cudnn.benchmark_limit = 0, because the number of convolution algorithms benchmarked by default is small, which can still leave the issue occurring on my 1650 card.
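If that setting helps, it would sit alongside the benchmark flag roughly like this (note that benchmark_limit only affects convolutions dispatched through the cuDNN v8 API and needs a reasonably recent PyTorch):

```python
import torch

torch.backends.cudnn.benchmark = True
# 0 removes the cap on how many candidate convolution algorithms cuDNN
# benchmarks, at the cost of a slower first iteration.
torch.backends.cudnn.benchmark_limit = 0
```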
I'll try it out with torch 2.
According to some tutorial sites, it seems that only the 16 series has this problem.
if any(["GeForce GTX 16" in torch.cuda.get_device_name(devid) for devid in range(0, torch.cuda.device_count())]):
may be better than
if any([torch.cuda.get_device_capability(devid) == (7, 5) for devid in range(0, torch.cuda.device_count())]):
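Put in context, the name-based gate might look something like this (again just a sketch; the surrounding structure is assumed):

```python
import torch

# Name-based check: only 16xx cards get the workaround, so 20xx cards
# (which also report compute capability 7.5) keep their current behaviour.
needs_fix = any(
    "GeForce GTX 16" in torch.cuda.get_device_name(devid)
    for devid in range(torch.cuda.device_count())
)

if needs_fix:
    torch.backends.cudnn.benchmark = True
```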
Yep; tensor cores are the main reason the 20xx series handles FP16 normally, the 16xx cards don't get that comfort.