stable-diffusion-webui
[Bug]: --upcast-sampling is not working with CUDA
Is there an existing issue for this?
- [X] I have searched the existing issues and checked the recent builds/commits
What happened?
Enabling --upcast-sampling has no performance effect on a CUDA device. Disabling benchmark (which basically disables FP16 emulation on GTX 16xx cards) confirms this suspicion.
Steps to reproduce the problem
- Add the --upcast-sampling command line argument
- Try to generate a picture
- Compare results with --no-half enabled
What should have happened?
Performance should have become closer to --no-half, not stayed the same as when emulating FP16 (which is 2+ times slower than --no-half).
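For reference, a quick way to check whether upcast sampling is actually active is to compare the UNet weight dtype with the dtype inputs get cast to. This is only a debugging sketch, assuming the devices module exposes dtype_unet and unet_needs_upcast the way it does around this commit:

```python
# Debugging sketch (run from a console attached to the webui, or as a temporary print):
# with --upcast-sampling the UNet weights should stay float16 while inputs are
# cast to devices.dtype_unet before entering the UNet.
from modules import devices, shared

unet = shared.sd_model.model.diffusion_model
print("UNet weight dtype:", next(unet.parameters()).dtype)          # expect torch.float16
print("devices.dtype_unet:", devices.dtype_unet)                     # dtype inputs are cast to
print("unet_needs_upcast:", getattr(devices, "unet_needs_upcast", None))
```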
Commit where the problem happens
0cc0ee1
What platforms do you use to access the UI?
Windows
What browsers do you use to access the UI?
Microsoft Edge
Command Line Arguments
--opt-sub-quad-attention --upcast-sampling --medvram --no-half-vae
List of extensions
ControlNet
Console logs
0%| | 0/20 [00:14<?, ?it/s]
Error completing request
Arguments: ('task(g1nyc6eyh46h5g3)', 'test', '', [], 20, 0, False, False, 1, 1, 7, -1.0, -1.0, 0, 0, 0, False, 512, 512, False, 0.7, 2, 'Latent', 0, 0, 0, [], 0, False, 'none', 'None', 1, None, False, 'Scale to Fit (Inner Fit)', False, False, 64, 64, 64, 1, False, False, False, 'positive', 'comma', 0, False, False, '', 1, '', 0, '', True, False, False, 1, '', 0, '', 0, '', True, False, False, False, 0) {}
Traceback (most recent call last):
File "D:\stable-diffusion-webui\modules\call_queue.py", line 56, in f
res = list(func(*args, **kwargs))
File "D:\stable-diffusion-webui\modules\call_queue.py", line 37, in f
res = func(*args, **kwargs)
File "D:\stable-diffusion-webui\modules\txt2img.py", line 56, in txt2img
processed = process_images(p)
File "D:\stable-diffusion-webui\modules\processing.py", line 481, in process_images
res = process_images_inner(p)
File "D:\stable-diffusion-webui\modules\processing.py", line 627, in process_images_inner
samples_ddim = p.sample(conditioning=c, unconditional_conditioning=uc, seeds=seeds, subseeds=subseeds, subseed_strength=p.subseed_strength, prompts=prompts)
File "D:\stable-diffusion-webui\modules\processing.py", line 827, in sample
samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
File "D:\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 349, in sample
samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
File "D:\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 225, in launch_sampling
return func()
File "D:\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 349, in <lambda>
samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
File "C:\Users\stasd\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "D:\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py", line 145, in sample_euler_ancestral
denoised = model(x, sigmas[i] * s_in, **extra_args)
File "C:\Users\stasd\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "D:\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 143, in forward
devices.test_for_nans(x_out, "unet")
File "D:\stable-diffusion-webui\modules\devices.py", line 152, in test_for_nans
raise NansException(message)
modules.devices.NansException: A tensor with all NaNs was produced in Unet. This could be either because there's not enough precision to represent the picture, or because your video card does not support half type. Try setting the "Upcast cross attention layer to float32" option in Settings > Stable Diffusion or using the --no-half commandline argument to fix this. Use --disable-nan-check commandline argument to disable this check.
Additional information
Log recorded with benchmark disabled (otherwise SD doesn't throw any error, but it clearly just computes with emulated FP16; the speed difference is 2+ times). Upcast attention is also enabled in settings, if that matters.
Also, not sure if it's worth noting, but setting --precision full results in "expected Float but found Half".
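For context, "expected Float but found Half" is the usual PyTorch dtype-mismatch error between float32 activations and float16 weights. A minimal sketch that reproduces the same class of error (assumes a CUDA device; the layer is illustrative, not webui code):

```python
import torch

# A float32 input hitting a float16 layer, as can happen when --precision full
# keeps activations in float32 while some weights stay half.
layer = torch.nn.Linear(4, 4, device="cuda", dtype=torch.float16)
x = torch.randn(1, 4, device="cuda")  # float32 by default
layer(x)  # RuntimeError: Half/Float dtype mismatch (exact wording varies by PyTorch version)
```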
What model is being used? Have you tried switching models?
Also, is xformers installed?
Original 1.5; tried switching models. Will try to experiment with it a little more. xformers is installed.
Not sure if this helps, but it seems like the error happens while calculating x_out in sd_samplers_kdiffusion. None of the variables before the calculation are NaN, but the output is a tensor full of NaNs. Is self.inner_model supposed to be FP16?
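For anyone else digging into this: judging by the error message, devices.test_for_nans only raises when the whole tensor is NaN, so a small helper like the sketch below can be dropped next to the x_out call to see which dtypes are involved (variable names are assumed from sd_samplers_kdiffusion.py and may differ between commits):

```python
import torch

def report(name, t):
    """Tiny debugging helper: print a tensor's dtype and NaN stats."""
    print(f"{name}: dtype={t.dtype}, "
          f"any_nan={torch.isnan(t).any().item()}, "
          f"all_nan={torch.isnan(t).all().item()}")

# Intended usage inside CFGDenoiser.forward, right after x_out is filled:
#   report("x_in", x_in)
#   report("x_out", x_out)
#   print("inner_model dtype:", next(self.inner_model.parameters()).dtype)
```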
Managed to get it working partially (still slower than --no-half, but faster than without it): I left benchmark enabled and added --upcast-sampling and --precision full. TIs and hypernetworks work, but LoRAs throw "expected scalar type Float but found Half".
Solved the LoRA problem by adding "input = devices.cond_cast_unet(input)" at the beginning of the lora_forward function. It now works, but generation becomes slower with LoRAs. I've seen some people reporting a slowdown when generating images with LoRAs, so this might not be upcast related at all.
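For anyone trying to reproduce the fix, below is a self-contained toy (not the webui's actual lora_forward, whose body differs) showing why casting the input to the LoRA weights' dtype, which is what devices.cond_cast_unet does when upcasting is active, makes the Float/Half error go away:

```python
import torch
import torch.nn as nn

class ToyLoraLayer(nn.Module):
    """Toy stand-in for a LoRA down/up pair stored in float16 (assumes a CUDA device)."""
    def __init__(self, dim=8, rank=2):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False, device="cuda", dtype=torch.float16)
        self.up = nn.Linear(rank, dim, bias=False, device="cuda", dtype=torch.float16)

    def forward(self, x, res):
        # The fix: cast the incoming activation to the LoRA weights' dtype
        # (the equivalent of the devices.cond_cast_unet(input) line in the patch above).
        x = x.to(self.down.weight.dtype)
        return res + self.up(self.down(x)).to(res.dtype)

lora = ToyLoraLayer()
x = torch.randn(1, 8, device="cuda")   # float32 activation, as under --precision full
res = torch.randn(1, 8, device="cuda")
print(lora(x, res).dtype)              # torch.float32; without the cast, Linear raises a dtype error
```

The extra per-call casts are not free, so they might also contribute to the LoRA slowdown mentioned above.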