
[Bug]: --upcast-sampling is not working with CUDA

Open FNSpd opened this issue 1 year ago • 6 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and checked the recent builds/commits

What happened?

Enabling --upcast-sampling has no performance effect on a CUDA device. Disabling the benchmark option (which effectively disables FP16 emulation on GTX 16xx cards) confirms the suspicion.

Steps to reproduce the problem

  1. Add the --upcast-sampling command line argument
  2. Try to generate a picture
  3. Compare the results with --no-half enabled

What should have happened?

Performance should have become closer to --no-half, not stayed the same as when emulating FP16 (which is 2+ times slower than --no-half).
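For a rough idea of the difference (an illustrative PyTorch sketch only, not webui's actual code): --no-half keeps the whole model in float32, while upcast sampling is supposed to keep the weights in float16 and only upcast where the extra precision is needed.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# illustration only -- these names are made up, not webui's API
unet = torch.nn.Linear(320, 320).to(device)
x = torch.randn(1, 320, device=device)

# --no-half: model and inputs stay in float32 throughout
out_no_half = unet.float()(x.float())

# --upcast-sampling idea: weights stay in float16 (saving memory/bandwidth),
# and only the precision-sensitive parts are computed in float32
if device == "cuda":
    unet_fp16 = unet.half()
    out_upcast = unet_fp16(x.half()).float()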

Commit where the problem happens

0cc0ee1

What platforms do you use to access the UI ?

Windows

What browsers do you use to access the UI ?

Microsoft Edge

Command Line Arguments

--opt-sub-quad-attention --upcast-sampling --medvram --no-half-vae

List of extensions

ControlNet

Console logs

0%|          | 0/20 [00:14<?, ?it/s]
Error completing request
Arguments: ('task(g1nyc6eyh46h5g3)', 'test', '', [], 20, 0, False, False, 1, 1, 7, -1.0, -1.0, 0, 0, 0, False, 512, 512, False, 0.7, 2, 'Latent', 0, 0, 0, [], 0, False, 'none', 'None', 1, None, False, 'Scale to Fit (Inner Fit)', False, False, 64, 64, 64, 1, False, False, False, 'positive', 'comma', 0, False, False, '', 1, '', 0, '', True, False, False, 1, '', 0, '', 0, '', True, False, False, False, 0) {}
Traceback (most recent call last):
  File "D:\stable-diffusion-webui\modules\call_queue.py", line 56, in f
    res = list(func(*args, **kwargs))
  File "D:\stable-diffusion-webui\modules\call_queue.py", line 37, in f
    res = func(*args, **kwargs)
  File "D:\stable-diffusion-webui\modules\txt2img.py", line 56, in txt2img
    processed = process_images(p)
  File "D:\stable-diffusion-webui\modules\processing.py", line 481, in process_images
    res = process_images_inner(p)
  File "D:\stable-diffusion-webui\modules\processing.py", line 627, in process_images_inner
    samples_ddim = p.sample(conditioning=c, unconditional_conditioning=uc, seeds=seeds, subseeds=subseeds, subseed_strength=p.subseed_strength, prompts=prompts)
  File "D:\stable-diffusion-webui\modules\processing.py", line 827, in sample
    samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
  File "D:\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 349, in sample
    samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
  File "D:\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 225, in launch_sampling
    return func()
  File "D:\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 349, in <lambda>
    samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
  File "C:\Users\stasd\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "D:\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py", line 145, in sample_euler_ancestral
    denoised = model(x, sigmas[i] * s_in, **extra_args)
  File "C:\Users\stasd\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 143, in forward
    devices.test_for_nans(x_out, "unet")
  File "D:\stable-diffusion-webui\modules\devices.py", line 152, in test_for_nans
    raise NansException(message)
modules.devices.NansException: A tensor with all NaNs was produced in Unet. This could be either because there's not enough precision to represent the picture, or because your video card does not support half type. Try setting the "Upcast cross attention layer to float32" option in Settings > Stable Diffusion or using the --no-half commandline argument to fix this. Use --disable-nan-check commandline argument to disable this check.

Additional information

The log was recorded with benchmark disabled (otherwise SD doesn't throw any error, but it's clear that it just computes using FP16 emulation; the speed difference is 2+ times). Upcast cross attention is also enabled in settings, in case that matters.
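For reference, the "benchmark" here is the cuDNN benchmark workaround webui applies to GTX 16xx (compute capability 7.5) cards; disabling it just means commenting out that line. A sketch from memory of the relevant bit of modules/devices.py (the exact code may differ):

import torch

if torch.cuda.is_available():
    # webui turns on cuDNN benchmark for 7.5-capability cards, which is what
    # lets them run fp16 at all; "benchmark disabled" means this line was commented out
    if any(torch.cuda.get_device_capability(d) == (7, 5) for d in range(torch.cuda.device_count())):
        torch.backends.cudnn.benchmark = True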

FNSpd avatar Mar 07 '23 12:03 FNSpd

Also, I don't know if it's worth noting, but setting --precision full results in "expected Float but found Half".

FNSpd avatar Mar 07 '23 12:03 FNSpd

What model is being used? Have you tried switching models?

Also, is xformers installed?

fractal-fumbler avatar Mar 08 '23 11:03 fractal-fumbler

What model is being used? Have you tried switching models?

Also, is xformers installed?

The original 1.5 model; I tried switching models. I'll experiment with it a little more. xformers is installed.

FNSpd avatar Mar 08 '23 12:03 FNSpd

Not sure if this helps, but it seems the error happens while calculating x_out in sd_samplers_kdiffusion. None of the variables before the calculation are NaN, but the output is a tensor full of NaNs. Is self.inner_model supposed to be FP16? (screenshot attached)
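In case it's useful, these are the kind of hypothetical debug lines one could drop into CFGDenoiser.forward in sd_samplers_kdiffusion.py, right before the devices.test_for_nans(x_out, "unet") call from the traceback, to see the dtypes involved (variable names as in that function, nothing new):

# hypothetical debug prints inside CFGDenoiser.forward, before devices.test_for_nans(x_out, "unet")
print("x_in dtype:", x_in.dtype, "sigma_in dtype:", sigma_in.dtype)
print("inner_model param dtype:", next(self.inner_model.parameters()).dtype)
print("NaNs in x_out:", torch.isnan(x_out).any().item())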

FNSpd avatar Mar 10 '23 07:03 FNSpd

I managed to get it partially working (still slower than --no-half, but faster than without it): I left benchmark enabled and enabled both --upcast-sampling and --precision full. TIs and hypernetworks work, but LoRAs throw "expected scalar type Float but found Half".
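For anyone wanting to try the same combination on Windows, the flags would presumably go into webui-user.bat along these lines (keeping the other arguments from the report above):

set COMMANDLINE_ARGS=--opt-sub-quad-attention --upcast-sampling --precision full --medvram --no-half-vae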

FNSpd avatar Mar 10 '23 14:03 FNSpd

I solved the LoRA problem by adding "input = devices.cond_cast_unet(input)" at the beginning of the lora_forward function. It now works, but generation becomes slower with LoRAs. I've seen some people reporting slowdowns when generating images with LoRAs, so this might not be upcast-related at all.
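Roughly, assuming the lora_forward signature in the built-in Lora extension at the time was lora_forward(module, input, res), the change looks like this (a sketch; the rest of the function is unchanged):

from modules import devices

def lora_forward(module, input, res):
    # cast incoming activations to the unet dtype so the fp16 LoRA weights
    # don't hit fp32 inputs ("expected scalar type Float but found Half")
    input = devices.cond_cast_unet(input)
    # ... rest of lora_forward unchanged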

FNSpd avatar Mar 13 '23 08:03 FNSpd