
[Bug]: --upcast-sampling is not working with CUDA

Open FNSpd opened this issue 1 year ago • 6 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and checked the recent builds/commits

What happened?

Enabling --upcast-sampling has no performance effect on a CUDA device. Disabling the benchmark option (which effectively disables FP16 emulation on GTX 16xx cards) confirms the suspicion.

Steps to reproduce the problem

  1. Add the --upcast-sampling command line argument
  2. Try to generate a picture
  3. Compare the results with --no-half enabled

What should have happened?

Performance should have become closer to --no-half, not stayed the same as when emulating FP16 (which is 2+ times slower than --no-half).
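For a rough idea of the difference (an illustrative PyTorch sketch only, not webui's actual code): --no-half keeps the whole model in float32, while upcast sampling is supposed to keep the weights in float16 and only upcast where the extra precision is needed.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# illustration only -- these names are made up, not webui's API
unet = torch.nn.Linear(320, 320).to(device)
x = torch.randn(1, 320, device=device)

# --no-half: model and inputs stay in float32 throughout
out_no_half = unet.float()(x.float())

# --upcast-sampling idea: weights stay in float16 (saving memory/bandwidth),
# and only the precision-sensitive parts are computed in float32
if device == "cuda":
    unet_fp16 = unet.half()
    out_upcast = unet_fp16(x.half()).float()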

Commit where the problem happens

0cc0ee1

What platforms do you use to access the UI ?

Windows

What browsers do you use to access the UI ?

Microsoft Edge

Command Line Arguments

--opt-sub-quad-attention --upcast-sampling --medvram --no-half-vae

List of extensions

ControlNet

Console logs

0%|          | 0/20 [00:14<?, ?it/s]
Error completing request
Arguments: ('task(g1nyc6eyh46h5g3)', 'test', '', [], 20, 0, False, False, 1, 1, 7, -1.0, -1.0, 0, 0, 0, False, 512, 512, False, 0.7, 2, 'Latent', 0, 0, 0, [], 0, False, 'none', 'None', 1, None, False, 'Scale to Fit (Inner Fit)', False, False, 64, 64, 64, 1, False, False, False, 'positive', 'comma', 0, False, False, '', 1, '', 0, '', True, False, False, 1, '', 0, '', 0, '', True, False, False, False, 0) {}
Traceback (most recent call last):
  File "D:\stable-diffusion-webui\modules\call_queue.py", line 56, in f
    res = list(func(*args, **kwargs))
  File "D:\stable-diffusion-webui\modules\call_queue.py", line 37, in f
    res = func(*args, **kwargs)
  File "D:\stable-diffusion-webui\modules\txt2img.py", line 56, in txt2img
    processed = process_images(p)
  File "D:\stable-diffusion-webui\modules\processing.py", line 481, in process_images
    res = process_images_inner(p)
  File "D:\stable-diffusion-webui\modules\processing.py", line 627, in process_images_inner
    samples_ddim = p.sample(conditioning=c, unconditional_conditioning=uc, seeds=seeds, subseeds=subseeds, subseed_strength=p.subseed_strength, prompts=prompts)
  File "D:\stable-diffusion-webui\modules\processing.py", line 827, in sample
    samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
  File "D:\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 349, in sample
    samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
  File "D:\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 225, in launch_sampling
    return func()
  File "D:\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 349, in <lambda>
    samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
  File "C:\Users\stasd\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "D:\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py", line 145, in sample_euler_ancestral
    denoised = model(x, sigmas[i] * s_in, **extra_args)
  File "C:\Users\stasd\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 143, in forward
    devices.test_for_nans(x_out, "unet")
  File "D:\stable-diffusion-webui\modules\devices.py", line 152, in test_for_nans
    raise NansException(message)
modules.devices.NansException: A tensor with all NaNs was produced in Unet. This could be either because there's not enough precision to represent the picture, or because your video card does not support half type. Try setting the "Upcast cross attention layer to float32" option in Settings > Stable Diffusion or using the --no-half commandline argument to fix this. Use --disable-nan-check commandline argument to disable this check.

Additional information

The log was recorded with benchmark disabled (otherwise SD doesn't throw any error, but it's clear that it just computes using FP16 emulation; the speed difference is 2+ times). Upcast cross attention is also enabled in settings, in case that matters.
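For reference, the "benchmark" here is the cuDNN benchmark workaround webui applies to GTX 16xx (compute capability 7.5) cards; disabling it just means commenting out that line. A sketch from memory of the relevant bit of modules/devices.py (the exact code may differ):

import torch

if torch.cuda.is_available():
    # webui turns on cuDNN benchmark for 7.5-capability cards, which is what
    # lets them run fp16 at all; "benchmark disabled" means this line was commented out
    if any(torch.cuda.get_device_capability(d) == (7, 5) for d in range(torch.cuda.device_count())):
        torch.backends.cudnn.benchmark = True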

FNSpd avatar Mar 07 '23 12:03 FNSpd

Also, I don't know if it's worth noting, but setting --precision full results in "expected Float but found Half".

FNSpd avatar Mar 07 '23 12:03 FNSpd

What model is being used? Have you tried switching models?

Also, is xformers installed?

fractal-fumbler avatar Mar 08 '23 11:03 fractal-fumbler

What model is being used? Have you tried switching models?

Also, is xformers installed?

The original 1.5 model; I tried switching models. I'll experiment with it a little more. xformers is installed.

FNSpd avatar Mar 08 '23 12:03 FNSpd

Not sure if this helps, but it seems the error happens while calculating x_out in sd_samplers_kdiffusion. None of the variables before the calculation are NaN, but the output is a tensor full of NaNs. Is self.inner_model supposed to be FP16? (screenshot attached)
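In case it's useful, these are the kind of hypothetical debug lines one could drop into CFGDenoiser.forward in sd_samplers_kdiffusion.py, right before the devices.test_for_nans(x_out, "unet") call from the traceback, to see the dtypes involved (variable names as in that function, nothing new):

# hypothetical debug prints inside CFGDenoiser.forward, before devices.test_for_nans(x_out, "unet")
print("x_in dtype:", x_in.dtype, "sigma_in dtype:", sigma_in.dtype)
print("inner_model param dtype:", next(self.inner_model.parameters()).dtype)
print("NaNs in x_out:", torch.isnan(x_out).any().item())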

FNSpd avatar Mar 10 '23 07:03 FNSpd

I managed to get it partially working (still slower than --no-half, but faster than without it): I left benchmark enabled and enabled both --upcast-sampling and --precision full. TIs and hypernetworks work, but LoRAs throw "expected scalar type Float but found Half".
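For anyone wanting to try the same combination on Windows, the flags would presumably go into webui-user.bat along these lines (keeping the other arguments from the report above):

set COMMANDLINE_ARGS=--opt-sub-quad-attention --upcast-sampling --precision full --medvram --no-half-vae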

FNSpd avatar Mar 10 '23 14:03 FNSpd

I solved the LoRA problem by adding "input = devices.cond_cast_unet(input)" at the beginning of the lora_forward function. It now works, but generation becomes slower with LoRAs. I've seen some people reporting slowdowns when generating images with LoRAs, so this might not be upcast-related at all.
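Roughly, assuming the lora_forward signature in the built-in Lora extension at the time was lora_forward(module, input, res), the change looks like this (a sketch; the rest of the function is unchanged):

from modules import devices

def lora_forward(module, input, res):
    # cast incoming activations to the unet dtype so the fp16 LoRA weights
    # don't hit fp32 inputs ("expected scalar type Float but found Half")
    input = devices.cond_cast_unet(input)
    # ... rest of lora_forward unchanged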

FNSpd avatar Mar 13 '23 08:03 FNSpd