[Performance] Keep sigmas on CPU
Description
- Currently, the k-diffusion sampler code creates the sigmas tensor on the GPU. This forces a device sync every step, because the sigmas are used for control flow within the sampler code.
- I changed them to stay on the CPU. Every operation that uses the sigma values works with them on the CPU (much as it would with a native Python list of floats), and the GPU can now work ahead because control flow no longer depends on it (a minimal sketch of the idea follows below).
- There are still other blocking ops, so this may not give an immediate performance benefit on its own. I'll open a separate PR for the other one that I know of.
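A minimal sketch of the idea (not the exact diff in this PR), assuming webui's `devices.cpu` helper and the standard k-diffusion Karras schedule; `make_sigmas_on_cpu` is just an illustrative name:

```python
import k_diffusion.sampling as k_sampling
from modules import devices  # webui device helpers; devices.cpu is torch.device("cpu")

def make_sigmas_on_cpu(steps, sigma_min, sigma_max):
    # Build the sigma schedule on the CPU instead of the GPU. Per-step checks
    # such as `sigmas[i] == 0` then behave like comparisons on plain Python
    # floats and never have to wait for queued GPU kernels to finish.
    return k_sampling.get_sigmas_karras(
        n=steps, sigma_min=sigma_min, sigma_max=sigma_max, device=devices.cpu
    )
```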
Checklist:
- [x] I have read the contributing wiki page
- [x] I have performed a self-review of my own code
- [x] My code follows the style guidelines
- [x] My code passes tests
It seems it interferes with https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/15751?
I get
Traceback (most recent call last):
File "G:\Stable difussion\stable-diffusion-webui\modules\call_queue.py", line 57, in f
res = list(func(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\modules\call_queue.py", line 36, in f
res = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\modules\txt2img.py", line 109, in txt2img
processed = processing.process_images(p)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\modules\processing.py", line 839, in process_images
res = process_images_inner(p)
^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\extensions\sd-webui-controlnet\scripts\batch_hijack.py", line 59, in processing_process_images_hijack
return getattr(processing, '__controlnet_original_process_images_inner')(p, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\modules\processing.py", line 975, in process_images_inner
samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\modules\processing.py", line 1322, in sample
samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 181, in sample
sigmas = self.get_sigmas(p, steps)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 118, in get_sigmas
sigmas = scheduler.function(n=steps, **sigmas_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: get_align_your_steps_sigmas() missing 1 required positional argument: 'device'
---
> It seems it interferes with #15751?
That is a problem on their end, and can be resolved by not sending the sigmas to the device.
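To illustrate what that would mean in code, here's a hedged sketch (not the actual webui or #15751 code; `call_scheduler` is a hypothetical helper) of a call site that keeps the sigmas on the CPU while staying compatible with scheduler functions that still take a `device` argument:

```python
import inspect
from modules import devices  # devices.cpu is torch.device("cpu") in webui

def call_scheduler(scheduler, steps, **sigmas_kwargs):
    # Only pass a device to scheduler functions that still accept one, and
    # always point it at the CPU so the resulting sigmas never live on the GPU.
    if "device" in inspect.signature(scheduler.function).parameters:
        sigmas_kwargs["device"] = devices.cpu
    return scheduler.function(n=steps, **sigmas_kwargs)
```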
I was testing with hr-fix and it seems it doesn't work?
Traceback (most recent call last):
File "G:\Stable difussion\stable-diffusion-webui\modules\call_queue.py", line 57, in f
res = list(func(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\modules\call_queue.py", line 36, in f
res = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\modules\txt2img.py", line 109, in txt2img
processed = processing.process_images(p)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\modules\processing.py", line 839, in process_images
res = process_images_inner(p)
^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\extensions\sd-webui-controlnet\scripts\batch_hijack.py", line 59, in processing_process_images_hijack
return getattr(processing, '__controlnet_original_process_images_inner')(p, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\modules\processing.py", line 975, in process_images_inner
samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\modules\processing.py", line 1338, in sample
return self.sample_hr_pass(samples, decoded_samples, seeds, subseeds, subseed_strength, prompts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\modules\processing.py", line 1423, in sample_hr_pass
samples = self.sampler.sample_img2img(self, samples, noise, self.hr_c, self.hr_uc, steps=self.hr_second_pass_steps or self.steps, image_conditioning=image_conditioning)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 172, in sample_img2img
samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\modules\sd_samplers_common.py", line 272, in launch_sampling
return func()
^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 172, in <lambda>
samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\venv\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\modules\sd_samplers_extra.py", line 71, in restart_sampler
x = heun_step(x, old_sigma, new_sigma)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\modules\sd_samplers_extra.py", line 20, in heun_step
d = to_d(x, old_sigma, denoised)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Stable difussion\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py", line 48, in to_d
return (x - denoised) / utils.append_dims(sigma, x.ndim)
~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
---
It only happens when the hi-res pass starts to generate (after the tiled upscale process).
> I was testing with hr-fix and it seems it doesn't work?
That's not a hi-res fix issue; it's a k-diffusion issue specific to that sampler (Euler and Heun at least; DPM++ 2M works fine). I'll have to work on an upstream patch for k-diffusion for this fix to be viable, then. Marking as draft until then.
So, to resolve this, either the upstream PR (https://github.com/crowsonkb/k-diffusion/pull/109) needs to be merged or I have to monkey patch the function. This probably isn't going to be a merge candidate any time soon, so I'll give the upstream PR some time.
Now that the new schedulers are merged, I've gone ahead and patched the to_d function here and stripped the device argument from all schedulers, as it is no longer necessary. Should be ready to merge.
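For reference, a minimal sketch of the kind of to_d patch described above (the actual change in this PR and in crowsonkb/k-diffusion#109 may differ; `to_d_cpu_sigma` is just an illustrative name):

```python
import k_diffusion.sampling
from k_diffusion import utils

def to_d_cpu_sigma(x, sigma, denoised):
    # Same math as k_diffusion.sampling.to_d, but tolerant of a sigma tensor
    # that lives on the CPU while x is on the GPU: a single sigma is divided
    # in as a Python scalar, otherwise it is moved to x's device first.
    if sigma.numel() == 1:
        return (x - denoised) / sigma.item()
    return (x - denoised) / utils.append_dims(sigma.to(x.device), x.ndim)

k_diffusion.sampling.to_d = to_d_cpu_sigma  # monkey patch applied at startup
```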
Is there a performance improvement?
> Is there a performance improvement?
Yes. As stated, the current implementation forces a device sync during every sampling step: for certain control-flow operations the CPU has to wait for the GPU to finish its queued work. There's a more detailed explanation of the impact here: https://github.com/crowsonkb/k-diffusion/issues/108. And it is completely safe to keep the sigmas tensor on the CPU (aside from the single operation I had to patch), because the sigmas are only used for control flow and scalar operations.
Notably, this is also the last blocking operation that I am aware of in the main inference loop, so PyTorch dispatch will be completely unblocked after this and you should see nearly uninterrupted 100% GPU usage.
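To make the sync concrete, here's a small illustration (not webui code) of why branching on a GPU-resident sigma stalls the host while the same check on a CPU tensor does not:

```python
import torch

sigmas_gpu = torch.linspace(14.6, 0.0, 20, device="cuda")
sigmas_cpu = sigmas_gpu.cpu()

# Branching on a GPU tensor forces a device sync: the boolean has to be
# copied back to the host, so Python waits for the GPU work queued ahead
# of it to finish before it can take the branch.
if sigmas_gpu[3] > 0:
    pass

# The same branch on a CPU tensor is evaluated immediately; kernels already
# queued on the GPU keep running while the CPU dispatches the next step.
if sigmas_cpu[3] > 0:
    pass
```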
What are the best conditions for observing the improvements? Doing 50-step 512x768 SD1 generations on 3090, and iterations per second between this and dev are not noticeably different.
> What are the best conditions for observing the improvements? Doing 50-step 512x768 SD1 generations on 3090, and iterations per second between this and dev are not noticeably different.
It's probably easiest to watch GPU usage with a fast update interval, since it should ideally almost never drop below 100%. Without this change you'll notice a small dip between each step. Since it happens once per step it may be subtle, and it would probably be more pronounced on slower CPUs. But if you can see that dip, you know the GPU is sitting idle for some period when it should be working.
My standard test was always 512x512 at batch size 4. For completeness, I was also using --opt-channelslast (which, after the other performance patches, should now be noticeably faster on Ampere and above, where it wasn't always before). Live previews will probably also need to be off, since they also include blocking operations.
Here's what I get with live previews disabled:
dev:
100%|█████████████████| 50/50 [00:13<00:00, 3.73it/s]
100%|█████████████████| 50/50 [00:13<00:00, 3.66it/s]
100%|█████████████████| 50/50 [00:13<00:00, 3.70it/s]
This PR:
100%|█████████████████| 50/50 [00:13<00:00, 3.63it/s]
100%|█████████████████| 50/50 [00:13<00:00, 3.68it/s]
100%|█████████████████| 50/50 [00:13<00:00, 3.66it/s]
Let's see if anyone else reports improvement.
All right, I'm ready to merge this functionality in, but after playing with it, my conclusion is that the effect can be achieved by just doing
`sigmas = scheduler.function(n=steps, **sigmas_kwargs, device=devices.cpu)`
There's no need for all the other changes that remove the device argument from the samplers code. If you want, you can edit this PR to only have that change, or I can do that change myself. Or if I'm wrong, correct me.
> All right, I'm ready to merge this functionality in, but after playing with it, my conclusion is that the effect can be achieved by just doing
> `sigmas = scheduler.function(n=steps, **sigmas_kwargs, device=devices.cpu)`
> There's no need for all the other changes that remove the device argument from the samplers code. If you want, you can edit this PR to only have that change, or I can do that change myself. Or if I'm wrong, correct me.
Yeah, I agree the device arguments don't necessarily have to be removed; setting it to CPU should work. The to_d patch is still necessary so as not to break Euler, Heun, and a couple of other samplers, I think, since to_d performs a pointwise op that needs the sample and the sigmas on the same device (unless we have different sigmas within a batch, this is equivalent to the scalar multiply it effectively is now).