stable-diffusion-webui
[WIP] Asynchronous model mover for lowvram
Description
- This is an attempt to speed up --lowvram by taking the model moving out of the forward loop.
- The model moving is made asynchronous by creating a separate CUDA stream dedicated to moving the model, and by using CUDA events to synchronize back to the default stream.
- A lookahead buffer zone is designed so that model moving stays ahead of the forward pass, keeping the GPU busy at all times.
I'm getting 3.7 it/s on a 3060 Laptop with half the VRAM usage of --medvram. It was originally 1.65 it/s. For reference, the --medvram speed is 5.8 it/s.
Concerns
- This is still a prototype, and not all original semantics are followed.
- CUDA streams and CUDA events are used, which are CUDA-specific. I think IPEX has similar concepts, but DML has nothing comparable.
- The size of the lookahead buffer is a tweakable setting. A larger buffer increases VRAM usage; a smaller buffer would probably make the forward pass a bit slower. The speed gained from a larger buffer has a limit.
Checklist:
- [x] I have read contributing wiki page
- [x] I have performed a self-review of my own code
- [x] My code follows the style guidelines
- [x] My code passes tests
Smart mover
The smart mover does something similar to Forge: it only moves tensors from CPU to GPU, never back.
At some point I was somehow able to get the same or even 2x faster speed than sd-webui-forge under --always-no-vram (which is roughly equivalent to --lowvram) when max_prefetch is 5-8. Now I can only get it as fast as Forge, and the output is somehow broken. Unfortunately I did not save the file at that time. There are bugs hidden somewhere in the code, but I am getting tired of trying to find them.
I'm going to leave it as is and come back when I get interested again.
The broken images seem to have been caused by not synchronizing back to the creation stream after usage. Fixed.
Also changed to layer-wise movement.
There might be a problem with extra networks. Haven't looked into that.
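To make the "never back" part concrete, here is a minimal, hedged sketch of the idea at the layer level (run_layer is a hypothetical helper, not the PR's code; buffers and CUDA streams are omitted for brevity):

```python
import torch

def run_layer(layer: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    # Master copies of the weights stay on the CPU; a temporary GPU copy is used
    # for the forward pass and then simply dropped, never copied back to the CPU.
    cpu_params = {name: p.data for name, p in layer.named_parameters(recurse=False)}
    try:
        for _, p in layer.named_parameters(recurse=False):
            p.data = p.data.to('cuda', non_blocking=True)   # CPU -> GPU copy
        return layer(x)
    finally:
        for name, p in layer.named_parameters(recurse=False):
            p.data = cpu_params[name]   # restore the CPU master; the GPU copy is freed
```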
This looks very cool, but please don't change the formatting of those existing lines in lowvram.py (newlines and quotes), put those new classes into a separate file, and write a bit of a comment there about how the performance gain is achieved. Also maybe add an option to use the old method even if streaming is supported.
Need some help making this support Lora/ControlNets.
As these probably alter the weights and biases, the tensors cached in the mover may be outdated, and a slow path will be taken.
I'll be honest with you, I don't know how it works, so I can't help either. The "not moving from GPU to CPU" part is smart and reasonable, and it can be implemented with ease, but the CUDA streams thing would require me to get a lot more involved to understand.
Plus, there is FP8 support now; maybe that can work better than lowvram for the people who need it?
The CUDA stream thing is used because I want to overlap memcpy with compute. Streams can be thought of as threads.
Briefly speaking, it does several things, all in a non-blocking way to Python (a minimal code sketch follows after the lists below):
- On stream B, the CPU tensors are copied to CUDA.
- On stream B, it calls record_event, which can be seen as a timeline marker. It marks the tensor as ready.
- On the default stream, it waits for the ready event, and then computes the forward pass with the tensor.
- On the default stream, it calls record_event, which marks the work on this tensor as done.
- On stream B, it waits for the done event, so the tensor is eventually deallocated when it is deleted, after it has finished its job.
Apart from the moving itself, I have to do these things in addition:
- Track the CUDA tensor so it is the one used for forwarding
- Save the events so they can be waited on from the other stream
- Maintain references to the tensors so they are deallocated at the right time
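Here is a minimal sketch of that stream/event handshake in PyTorch. The helper names (prefetch, use_and_release) are illustrative only; this shows the pattern described above, not the PR's actual mover classes:

```python
import torch

copy_stream = torch.cuda.Stream()              # "stream B", dedicated to host-to-device copies
compute_stream = torch.cuda.default_stream()   # default stream running the forward pass

def prefetch(cpu_tensor):
    """Copy a (pinned) CPU tensor to the GPU on the copy stream and mark it ready."""
    with torch.cuda.stream(copy_stream):
        gpu_tensor = cpu_tensor.to('cuda', non_blocking=True)
        ready = torch.cuda.Event()
        ready.record(copy_stream)              # timeline marker: the tensor is ready
    return gpu_tensor, ready

def use_and_release(gpu_tensor, ready, forward_fn):
    """Consume the prefetched tensor on the default stream, then let stream B reuse the memory."""
    compute_stream.wait_event(ready)           # default stream waits for the "ready" event
    out = forward_fn(gpu_tensor)               # compute the forward with the tensor
    done = torch.cuda.Event()
    done.record(compute_stream)                # marks the work on this tensor as done
    copy_stream.wait_event(done)               # stream B waits for "done", so once the last
                                               # reference to gpu_tensor is dropped, its memory
                                               # can be safely reused by later copies
    return out
```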
Can the "not moving from GPU to CPU" really be implemented easily? Torch moves modules in-place, which is a major pain that prevents me from keeping the implementation simple and forces me to place hooks on torch.nn.functional. I assume I would have to do a deep copy to achieve this, and that sounds costly. CUDA streams, on the other hand, look easier.
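For reference, a hook on torch.nn.functional could look roughly like the sketch below. This is only an illustration of the pattern, not the actual hook used in this repo:

```python
import torch
import torch.nn.functional as F

_original_linear = F.linear

def _linear_with_mover(input, weight, bias=None):
    # If the weight was left on the CPU, substitute a temporary GPU copy just for
    # this call; the CPU master is untouched and nothing is moved back afterwards.
    if weight.device != input.device:
        weight = weight.to(input.device, non_blocking=True)
    if bias is not None and bias.device != input.device:
        bias = bias.to(input.device, non_blocking=True)
    return _original_linear(input, weight, bias)

F.linear = _linear_with_mover   # monkeypatch; remember to restore _original_linear later
```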
Regarding FP8, I think it does not hurt to have more options.
Actually, there are 2 main pain points that drive me here:
- To do "not moving from GPU to CPU" at the module level, I need to clone the module and use the clone for forwarding. It can't be done with a forward hook and can't be reliably done by monkeypatching forward.
- To actually gain a performance benefit, I need to know the next N tensors while hooking, and I don't know how to do this in a reasonable way. There is an alternative, which is avoiding CUDA stream synchronization in the middle of the computation, so I can queue all jobs before they run. In that situation the result will not be immediately available to the Python world after forward. AFAIK Torch, however, performs synchronization on every module's forward, so this is hard to do.
A better way is implemented here, which uses the async nature of CUDA. One thing to note: for the acceleration to work, the weights and biases of the UNet must be placed in non-pageable (pinned) memory (they will go back to pageable memory if the module is somehow to()-ed). Lora is tested.
However, should any extension or module touch the weights and biases of the model (by using to(), for example), it needs to make them pinned again with ._apply(lambda x: x.pin_memory()). Otherwise it will fall back to the slow path.
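For example (a hedged sketch; the Linear module here only stands in for the real UNet):

```python
import torch

unet = torch.nn.Linear(8, 8)              # stand-in for the real UNet module
unet.to(dtype=torch.float16)              # .to() leaves the weights in pageable memory again
unet._apply(lambda x: x.pin_memory())     # re-pin so the async mover keeps its fast path
```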
As I understand it, this requires more VRAM than the old lowvram. Maybe you should disable it by default?
Also, thanks to FP8, 2 GB is enough for medvram, and lowvram saves only about 500 MB. With this patch, this already small VRAM difference will become even smaller.
> As I understand it, this requires more VRAM than the old lowvram. Maybe you should disable it by default?
I profiled with PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync and without FP8. The original implementation takes 166 MB, while this implementation takes 387 MB. The difference is negligible compared with the difference in sampling step time, which is 625 ms vs 229 ms.
Also, the two streams do not go out of sync by a big margin.
> Also, thanks to FP8, 2 GB is enough for medvram, and lowvram saves only about 500 MB. With this patch, this already small VRAM difference will become even smaller.
False. This takes significantly less VRAM. 890 MB vs 350 MB. The speed difference is 200 ms per step vs 260 ms per step.
I saw in Discord that async lowvram keeps more than one layer on the GPU. But maybe it really requires even less VRAM, I don't know.
> The original implementation takes 755 MB
Is that the total VRAM usage? I tested on my 2 GB card; it ate about 1.7 GB in FP16 lowvram mode.
I will test this patch, the original lowvram, and medvram on it.
> Is that the total VRAM usage? I tested on my 2 GB card; it ate about 1.7 GB in FP16 lowvram mode.
It is the peak usage recorded by Nsight.
PYTORCH_CUDA_ALLOC_CONF makes a big difference here: the native backend does consume ~1.6 GB of VRAM. But I think it is a matter of GC, and it can resolve itself when there is VRAM pressure.
A closer look shows that it was the horizontal scale of the diagram; the actual usage is smaller. See the tooltips on the new screenshots.
GPU MX150 2GB
ARGS:
export PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
./webui.sh --xformers "$@"
+ fp8
4 steps 512x512
| Config | Time taken | A | R | Sys |
| --- | --- | --- | --- | --- |
| --lowvram | 22.1 sec | 0.96 GB | 1.19 GB | 1.3/1.95508 GB (66.0%) |
| --medvram | 18.7 sec | 1.33 GB | 1.72 GB | 1.8/1.95508 GB (93.2%) |
| this patch --lowvram | 17.2 sec | 1.17 GB | 1.84 GB | 1.9/1.95508 GB (99.7%) |
torch: 2.1.2+cu121 • xformers: 0.0.23.post1
Hm, this patch really requires more VRAM for me.
Maybe it ignores PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync? Maybe I need to update the Nvidia driver? Or maybe --xformers is the problem?
The same VRAM usage, but slower...
This patch, no xformers and no PYTORCH_CUDA_ALLOC_CONF:
Time taken: 23.2 sec.
A: 1.25 GB, R: 1.85 GB, Sys: 1.9/1.95508 GB (99.2%)
Maybe your actual compute work is lagging behind. Use nsys to figure it out.
I can add a synchronize mark there to constrain it, but it hurts the performance by a lot.
Without xformers it will be slower.
One interesting thing is that lowvram goes faster than medvram. Can you upload an nsys profile?
> One interesting thing is that lowvram goes faster than medvram. Can you upload an nsys profile?
Yes, but higher VRAM usage. 93% vs 99% XD
> Maybe your actual compute work is lagging behind. Use nsys to figure it out.
Can't install it... I installed sudo apt install nsight-systems, but there is only nsys-ui, which doesn't work: Cannot mix incompatible Qt library (5.15.10) with this library (5.15.2) (I hate these Qt compatibility issues).
You can use the nsys CLI. Collect this data:
| Nsight setting | Value |
| --- | --- |
| Collect CUDA trace | On |
| Collect CUDA's GPU memory usage | On |
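If I am not mistaken, the equivalent CLI invocation would be something along these lines (flag names assumed from the nsys documentation):
nsys profile --trace=cuda --cuda-memory-usage=true -o lowvram-profile ./webui.sh --lowvram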
I only have nsys-ui after installation. Maybe I need to reboot the PC, but I'm afraid because of the Qt incompatibility error. It's a bad sign; maybe I won't be able to boot into my KDE XD. I already had a similar issue, and today I need to be online.
I will try to collect this data.
https://docs.nvidia.com/nsight-systems/UserGuide/index.html#installing-the-cli-on-your-target
@wfjsw check discord PM
IPEX does not seem to support pin_memory right now.
Needs a fix for the default options:
Traceback (most recent call last):
File "threading.py", line 973, in _bootstrap
File "threading.py", line 1016, in _bootstrap_inner
File "<enhanced_experience vendors.sentry_sdk.integrations.threading>", line 70, in run
File "E:\novelai-webui\py310\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
result = context.run(func, *args)
File "E:\novelai-webui\py310\lib\site-packages\gradio\utils.py", line 707, in wrapper
response = f(*args, **kwargs)
File "E:\novelai-webui\modules\ui_extra_networks.py", line 419, in pages_html
return refresh()
File "E:\novelai-webui\modules\ui_extra_networks.py", line 425, in refresh
pg.refresh()
File "E:\novelai-webui\modules\ui_extra_networks_textual_inversion.py", line 13, in refresh
sd_hijack.model_hijack.embedding_db.load_textual_inversion_embeddings(force_reload=True)
File "E:\novelai-webui\modules\textual_inversion\textual_inversion.py", line 222, in load_textual_inversion_embeddings
self.expected_shape = self.get_expected_shape()
File "E:\novelai-webui\modules\textual_inversion\textual_inversion.py", line 154, in get_expected_shape
vec = shared.sd_model.cond_stage_model.encode_embedding_init_text(",", 1)
File "E:\novelai-webui\modules\shared_items.py", line 128, in sd_model
return modules.sd_models.model_data.get_sd_model()
File "E:\novelai-webui\modules\sd_models.py", line 574, in get_sd_model
errors.display(e, "loading stable diffusion model", full_traceback=True)
File "E:\novelai-webui\modules\sd_models.py", line 571, in get_sd_model
load_model()
File "E:\novelai-webui\modules\sd_models.py", line 698, in load_model
load_model_weights(sd_model, checkpoint_info, state_dict, timer)
File "E:\novelai-webui\modules\sd_models.py", line 441, in load_model_weights
module.to(torch.float8_e4m3fn)
File "E:\novelai-webui\py310\lib\site-packages\torch\nn\modules\module.py", line 825, in _apply
param_applied = fn(param)
File "E:\novelai-webui\modules\sd_models.py", line 441, in <lambda>
module.to(torch.float8_e4m3fn)
RuntimeError: cannot pin 'CUDAFloat8_e4m3fnType' only dense CPU tensors can be pinned
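One possible direction for a fix (a hedged sketch; maybe_pin is a hypothetical helper, not code from this PR) is to only pin dense CPU tensors, as the error message suggests:

```python
import torch

def maybe_pin(t: torch.Tensor) -> torch.Tensor:
    # Only dense CPU tensors can be pinned; skip CUDA (and non-strided) tensors.
    if t.device.type == 'cpu' and t.layout == torch.strided and not t.is_pinned():
        return t.pin_memory()
    return t

# e.g. module._apply(maybe_pin) instead of module._apply(lambda x: x.pin_memory())
```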
Intel A750 8G (IPEX backend): this improves the performance from 0.7it/s to 1.5it/s with no significant VRAM usage increase.
Someone says LoRA is not actually working. Pending test.
UPDATE: I cannot reproduce it.
UPDATE: For FP16 LoRAs, it has a hard time trying to apply them on the CPU. Needs a cast here.
TODO: add a queue somewhere to constrain the speed.
@light-and-ray can you try this? It should no longer OOM now. ~~nvm, I implemented it wrongly~~
It still uses more VRAM than medvram:
Time taken: 17.2 sec.
A: 1.27 GB, R: 1.85 GB, Sys: 2.0/1.95508 GB (99.9%)