stable-diffusion-webui
[WIP] Asynchronous model mover for lowvram
Description
- This is an attempt to speed up --lowvram by taking the model moving out of the forward loop.
- The model moving is made asynchronous by creating a separate CUDA stream dedicated to moving the model, and by using CUDA events to synchronize back to the default stream.
- A lookahead buffer zone is designed so that model moving stays ahead of the forward pass, keeping the GPU busy at all times.
I'm getting 3.7 it/s on a 3060 Laptop with half the VRAM usage of --medvram. It was originally 1.65 it/s. For reference, the --medvram speed is 5.8 it/s.
Concerns
- This is still a prototype, and not all original semantics are followed.
- CUDA streams and CUDA events are used, which are CUDA-specific. I think IPEX has similar concepts, but DML has nothing comparable.
- The size of the lookahead buffer is a tweakable setting. A larger buffer increases VRAM usage; a smaller buffer would probably make the forward pass a bit slower. The speed gained from a larger buffer has a limit.
Checklist:
- [x] I have read contributing wiki page
- [x] I have performed a self-review of my own code
- [x] My code follows the style guidelines
- [x] My code passes tests
Smart mover
The smart mover does something similar to Forge: it only moves tensors from CPU to GPU, never back.
At some point I was somehow able to get the same or even 2x faster speed than sd-webui-forge under --always-no-vram (which is roughly equivalent to --lowvram) when max_prefetch is 5-8. Now I can only get it as fast as Forge, and the output is somehow broken. Unfortunately I did not save the file at that time. There are bugs hidden somewhere in the code, but I am getting tired of trying to find them.
I'm going to leave it as is and come back when I get interested again.
The broken images seem to have been caused by not synchronizing back to the creation stream after usage. Fixed.
Also changed to layer-wise movement.
There might be a problem with extra networks. Haven't looked into that.
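To make the "never back" part concrete, here is a minimal, hedged sketch of the idea at the layer level (run_layer is a hypothetical helper, not the PR's code; buffers and CUDA streams are omitted for brevity):

```python
import torch

def run_layer(layer: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    # Master copies of the weights stay on the CPU; a temporary GPU copy is used
    # for the forward pass and then simply dropped, never copied back to the CPU.
    cpu_params = {name: p.data for name, p in layer.named_parameters(recurse=False)}
    try:
        for _, p in layer.named_parameters(recurse=False):
            p.data = p.data.to('cuda', non_blocking=True)   # CPU -> GPU copy
        return layer(x)
    finally:
        for name, p in layer.named_parameters(recurse=False):
            p.data = cpu_params[name]   # restore the CPU master; the GPU copy is freed
```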
This looks very cool, but please don't change the formatting of those existing lines in lowvram.py (newlines and quotes), put those new classes into a separate file, and write a bit of a comment there about how the performance gain is achieved. Also maybe add an option to use the old method even if streaming is supported.
Need some help making this support Lora/ControlNets.
As these probably alter the weights and biases, the tensors cached in the mover may be outdated, and a slow path will be taken.
I'll be honest with you, I don't know how it works, so I can't help either. The "not moving from GPU to CPU" part is smart and reasonable, and it can be implemented with ease, but the CUDA streams thing would require me to get a lot more involved to understand.
Plus, there is FP8 support now; maybe that can work better than lowvram for the people who need it?
The CUDA stream thing is used because I want to overlap memcpy with compute. Streams can be thought of as threads.
Briefly speaking, it does several things, all in a non-blocking way to Python (a minimal code sketch follows after the lists below):
- On stream B, the CPU tensors are copied to CUDA.
- On stream B, it calls record_event, which can be seen as a timeline marker. It marks the tensor as ready.
- On the default stream, it waits for the ready event, and then computes the forward pass with the tensor.
- On the default stream, it calls record_event, which marks the work on this tensor as done.
- On stream B, it waits for the done event, so the tensor is eventually deallocated when it is deleted, after it has finished its job.
Apart from the moving itself, I have to do these things in addition:
- Track the CUDA tensor so it is the one used for forwarding
- Save the events so they can be waited on from the other stream
- Maintain references to the tensors so they are deallocated at the right time
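Here is a minimal sketch of that stream/event handshake in PyTorch. The helper names (prefetch, use_and_release) are illustrative only; this shows the pattern described above, not the PR's actual mover classes:

```python
import torch

copy_stream = torch.cuda.Stream()              # "stream B", dedicated to host-to-device copies
compute_stream = torch.cuda.default_stream()   # default stream running the forward pass

def prefetch(cpu_tensor):
    """Copy a (pinned) CPU tensor to the GPU on the copy stream and mark it ready."""
    with torch.cuda.stream(copy_stream):
        gpu_tensor = cpu_tensor.to('cuda', non_blocking=True)
        ready = torch.cuda.Event()
        ready.record(copy_stream)              # timeline marker: the tensor is ready
    return gpu_tensor, ready

def use_and_release(gpu_tensor, ready, forward_fn):
    """Consume the prefetched tensor on the default stream, then let stream B reuse the memory."""
    compute_stream.wait_event(ready)           # default stream waits for the "ready" event
    out = forward_fn(gpu_tensor)               # compute the forward with the tensor
    done = torch.cuda.Event()
    done.record(compute_stream)                # marks the work on this tensor as done
    copy_stream.wait_event(done)               # stream B waits for "done", so once the last
                                               # reference to gpu_tensor is dropped, its memory
                                               # can be safely reused by later copies
    return out
```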
Can the "not moving from GPU to CPU" really be implemented easily? Torch moves modules in-place, which is a major pain that prevents me from keeping the implementation simple and forces me to place hooks on torch.nn.functional. I assume I would have to do a deep copy to achieve this, and that sounds costly. CUDA streams, on the other hand, look easier.
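For reference, a hook on torch.nn.functional could look roughly like the sketch below. This is only an illustration of the pattern, not the actual hook used in this repo:

```python
import torch
import torch.nn.functional as F

_original_linear = F.linear

def _linear_with_mover(input, weight, bias=None):
    # If the weight was left on the CPU, substitute a temporary GPU copy just for
    # this call; the CPU master is untouched and nothing is moved back afterwards.
    if weight.device != input.device:
        weight = weight.to(input.device, non_blocking=True)
    if bias is not None and bias.device != input.device:
        bias = bias.to(input.device, non_blocking=True)
    return _original_linear(input, weight, bias)

F.linear = _linear_with_mover   # monkeypatch; remember to restore _original_linear later
```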
Regarding FP8, I think it does not hurt to have more options.
Actually, there are 2 main pain points that drive me here:
- To do "not moving from GPU to CPU" at the module level, I need to clone the module and use the clone for forwarding. It can't be done with a forward hook and can't be reliably done by monkeypatching forward.
- To actually gain a performance benefit, I need to know the next N tensors while hooking, and I don't know how to do this in a reasonable way. There is an alternative, which is avoiding CUDA stream synchronization in the middle of the computation, so I can queue all jobs before they run. In that situation the result will not be immediately available to the Python world after forward. AFAIK Torch, however, performs synchronization on every module's forward, so this is hard to do.
A better way is implemented here, which uses the async nature of CUDA. One thing to note: for the acceleration to work, the weights and biases of the UNet must be placed in non-pageable (pinned) memory (they will go back to pageable memory if the module is somehow to()-ed). Lora is tested.
However, should any extension or module touch the weights and biases of the model (by using to(), for example), it needs to make them pinned again with ._apply(lambda x: x.pin_memory()). Otherwise it will fall back to the slow path.
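For example (a hedged sketch; the Linear module here only stands in for the real UNet):

```python
import torch

unet = torch.nn.Linear(8, 8)              # stand-in for the real UNet module
unet.to(dtype=torch.float16)              # .to() leaves the weights in pageable memory again
unet._apply(lambda x: x.pin_memory())     # re-pin so the async mover keeps its fast path
```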
As I understand it, this requires more VRAM than the old lowvram. Maybe you should disable it by default?
Also, thanks to FP8, 2 GB is enough for medvram, and lowvram saves only about 500 MB. With this patch, this already small VRAM difference will become even smaller.
> As I understand it, this requires more VRAM than the old lowvram. Maybe you should disable it by default?
I profiled with PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync and without FP8. The original implementation takes 166 MB, while this implementation takes 387 MB. The difference is negligible compared with the difference in sampling step time, which is 625 ms vs 229 ms.
Also, the two streams do not go out of sync by a big margin.
> Also, thanks to FP8, 2 GB is enough for medvram, and lowvram saves only about 500 MB. With this patch, this already small VRAM difference will become even smaller.
False. This takes significantly less VRAM. 890 MB vs 350 MB. The speed difference is 200 ms per step vs 260 ms per step.
I saw in Discord that async lowvram keeps more than one layer on the GPU. But maybe it really requires even less VRAM, I don't know.
> The original implementation takes 755 MB
Is that the total VRAM usage? I tested on my 2 GB card; it ate about 1.7 GB in FP16 lowvram mode.
I will test this patch, the original lowvram, and medvram on it.
> Is that the total VRAM usage? I tested on my 2 GB card; it ate about 1.7 GB in FP16 lowvram mode.
It is the peak usage recorded by Nsight.
PYTORCH_CUDA_ALLOC_CONF makes a big difference here: the native backend does consume ~1.6 GB of VRAM. But I think it is a matter of GC, and it can resolve itself when there is VRAM pressure.
A closer look shows that it was the horizontal scale of the diagram; the actual usage is smaller. See the tooltips on the new screenshots.
GPU MX150 2GB
ARGS:
export PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
./webui.sh --xformers "$@"
+ fp8
4 steps 512x512
| Config | Time taken | A | R | Sys |
| --- | --- | --- | --- | --- |
| --lowvram | 22.1 sec | 0.96 GB | 1.19 GB | 1.3/1.95508 GB (66.0%) |
| --medvram | 18.7 sec | 1.33 GB | 1.72 GB | 1.8/1.95508 GB (93.2%) |
| this patch --lowvram | 17.2 sec | 1.17 GB | 1.84 GB | 1.9/1.95508 GB (99.7%) |
torch: 2.1.2+cu121 • xformers: 0.0.23.post1
Hm, this patch really requires more VRAM for me.
Maybe it ignores PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync? Maybe I need to update the Nvidia driver? Or maybe --xformers is the problem?
The same VRAM usage, but slower...
This patch, no xformers and no PYTORCH_CUDA_ALLOC_CONF:
Time taken: 23.2 sec.
A: 1.25 GB, R: 1.85 GB, Sys: 1.9/1.95508 GB (99.2%)
Maybe your actual compute work is lagging behind. Use nsys to figure it out.
I can add a synchronize mark there to constrain it, but it hurts the performance by a lot.
Without xformers it will be slower.
One interesting thing is that lowvram goes faster than medvram. Can you upload an nsys profile?
> One interesting thing is that lowvram goes faster than medvram. Can you upload an nsys profile?
Yes, but higher VRAM usage. 93% vs 99% XD
> Maybe your actual compute work is lagging behind. Use nsys to figure it out.
Can't install it... I installed sudo apt install nsight-systems, but there is only nsys-ui, which doesn't work: Cannot mix incompatible Qt library (5.15.10) with this library (5.15.2) (I hate these Qt compatibility issues).
You can use the nsys CLI. Collect this data:
| Nsight setting | Value |
| --- | --- |
| Collect CUDA trace | On |
| Collect CUDA's GPU memory usage | On |
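If I am not mistaken, the equivalent CLI invocation would be something along these lines (flag names assumed from the nsys documentation):
nsys profile --trace=cuda --cuda-memory-usage=true -o lowvram-profile ./webui.sh --lowvram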
I only have nsys-ui after installation. Maybe I need to reboot the PC, but I'm afraid because of the Qt incompatibility error. It's a bad sign; maybe I won't be able to boot into my KDE XD. I already had a similar issue, and today I need to be online.
I will try to collect this data.
https://docs.nvidia.com/nsight-systems/UserGuide/index.html#installing-the-cli-on-your-target
@wfjsw check discord PM
IPEX does not seem to support pin_memory right now.
Needs a fix for the default options:
Traceback (most recent call last):
File "threading.py", line 973, in _bootstrap
File "threading.py", line 1016, in _bootstrap_inner
File "<enhanced_experience vendors.sentry_sdk.integrations.threading>", line 70, in run
File "E:\novelai-webui\py310\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
result = context.run(func, *args)
File "E:\novelai-webui\py310\lib\site-packages\gradio\utils.py", line 707, in wrapper
response = f(*args, **kwargs)
File "E:\novelai-webui\modules\ui_extra_networks.py", line 419, in pages_html
return refresh()
File "E:\novelai-webui\modules\ui_extra_networks.py", line 425, in refresh
pg.refresh()
File "E:\novelai-webui\modules\ui_extra_networks_textual_inversion.py", line 13, in refresh
sd_hijack.model_hijack.embedding_db.load_textual_inversion_embeddings(force_reload=True)
File "E:\novelai-webui\modules\textual_inversion\textual_inversion.py", line 222, in load_textual_inversion_embeddings
self.expected_shape = self.get_expected_shape()
File "E:\novelai-webui\modules\textual_inversion\textual_inversion.py", line 154, in get_expected_shape
vec = shared.sd_model.cond_stage_model.encode_embedding_init_text(",", 1)
File "E:\novelai-webui\modules\shared_items.py", line 128, in sd_model
return modules.sd_models.model_data.get_sd_model()
File "E:\novelai-webui\modules\sd_models.py", line 574, in get_sd_model
errors.display(e, "loading stable diffusion model", full_traceback=True)
File "E:\novelai-webui\modules\sd_models.py", line 571, in get_sd_model
load_model()
File "E:\novelai-webui\modules\sd_models.py", line 698, in load_model
load_model_weights(sd_model, checkpoint_info, state_dict, timer)
File "E:\novelai-webui\modules\sd_models.py", line 441, in load_model_weights
module.to(torch.float8_e4m3fn)
File "E:\novelai-webui\py310\lib\site-packages\torch\nn\modules\module.py", line 825, in _apply
param_applied = fn(param)
File "E:\novelai-webui\modules\sd_models.py", line 441, in <lambda>
module.to(torch.float8_e4m3fn)
RuntimeError: cannot pin 'CUDAFloat8_e4m3fnType' only dense CPU tensors can be pinned
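One possible direction for a fix (a hedged sketch; maybe_pin is a hypothetical helper, not code from this PR) is to only pin dense CPU tensors, as the error message suggests:

```python
import torch

def maybe_pin(t: torch.Tensor) -> torch.Tensor:
    # Only dense CPU tensors can be pinned; skip CUDA (and non-strided) tensors.
    if t.device.type == 'cpu' and t.layout == torch.strided and not t.is_pinned():
        return t.pin_memory()
    return t

# e.g. module._apply(maybe_pin) instead of module._apply(lambda x: x.pin_memory())
```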
Intel A750 8G (IPEX backend): this improves the performance from 0.7it/s to 1.5it/s with no significant VRAM usage increase.
Someone says LoRA is not actually working. Pending test.
UPDATE: I cannot reproduce it.
UPDATE: For FP16 LoRAs, it has a hard time trying to apply them on the CPU. Needs a cast here.
TODO: add a queue somewhere to constrain the speed.
@light-and-ray can you try this? It should no longer OOM now. ~~nvm, I implemented it wrongly~~
It still uses more VRAM than medvram:
Time taken: 17.2 sec.
A: 1.27 GB, R: 1.85 GB, Sys: 2.0/1.95508 GB (99.9%)