Jianshu Wang
Need some help on making this support LoRA/ControlNets. As these probably alter the weights and biases, the tensors cached in the mover may be outdated, and a slow path will...
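One way to detect that a LoRA/ControlNet patch invalidated the cache would be a version tag per parameter (PyTorch already bumps `Tensor._version` on in-place edits). This is a minimal pure-Python sketch of the idea, not the mover's actual API; `TensorCache` and `load_fn` are hypothetical names:

```python
class TensorCache:
    """Hypothetical sketch: cache uploaded copies keyed by a version
    counter, so a LoRA patch that bumps the version forces the slow
    (re-upload) path instead of serving a stale tensor."""

    def __init__(self):
        self._cache = {}  # name -> (version, cached_copy)

    def get(self, name, version, load_fn):
        entry = self._cache.get(name)
        if entry is not None and entry[0] == version:
            return entry[1]      # fast path: cached copy still valid
        copy = load_fn()         # slow path: re-upload after patching
        self._cache[name] = (version, copy)
        return copy
```

The mover would bump the stored version whenever LoRA weights are merged in, paying the upload cost only on the step where the patch actually changed.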
The CUDA stream thing is used because I want to overlap memcpy with compute. Streams can be seen as threads. Briefly speaking, it does several things (all in a non-blocking...
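For reference, the overlap pattern looks roughly like this in PyTorch; the function and variable names here are illustrative, not the PR's actual code:

```python
import torch

def prefetch_params(module, device, copy_stream):
    """Enqueue H2D copies of module's parameters on a side stream.

    The copies run concurrently with kernels on the default stream;
    non_blocking=True only truly overlaps when the source tensors
    live in pinned host memory."""
    with torch.cuda.stream(copy_stream):
        for p in module.parameters():
            p.data = p.data.to(device, non_blocking=True)

def run_overlapped(layers, x, device):
    """While layer i computes, layer i+1's weights are in flight."""
    copy_stream = torch.cuda.Stream()
    prefetch_params(layers[0], device, copy_stream)
    for i, layer in enumerate(layers):
        # Default stream waits until this layer's copy has landed...
        torch.cuda.current_stream().wait_stream(copy_stream)
        # ...then the next layer's copy starts while this one computes.
        if i + 1 < len(layers):
            prefetch_params(layers[i + 1], device, copy_stream)
        x = layer(x)
    return x
```

The `wait_stream` call is the cross-stream ordering point: without it the compute kernels could read half-copied weights.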
Actually, there are 2 main pain points that drive me here:

- To do `not moving from GPU to CPU` on the module's level, I need to clone the module...
A better way is implemented here, which uses the async nature of CUDA. One thing to note is that for the acceleration to work, the weights and biases of the unet...
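Pinning is the detail that makes the async copies real: `cudaMemcpyAsync` from pageable host memory silently degrades to a synchronous copy. A hedged sketch of what that preparation step could look like (the function name is mine, not the PR's):

```python
import torch

def pin_module(module):
    """Move a CPU module's weights and buffers into page-locked
    (pinned) host memory, so later .to(device, non_blocking=True)
    calls can genuinely overlap with compute instead of silently
    falling back to synchronous copies."""
    for p in module.parameters():
        p.data = p.data.pin_memory()
    for b in module.buffers():
        b.data = b.data.pin_memory()
    return module
```

Pinned memory cannot be paged out, so this trades a chunk of locked host RAM for copy/compute overlap; that is why it is done once up front for the unet rather than per step.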
> As I understand it requires more vram then old lowvram. Maybe you should disable it by default?

I profiled with `PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync` and without FP8. The original implementation takes 166...
> Is it total vram usage? I tested my 2gb it ate about 1.7 GB in fp16 mode lowvram

It is the peak usage recorded by Nsight. `PYTORCH_CUDA_ALLOC_CONF` makes big...
A closer look shows that it is the horizontal scale of the diagram. The actual usage is smaller. See the tooltips on the new screenshots.
Maybe your actual compute work is lagging behind. Use `nsys` to figure it out. I can add a synchronize mark there to constrain it, but it hurts the performance by a...
You can use the nsys CLI. Collect these data:

| | |
| -- | -- |
| Collect CUDA trace | On |
| Collect CUDA's GPU memory usage | On |
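The equivalent CLI invocation would be something like this (flags per the Nsight Systems user guide; the script name is a placeholder for however you launch the webui):

```shell
# CUDA trace + GPU memory usage, matching the two GUI options above
nsys profile \
  --trace=cuda \
  --cuda-memory-usage=true \
  -o lowvram_profile \
  python launch.py
```

This writes a `.nsys-rep` file you can open in the Nsight Systems GUI to inspect the copy/compute overlap and the memory timeline tooltips.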
https://docs.nvidia.com/nsight-systems/UserGuide/index.html#installing-the-cli-on-your-target