Jianshu Wang
Need some help on making this support LoRA/ControlNets. As these probably alter the weights and biases, the tensors cached in the mover may be outdated, and a slow path will...
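One way to detect that a LoRA/ControlNet patch invalidated the cache would be a version tag per parameter (PyTorch already bumps `Tensor._version` on in-place edits). This is a minimal pure-Python sketch of the idea, not the mover's actual API; `TensorCache` and `load_fn` are hypothetical names:

```python
class TensorCache:
    """Hypothetical sketch: cache uploaded copies keyed by a version
    counter, so a LoRA patch that bumps the version forces the slow
    (re-upload) path instead of serving a stale tensor."""

    def __init__(self):
        self._cache = {}  # name -> (version, cached_copy)

    def get(self, name, version, load_fn):
        entry = self._cache.get(name)
        if entry is not None and entry[0] == version:
            return entry[1]      # fast path: cached copy still valid
        copy = load_fn()         # slow path: re-upload after patching
        self._cache[name] = (version, copy)
        return copy
```

The mover would bump the stored version whenever LoRA weights are merged in, paying the upload cost only on the step where the patch actually changed.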
The CUDA stream thing is used because I want to overlap memcpy with compute. Streams can be seen as threads. Briefly speaking, it does several things (all in a non-blocking...
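For reference, the overlap pattern looks roughly like this in PyTorch; the function and variable names here are illustrative, not the PR's actual code:

```python
import torch

def prefetch_params(module, device, copy_stream):
    """Enqueue H2D copies of module's parameters on a side stream.

    The copies run concurrently with kernels on the default stream;
    non_blocking=True only truly overlaps when the source tensors
    live in pinned host memory."""
    with torch.cuda.stream(copy_stream):
        for p in module.parameters():
            p.data = p.data.to(device, non_blocking=True)

def run_overlapped(layers, x, device):
    """While layer i computes, layer i+1's weights are in flight."""
    copy_stream = torch.cuda.Stream()
    prefetch_params(layers[0], device, copy_stream)
    for i, layer in enumerate(layers):
        # Default stream waits until this layer's copy has landed...
        torch.cuda.current_stream().wait_stream(copy_stream)
        # ...then the next layer's copy starts while this one computes.
        if i + 1 < len(layers):
            prefetch_params(layers[i + 1], device, copy_stream)
        x = layer(x)
    return x
```

The `wait_stream` call is the cross-stream ordering point: without it the compute kernels could read half-copied weights.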
Actually, there are 2 main pain points that drive me here:

- To do `not moving from GPU to CPU` on the module's level, I need to clone the module...
A better way is implemented here, which uses the async nature of CUDA. One thing to note is that for the acceleration to work, the weights and biases of the unet...
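Pinning is the detail that makes the async copies real: `cudaMemcpyAsync` from pageable host memory silently degrades to a synchronous copy. A hedged sketch of what that preparation step could look like (the function name is mine, not the PR's):

```python
import torch

def pin_module(module):
    """Move a CPU module's weights and buffers into page-locked
    (pinned) host memory, so later .to(device, non_blocking=True)
    calls can genuinely overlap with compute instead of silently
    falling back to synchronous copies."""
    for p in module.parameters():
        p.data = p.data.pin_memory()
    for b in module.buffers():
        b.data = b.data.pin_memory()
    return module
```

Pinned memory cannot be paged out, so this trades a chunk of locked host RAM for copy/compute overlap; that is why it is done once up front for the unet rather than per step.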
> As I understand it requires more vram then old lowvram. Maybe you should disable it by default?

I profiled with `PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync` and without FP8. The original implementation takes 166...
> Is it total vram usage? I tested my 2gb it ate about 1.7 GB in fp16 mode lowvram

It is the peak usage recorded by Nsight. `PYTORCH_CUDA_ALLOC_CONF` makes big...
A closer look shows that it is the horizontal scale of the diagram. The actual usage is smaller. See the tooltips on the new screenshots.
Maybe your actual compute work is lagging behind. Use `nsys` to figure it out. I can add a synchronize mark there to constrain it, but it hurts the performance by a...
You can use the nsys CLI. Collect these data:

| | |
| -- | -- |
| Collect CUDA trace | On |
| Collect CUDA's GPU memory usage | On |
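The equivalent CLI invocation would be something like this (flags per the Nsight Systems user guide; the script name is a placeholder for however you launch the webui):

```shell
# CUDA trace + GPU memory usage, matching the two GUI options above
nsys profile \
  --trace=cuda \
  --cuda-memory-usage=true \
  -o lowvram_profile \
  python launch.py
```

This writes a `.nsys-rep` file you can open in the Nsight Systems GUI to inspect the copy/compute overlap and the memory timeline tooltips.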
https://docs.nvidia.com/nsight-systems/UserGuide/index.html#installing-the-cli-on-your-target