InvokeAI
LoRA patching optimization
Summary
This implements a scheme in which the original weights are not restored after LoRA patching, saving a second or so per LoRA. This only occurs under the following conditions:
- The execution device is CUDA
- The model in question has the load_state_dict() method used for the optimized RAM->VRAM copy operation.
- There is insufficient VRAM cache to hold the model.
Otherwise the original weights are restored as per the previous behavior.
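For illustration, the skip condition can be sketched roughly as follows (a minimal sketch with hypothetical names, not the actual InvokeAI code):

```python
# Hypothetical sketch of the conditions listed above.
def can_skip_weight_restore(cache_entry, execution_device, max_vram_cache_bytes) -> bool:
    return (
        execution_device.type == "cuda"                # execution device is CUDA
        and cache_entry.supports_fast_state_dict_load  # optimized RAM->VRAM copy via load_state_dict()
        and cache_entry.size > max_vram_cache_bytes    # insufficient VRAM cache to hold the model
    )
```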
NOTE: The lazy_offloading option has been disabled. Lazy offloading defers the VRAM offloading step till the next time the VRAM cache is accessed, and under some circumstances could have led to a patched version of the model being reused.
Related Issues / Discussions
QA Instructions
Merge Plan
Checklist
- [x] The PR has a short but descriptive title, suitable for a changelog
- [x] Tests added / updated (if applicable)
- [x] Documentation added / updated (if applicable)
Do we have a rough idea of what percentage of loras would be eligible for this improved scheme?
100% of LoRAs. All diffusers and safetensors "main" models will benefit. The only exceptions are ONNX models.
Oh - I misinterpreted the requirements as being for the lora, instead of the main model. Cool!
I feel like we are choosing a solution here without fully understanding the problem. I mentioned this briefly on Discord here: https://discord.com/channels/1020123559063990373/1049495067846524939/1243665704646082592
The answers to some of those questions might lead us to very different solutions.
NOTE: The lazy_offloading option has been disabled. Lazy offloading defers the VRAM offloading step till the next time the VRAM cache is accessed, and under some circumstances could have led to a patched version of the model being reused.
Can you explain this decision in more detail, and what the goal is? There are a couple parts of it that don't make sense to me:
- From looking at offload_unlocked_models(...), disabling lazy offloading changes the likelihood that models get offloaded, but doesn't change any guarantees around whether models get offloaded. How does this solve the problem you were facing? If the goal is to guarantee that models get offloaded, then we would have to remove the VRAM cache altogether.
- In the non-lazy offloading case we call self._cache.offload_unlocked_models(self._cache_entry.size) on offload. The choice of self._cache_entry.size seems arbitrary - we don't know how much VRAM we are going to need for the next model. Am I missing something here? This seems broken, but it's not new to your change.
The model in RAM is the same size as the model in VRAM. If the model is larger than max_vram_cache then it will be offloaded on the next call to offload_unlocked_models regardless of the capacity requested in offload_unlocked_models(). Removing lazy_offloading guarantees that offload_unlocked_models will be called before the model is next needed.
In the non-lazy offloading case we call self._cache.offload_unlocked_models(self._cache_entry.size) on offload. The choice of self._cache_entry.size seems arbitrary - we don't know how much VRAM we are going to need for the next model. Am I missing something here? This seems broken, but it's not new to your change.
You've caught a bug. The size should be zero. We want to offload models that are consuming more space than the VRAM cache limit. In practice this bug makes the VRAM cache less effective.
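For reference, a rough sketch of what the corrected call would mean with a requested size of zero (paraphrased with illustrative names; the real method lives in the model cache and may differ):

```python
# Illustrative only. With size_required=0, unlocked models are evicted just until
# VRAM usage drops back under max_vram_cache_size, instead of also reserving
# headroom for an unknown next model.
def offload_unlocked_models(self, size_required: int = 0) -> None:
    vram_in_use = self._vram_in_use()
    for entry in self._entries_by_last_access():
        if vram_in_use + size_required <= self.max_vram_cache_size:
            break
        if not entry.locked:
            self._move_to_cpu(entry)  # drop the VRAM copy; weights remain in the RAM cache
            vram_in_use -= entry.size
```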
I think I understand what you are trying to do now. I had missed that has_transient_weights(...) was checking cache_entry.size.
To summarize my understanding: By removing lazy offloading, we are forcing large models ( > max_vram_cache_size ) to be offloaded in slightly more cases than they otherwise would (specifically cases when they are re-used immediately). We do this so that we can accurately predict cases where they will definitely be offloaded, and apply a LoRA optimization in those cases.
Let me think about this some more... there might be a simpler way.
Thinking aloud, I have a few concerns about the current approach:
- Whether the LoRA optimization gets applied depends on the relative sizes of the model and the VRAM cache. This makes it hard to tune the VRAM cache size, because there are 2 things changing at once (offloading behaviour and lora patching behaviour).
- has_transient_weights(...) is a low-recall, perfect-precision predictor of whether the weights will be offloaded. I.e. it is really determining whether the weights will definitely be offloaded. There are cases where the weights will be offloaded, but it's hard to predict, so we miss out on opportunities to apply the LoRA optimizations.
- The current heuristic happens to work reasonably well given the relative sizes of the stable diffusion sub-models, but we shouldn't bake that into our implementation. Imagine, for example, that the VAE was larger than the UNet and TEs. In that context, it's less obvious to me that getting rid of lazy offloading is a good idea.
Brainstorming
There are 2 broad types of model patches:
- Weight-only: Patches that only modify the weights of the model. These can be unpatched by either an unpatch operation or VRAM offload.
- Shape: Patches that modify the structure/shape of the model (e.g. TI) - these can only be unpatched by an unpatch operation.
For weight-only patches, when should we unpatch via VRAM offload?
- If we must offload the model before its next use to make room for another one, then we should unpatch via VRAM offload.
- If we don't have to offload the model before its next use, then we should choose the faster of: 1) unpatch operation, 2) VRAM offload + VRAM load.
The main complicating factor is that it can be hard to predict whether a model will need to be offloaded before its next use. On a 4GB GPU, offloads are frequent. On an 8GB GPU, it depends on your workload. On a 16GB GPU, offloads are uncommon.
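For concreteness, the rule above can be sketched as follows (all names are hypothetical; estimating must_offload_before_next_use is exactly the hard part):

```python
# Sketch of the decision rule for weight-only patches; all inputs are assumed
# to be measured or estimated elsewhere.
def choose_unpatch_strategy(
    must_offload_before_next_use: bool,
    t_unpatch_op: float,           # cost of the explicit unpatch operation
    t_offload_plus_reload: float,  # cost of a VRAM offload followed by a RAM->VRAM reload
) -> str:
    if must_offload_before_next_use:
        return "offload"  # the offload happens anyway, so unpatching via offload is free
    return "offload" if t_offload_plus_reload < t_unpatch_op else "unpatch_op"
```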
Option 0 (the current implementation):
- Downside: At a certain GPU size (8GB? 16GB?) you probably get better results by keeping lazy offload and just making sure your VRAM cache size is large enough.
Option 1: Just make LoRA patching faster
- What is the slow part of LoRA patching/unpatching? Can we make it 10x faster? Then these unpatch-via-offload optimizations become almost irrelevant.
- Do we have unnecessary CUDA synchronizations in the patching process?
- If it's the VRAM-to-CPU copy during patching, could we give the patcher access to the model cache's CPU copy for this?
Option 2: unpatch_lora_by_vram_offload Flag
- Use a config flag (e.g. unpatch_lora_by_vram_offload) to force LoRA unpatch via offloading globally, irrespective of the VRAM cache size.
- This makes it much easier to tune configs and know what's going on.
- We could optionally make this configurable on a per-model basis. If we do that, then I think this is strictly better than Option 0. (We could configure it to behave like Option 0, if desired.)
Option 3: Unpatch stack
- As patches are applied, register a stack of unpatch operations that must be applied. Only apply them if the model isn't offloaded before its next use.
- Handling interactions between weight-only and shape patches would be complicated.
- If we are willing to add this level of complexity, we might as well figure out how to avoid unnecessary unpatch/repatch as well.
What do you think? Do you agree with the downsides of the current approach? Does this spark any ideas?
I wonder how this is managed in other applications that regularly load and modify large models? I suppose the most accessible prior art would be other image gen applications (comfy, A1111, SD.Next, Forge, Fooocus) - I'd guess LLMs don't manipulate weights as frequently. It could be that this is a solved problem.
I also wonder if memory management should be so opaque to the user. Maybe it's better to lean toward optimizations being configurable with flags, and let the user tune the app. There are bound to be systems or usage patterns where our optimizations are detrimental.
Here's some statistics on LoRA patching time. Timings are for an SDXL model with the "add_detail_xl" and "VoxelXL_v1" LoRAS:
Time to patch the weights in VRAM: 0.70s
Time to copy original weights to RAM: 1.27s
Time to restore original weights: 0.43s
So the total overhead for weight copying and restoration is 1.7s of the 2.4s total process, or 71%.
Options 1 and 3 will create a dependency between the model patcher and the model manager, which is something I wanted to avoid for no other reason than it requires passing a pointer to the model manager service through several layers of code and makes it difficult to use the model patcher module outside of the web app.
Option 4: Flag Models that Need to be Offloaded
However, if this is OK architecturally, then I would add a fourth option, which is to allow the patcher to set a dirty flag in the model's cache entry (presumably via a method call to the model manager). This would tell the model manager to unconditionally offload the model from VRAM. This would be coupled to an unpatch_lora_by_vram_offload config flag that gives the user control of this behavior.
[Edit] This preserves the performance benefits of the VRAM cache and lazy_offloading without adding too much complexity.
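To make the Option 4 idea concrete, a minimal sketch (identifiers such as vram_copy_is_dirty, get_entry, and move_to_cpu are hypothetical):

```python
# Sketch only: the patcher asks the model manager to mark the cache entry dirty,
# and the cache unconditionally drops the patched VRAM copy on unlock.
def mark_model_dirty(model_cache, model_key, app_config) -> None:
    # Called by the patcher after applying LoRA weights in place.
    if app_config.unpatch_lora_by_vram_offload:
        model_cache.get_entry(model_key).vram_copy_is_dirty = True

def on_unlock(model_cache, cache_entry) -> None:
    # Called by the model cache when the model is unlocked after the invocation.
    if getattr(cache_entry, "vram_copy_is_dirty", False):
        model_cache.move_to_cpu(cache_entry)  # unconditionally drop the patched VRAM copy
        cache_entry.vram_copy_is_dirty = False
```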
- In the non-lazy offloading case we call self._cache.offload_unlocked_models(self._cache_entry.size) on offload. The choice of self._cache_entry.size seems arbitrary - we don't know how much VRAM we are going to need for the next model. Am I missing something here? This seems broken, but it's not new to your change.
This is addressed in PR #6450. Most if not all people have lazy_offloading enabled, so this bit of code was rarely called. It would have had the performance impact of clearing the VRAM cache unnecessarily.
Here's some statistics on LoRA patching time. Timings are for an SDXL model with the "add_detail_xl" and "VoxelXL_v1" LoRAS:
Time to patch the weights in VRAM: 0.70s
Time to copy original weights to RAM: 1.27s
Time to restore original weights: 0.43s
So the total overhead for weight copying and restoration is 1.7s of the 2.4s total process, or 71%.
Building on this, my intuition tells me that "Time to patch the weights in VRAM" grows linearly with number of LoRAs, while the time to copy/restore is constant regardless of number of LoRAs. Does that match your tests? Most people probably use 0-3 LoRAs, so that doesn't change things too much.
Does "Time to patch the weights in VRAM" include time to load the original LoRA state_dict into VRAM? Is that negligible?
For comparison, on this test system, what is the time to load the TEs and UNet from RAM to VRAM?
Options 1 and 3 will create a dependency between the model patcher and the model manager, which is something I wanted to avoid for no other reason than it requires passing a pointer to the model manager service through several layers of code and makes it difficult to use the model patcher module outside of the web app.
For Option 3, I agree. (For the record, I think Option 3 is a pretty bad idea 😆 )
For Option 1, I don't think it needs to introduce any dependency between the two. For example:
with model_cache.model_on_device(model_info) as (read_only_cpu_sd, model_on_device), \
     apply_lora(lora, model_on_device, read_only_cpu_sd):
    ...

Should workshop the API, but the point is that any potential coupling can be moved up to the caller.
For Option 1, I don't think it needs to introduce any dependency between the two. For example:
with model_cache.model_on_device(model_info) as (read_only_cpu_sd, model_on_device), \
     apply_lora(lora, model_on_device, read_only_cpu_sd):
    ...

Should workshop the API, but the point is that any potential coupling can be moved up to the caller.
Yeah, I was going to propose something similar, but using a config file setting to decide whether the model should be read only or not, rather than asking the model manager to make the decision. I think that @psychedelicious is right to leave the optimization decision to the user.
Here's my version of the pseudocode:
# in invokeai.yaml
unpatch_lora_weights = false
# in the invocation
read_only_model = not app_config.unpatch_lora_weights
with model_info(read_only_model) as model_on_device:
    apply_lora(lora, model_on_device, read_only_model)
When read_only_model is true, the LoadedModel context manager will set a flag in the cache indicating that the model manager should dispose of the VRAM copy of the model and disable lazy offloading for this model. The model patcher will skip the save/restore steps.
You're quite right that the weight save/restore step is a one-time overhead that doesn't significantly change as you add more LoRAs (it scales with the total number of unique keys that are patched). The patching timing that I reported included the time to move the LoRA weights into VRAM, but I didn't break that out from the time to update the model weights.
What's the next step?
For Option 1, I don't think it needs to introduce any dependency between the two. For example:
with model_cache.model_on_device(model_info) as (read_only_cpu_sd, model_on_device), \
     apply_lora(lora, model_on_device, read_only_cpu_sd):
    ...
Something that I think may have been missed in this pseudo-code is that read_only_cpu_sd is the full CPU state dict, not a bool. This has the advantage that we could save the 1.27s to copy weights to CPU in all cases, without making the user decide whether they want to enable unpatch_lora_weights based on their HW and usage patterns.
What's the next step?
I think we are converging on a couple of options. A final proposal summarizing those options and the tradeoffs would be helpful.
The one thing that I think is still unknown is whether the current LoRA patching / unpatching is well-optimized. There's a bunch of weight copying intertwined with matmuls in there - it's plausible to me that we could make it 10x faster, but I really have no idea. Optimizing that code might change where the bottlenecks are.
@lstein if you're interested in looking into optimizing the existing LoRA patching code it could make sense to start there. Otherwise, I think we could proceed with a final a proposal based on what we currently know.
For Option 1, I don't think it needs to introduce any dependency between the two. For example:
with model_cache.model_on_device(model_info) as (read_only_cpu_sd, model_on_device), \
     apply_lora(lora, model_on_device, read_only_cpu_sd):
    ...

Something that I think may have been missed in this pseudo-code is that read_only_cpu_sd is the full CPU state dict, not a bool. This has the advantage that we could save the 1.27s to copy weights to CPU in all cases, without making the user decide whether they want to enable unpatch_lora_weights based on their HW and usage patterns.
What's the next step?
I think we are converging on a couple of options. A final proposal summarizing those options and the tradeoffs would be helpful.
The one thing that I think is still unknown is whether the current LoRA patching / unpatching is well-optimized. There's a bunch of weight copying intertwined with matmuls in there - it's plausible to me that we could make it 10x faster, but I really have no idea. Optimizing that code might change where the bottlenecks are.
@lstein if you're interested in looking into optimizing the existing LoRA patching code it could make sense to start there. Otherwise, I think we could proceed with a final a proposal based on what we currently know.
I understand now. That's very clever and lets us save the step of saving the original weights to CPU. To recap:
- Add a model_on_device() method to the model cache that returns a copy of the state dictionary in CPU.
- The SD is passed to the model patcher.
- During model patching, the patcher records the keys (but not the tensors) of the layers that were patched.
- During model unpatching, the patcher copies the tensors from the SD back into the model.
The only drawback I see is that in the event that the model doesn't fit into the VRAM cache, the unpatching step will be redundant because the model will be removed from VRAM.
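For clarity, a condensed sketch of that recap (identifiers are illustrative, not the final API):

```python
import torch

class LoRAPatcherSketch:
    """Sketch: restore weights from the cache's CPU state dict instead of per-patch backups."""

    def __init__(self, cpu_state_dict: dict[str, torch.Tensor]):
        self._cpu_state_dict = cpu_state_dict  # read-only copy provided by model_on_device()
        self._patched_keys: set[str] = set()

    def patch(self, model: torch.nn.Module, deltas: dict[str, torch.Tensor]) -> None:
        params = dict(model.named_parameters())
        for key, delta in deltas.items():
            param = params[key]
            param.data += delta.to(device=param.device, dtype=param.dtype)
            self._patched_keys.add(key)  # record keys, not tensors

    def unpatch(self, model: torch.nn.Module) -> None:
        params = dict(model.named_parameters())
        for key in self._patched_keys:
            # Copy the pristine tensor from the CPU state dict back into the model.
            params[key].data.copy_(self._cpu_state_dict[key])
        self._patched_keys.clear()
```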
@lstein if you're interested in looking into optimizing the existing LoRA patching code it could make sense to start there. Otherwise, I think we could proceed with a final a proposal based on what we currently know.
I've tried a few small optimizations, including downcasting the lora to the same dtype as the model, not upcasting at the end, and not performing the to(torch.device('cpu')) call. I was able to shave off about 0.03s - not very useful!
It looks like most of the action will be in LoRALayer.get_weight(), which is doing a lot of reshaping and summing. I can profile these steps, but it isn't obvious to me how to optimize them. Do you have an intuition where the biggest payoff will be?
@lstein if you're interested in looking into optimizing the existing LoRA patching code it could make sense to start there. Otherwise, I think we could proceed with a final a proposal based on what we currently know.
I've tried a few small optimizations, including downcasting the lora to the same dtype as the model, not upcasting at the end, and not performing the to(torch.device('cpu')) call. I was able to shave off about 0.03s - not very useful!
It looks like most of the action will be in LoRALayer.get_weight(), which is doing a lot of reshaping and summing. I can profile these steps, but it isn't obvious to me how to optimize them. Do you have an intuition where the biggest payoff will be?
If you use the Torch profiler (https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html - see the section on viewing a trace in the Chrome trace viewer), you should be able to see whether the process is bottlenecked on CPU operations, weight copies, or GPU operations. Ideally, I think we'd want to maximize GPU utilization. If there are opportunities for big improvements, I'd guess that they would come from eliminating unnecessary CUDA synchronizations (https://pytorch.org/docs/stable/notes/cuda.html#asynchronous-execution), probably by either changing the sequence of copy and matmul operations or by making use of .to(..., non_blocking=True).
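For example, something along these lines (the wrapped call is a hypothetical stand-in for the real patch/unpatch loop):

```python
from torch.profiler import profile, ProfilerActivity

# Wrap the patch/unpatch loop in the profiler, then inspect the summary table and
# the Chrome trace for long cudaStreamSynchronize / cudaMemcpyAsync spans.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    run_lora_patch_and_unpatch(model, loras)  # hypothetical stand-in for the real loop

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("lora_patching_trace.json")  # open in chrome://tracing or Perfetto
```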
@RyanJDick I've implemented the model_on_device() idea and it seems to be performing as expected. Please take a look.
If acceptable, I propose merging in this optimization and then look into profiling and optimizing further as you suggest.
@RyanJDick I’ve made all the doc fixes you requested, so I think this is ready to launch.
@RyanJDick It looks like I can shave off another 0.1-0.2 s by enabling non-blocking transfers. Here are the timings with the profiler spanning the patch/unpatch loops in apply_lora():
NON_BLOCKING=FALSE
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::to 9.51% 19.967ms 74.39% 156.121ms 30.891us 0.000us 0.00% 25.010ms 4.949us 5054
aten::_to_copy 7.85% 16.468ms 73.68% 154.636ms 30.597us 0.000us 0.00% 29.555ms 5.848us 5054
aten::copy_ 5.59% 11.735ms 66.62% 139.816ms 27.664us 31.916ms 34.01% 31.916ms 6.315us 5054
cudaMemcpyAsync 45.75% 96.013ms 45.75% 96.013ms 33.245us 0.000us 0.00% 0.000us 0.000us 2888
cudaStreamSynchronize 13.88% 29.140ms 13.88% 29.140ms 10.090us 0.000us 0.00% 0.000us 0.000us 2888
aten::matmul 0.60% 1.255ms 6.90% 14.485ms 20.062us 0.000us 0.00% 12.486ms 17.294us 722
aten::mm 4.91% 10.299ms 6.68% 14.011ms 19.406us 14.791ms 15.76% 14.791ms 20.486us 722
aten::empty_strided 4.39% 9.207ms 4.39% 9.207ms 1.822us 0.000us 0.00% 0.000us 0.000us 5054
cudaLaunchKernel 3.26% 6.851ms 3.26% 6.851ms 1.581us 0.000us 0.00% 0.000us 0.000us 4332
aten::mul 1.63% 3.423ms 2.15% 4.505ms 6.240us 27.716ms 29.53% 27.716ms 38.388us 722
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 209.872ms
Self CUDA time total: 93.853ms
NON_BLOCKING=TRUE
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::to 12.37% 16.342ms 73.04% 96.475ms 19.089us 0.000us 0.00% 26.937ms 5.330us 5054
aten::_to_copy 5.95% 7.853ms 71.63% 94.603ms 18.718us 0.000us 0.00% 29.446ms 5.826us 5054
aten::empty_strided 7.80% 10.297ms 48.42% 63.958ms 12.655us 0.000us 0.00% 0.000us 0.000us 5054
cudaHostAlloc 40.66% 53.703ms 40.66% 53.703ms 37.190us 0.000us 0.00% 0.000us 0.000us 1444
aten::copy_ 7.27% 9.606ms 18.76% 24.777ms 4.902us 32.473ms 34.34% 32.473ms 6.425us 5054
cudaMemcpyAsync 8.51% 11.245ms 8.51% 11.245ms 3.894us 0.000us 0.00% 0.000us 0.000us 2888
aten::matmul 1.12% 1.476ms 6.71% 8.857ms 12.267us 0.000us 0.00% 12.405ms 17.181us 722
aten::mm 4.22% 5.572ms 6.47% 8.550ms 11.842us 14.772ms 15.62% 14.772ms 20.460us 722
cudaLaunchKernel 5.31% 7.016ms 5.31% 7.016ms 1.620us 0.000us 0.00% 0.000us 0.000us 4332
aten::mul 3.25% 4.299ms 4.08% 5.383ms 7.456us 27.828ms 29.43% 27.828ms 38.543us 722
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 132.080ms
Self CUDA time total: 94.561ms
I also tried adding non_blocking=True to the model locking, IP-Adapter, and textual inversion to() calls, and when I do generations with three LoRAs, one TI, and an IP-Adapter I am getting a ~0.3s/generation speed improvement. This is not so dramatic, but probably worth it.
After this gets merged I'll make another PR with the to() optimizations.
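For reference, the basic pattern behind those to() optimizations looks something like this (a sketch, not the exact InvokeAI change; non-blocking host-to-device copies only overlap with compute when the source tensor is in pinned memory, which lines up with the cudaHostAlloc calls in the profile above):

```python
import torch

cpu_weight = torch.empty(4096, 4096, pin_memory=True)  # page-locked source buffer
gpu_weight = cpu_weight.to("cuda", non_blocking=True)   # enqueues the copy and returns immediately
# ... queue further GPU work that doesn't depend on the copy ...
torch.cuda.synchronize()  # synchronize only when the copied data is actually needed
```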