
LoRA patching optimization

lstein opened this pull request 1 year ago

Summary

This implements a scheme in which the original weights are not restored after LoRA patching, saving a second or so per LoRA. This only occurs under the following conditions:

  1. The execution device is CUDA
  2. The model in question has the load_state_dict() method used for the optimized RAM->VRAM copy operation.
  3. There is insufficient VRAM cache to hold the model.

Otherwise the original weights are restored as per the previous behavior.
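
A minimal sketch of that gating logic, with illustrative names (execution_device, cache_entry, and vram_cache_size are assumptions for illustration, not the actual code):

def skip_weight_restore(execution_device, model, cache_entry, vram_cache_size) -> bool:
    # Original weights are left un-restored only when all three conditions
    # from the description hold.
    on_cuda = execution_device.type == "cuda"
    # The model supports the optimized RAM->VRAM copy path.
    has_fast_copy = hasattr(model, "load_state_dict")
    # The VRAM cache is too small to keep this model resident, so it will be
    # offloaded (and re-copied from the clean RAM weights) anyway.
    too_big_for_cache = cache_entry.size > vram_cache_size
    return on_cuda and has_fast_copy and too_big_for_cache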

NOTE: The lazy_offloading option has been disabled. Lazy offloading defers the VRAM offloading step until the next time the VRAM cache is accessed, and under some circumstances could have led to a patched version of the model being reused.

Related Issues / Discussions

QA Instructions

Merge Plan

Checklist

  • [x] The PR has a short but descriptive title, suitable for a changelog
  • [x] Tests added / updated (if applicable)
  • [x] Documentation added / updated (if applicable)

lstein avatar May 24 '24 23:05 lstein

Do we have a rough idea of what percentage of loras would be eligible for this improved scheme?

psychedelicious avatar May 24 '24 23:05 psychedelicious

100% of LoRAs. All diffusers and safetensors "main" models will benefit. The only exception is ONNX models.

lstein avatar May 25 '24 01:05 lstein

Oh - I misinterpreted the requirements as being for the lora, instead of the main model. Cool!

psychedelicious avatar May 25 '24 01:05 psychedelicious

I feel like we are choosing a solution here without fully understanding the problem. I mentioned this briefly on Discord here: https://discord.com/channels/1020123559063990373/1049495067846524939/1243665704646082592

The answers to some of those questions might lead us to very different solutions.

RyanJDick avatar May 27 '24 16:05 RyanJDick

NOTE: The lazy_offloading option has been disabled. Lazy offloading defers the VRAM offloading step until the next time the VRAM cache is accessed, and under some circumstances could have led to a patched version of the model being reused.

Can you explain this decision in more detail, and what the goal is? There are a couple parts of it that don't make sense to me:

  1. From looking at offload_unlocked_models(...), disabling lazy offloading changes the likelihood that models get offloaded, but doesn't change any guarantees around whether models get offloaded. How does this solve the problem you were facing? If the goal is to guarantee that models get offloaded, then we would have to remove the VRAM cache altogether.
  2. In the non-lazy offloading case we call self._cache.offload_unlocked_models(self._cache_entry.size) on offload. The choice of self._cache_entry.size seems arbitrary - we don't know how much VRAM we are going to need for the next model. Am I missing something here? This seems broken, but it's not new to your change.

The model in RAM is the same size as the model in VRAM. If the model is larger than max_vram_cache then it will be offloaded on the next call to offload_unlocked_models regardless of the size requested in offload_unlocked_models(). Removing lazy_offloading guarantees that offload_unlocked_models will be called before the model is next needed.
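
A sketch of the size check being leaned on here (the real has_transient_weights(...) is referenced later in the thread; the names below are only illustrative of the predicate's shape):

def has_transient_weights(cache_entry, max_vram_cache_size: int) -> bool:
    # If the model alone is larger than the whole VRAM cache budget, it is
    # guaranteed to be evicted on the next offload_unlocked_models() call,
    # so its in-VRAM (patched) weights are transient and need not be restored.
    return cache_entry.size > max_vram_cache_size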

lstein avatar May 27 '24 16:05 lstein

In the non-lazy offloading case we call self._cache.offload_unlocked_models(self._cache_entry.size) on offload. The choice of self._cache_entry.size seems arbitrary - we don't know how much VRAM we are going to need for the next model. Am I missing something here? This seems broken, but it's not new to your change.

You've caught a bug. The size should be zero. We want to offload models that are consuming more space than the VRAM cache limit. In practice this bug makes the VRAM cache less effective.

lstein avatar May 27 '24 17:05 lstein

I think I understand what you are trying to do now. I had missed that has_transient_weights(...) was checking cache_entry.size.

To summarize my understanding: By removing lazy offloading, we are forcing large models ( > max_vram_cache_size ) to be offloaded in slightly more cases than they otherwise would (specifically cases when they are re-used immediately). We do this so that we can accurately predict cases where they will definitely be offloaded, and apply a LoRA optimization in those cases.

Let me think about this some more... there might be a simpler way.

RyanJDick avatar May 27 '24 18:05 RyanJDick

Thinking aloud, I have a few concerns about the current approach:

  • Whether the LoRA optimization gets applied depends on the relative sizes of the model and the VRAM cache. This makes it hard to tune the VRAM cache size, because there are 2 things changing at once (offloading behaviour and lora patching behaviour).
  • has_transient_weights(...) is a low-recall, perfect-precision predictor of whether the weights will be offloaded. I.e. it is really determining whether the weights will definitely be offloaded. There are cases where the weights will be offloaded, but it's hard to predict so we miss out on opportunities to apply the LoRA optimizations.
  • The current heuristic happens to work reasonably well given the relative size of the stable diffusion sub-models, but we shouldn't bake that into our implementation. Imagine, for example, that the VAE was larger than the UNet and TEs. In that context, it's less obvious to me that getting rid of lazy unloading is a good idea.

Brainstorming

There are 2 broad types of model patches:

  • Weight-only: Patches that only modify the weights of the model. These can be unpatched by either an unpatch operation or VRAM offload.
  • Shape: Patches that modify the structure/shape of the model (e.g. TI) - these can only be unpatched by an unpatch operation.

For weight-only patches, when should we unpatch via VRAM offload?

  • If we must offload the model before its next use to make room for another one, then we should unpatch via VRAM offload.
  • If we don't have to offload the model before its next use, then we should choose the faster of: 1) unpatch operation, 2) VRAM offload + VRAM load. (A decision sketch follows this list.)
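
A hedged sketch of that decision (function and parameter names are hypothetical; as noted below, the hard part in practice is predicting the must-offload case):

def choose_unpatch_strategy(must_offload_before_next_use: bool,
                            t_explicit_unpatch: float,
                            t_offload_plus_reload: float) -> str:
    # Weight-only patches can be undone either explicitly or by discarding
    # the VRAM copy and reloading clean weights from RAM.
    if must_offload_before_next_use:
        # The offload happens anyway, so unpatching via offload is free.
        return "unpatch_via_offload"
    # Otherwise pick whichever path costs less on this hardware/workload.
    return ("explicit_unpatch" if t_explicit_unpatch <= t_offload_plus_reload
            else "unpatch_via_offload")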

The main complicating factor is that it can be hard to predict whether a model will need to be offloaded before its next use. On a 4GB GPU, offloads are frequent. On an 8GB GPU, it depends on your workload. On a 16GB GPU, offloads are uncommon.

Option 0 (the current implementation):

  • Downside: At a certain GPU size (8GB? 16GB?) you probably get better results by keeping lazy offload and just making sure your VRAM cache size is large enough.

Option 1: Just make LoRA patching faster

  • What is the slow part of LoRA patching/unpatching? Can we make it 10x faster? Then these unpatch-via-offload optimizations become almost irrelevant.
  • Do we have unnecessary CUDA synchronizations in the patching process?
  • If it's the VRAM-to-CPU copy during patching, could we give the patcher access to the model cache's CPU copy for this?

Option 2: unpatch_lora_by_vram_offload Flag

  • Use a config flag (e.g. unpatch_lora_by_vram_offload) to force LoRA unpatch via offloading globally, irrespective of the VRAM cache size (a configuration sketch follows this list).
  • This makes it much easier to tune configs and know what's going on.
  • We could optionally make this configurable on a per-model basis. If we do that, then I think this is strictly better than Option 0. (We could configure it to behave like Option 0, if desired.)
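
A minimal sketch of what such a flag could look like (illustrative only; the field name comes from the option above, while the surrounding config class is an assumption and not InvokeAI's actual schema):

from dataclasses import dataclass

@dataclass
class CacheConfigSketch:
    # Existing-style cache budget plus the proposed override flag.
    max_vram_cache_size: float = 0.25            # GB
    unpatch_lora_by_vram_offload: bool = False   # Option 2: global override

def unpatch_via_offload(cfg: CacheConfigSketch) -> bool:
    # With the flag, patching behaviour no longer depends on the relative
    # sizes of the model and the VRAM cache, so cache tuning stays independent.
    return cfg.unpatch_lora_by_vram_offload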

Option 3: Unpatch stack

  • As patches are applied, register a stack of unpatch operations that must be applied. Only apply them if the model isn't offloaded before its next use.
  • Handling interactions between weight-only and shape patches would be complicated.
  • If we are willing to add this level of complexity, we might as well figure out how to avoid unnecessary unpatch/repatch as well.

What do you think? Do you agree with the downsides of the current approach? Does this spark any ideas?

RyanJDick avatar May 27 '24 19:05 RyanJDick

I wonder how this is managed in other applications that regularly load and modify large models? I suppose the most accessible prior art would be other image gen applications (comfy, A1111, SD.Next, Forge, Fooocus) - I'd guess LLMs don't manipulate weights as frequently. It could be that this is a solved problem.

I also wonder if memory management should be so opaque to the user. Maybe it's better to lean toward optimizations being configurable with flags, and let the user tune the app. There are bound to be systems or usage patterns where our optimizations are detrimental.

psychedelicious avatar May 27 '24 21:05 psychedelicious

Here are some statistics on LoRA patching time. Timings are for an SDXL model with the "add_detail_xl" and "VoxelXL_v1" LoRAs:

Time to patch the weights in VRAM: 0.70s
Time to copy original weights to RAM: 1.27s
Time to restore original weights: 0.43s

So the overhead for weight copying and restoration is 1.7s of the 2.4s total, or about 71%.

lstein avatar May 27 '24 22:05 lstein

Options 1 and 3 will create a dependency between the model patcher and the model manager, which is something I wanted to avoid for no other reason than it requires passing a pointer to the model manager service through several layers of code and makes it difficult to use the model patcher module outside of the web app.

Option 4: Flag Models that Need to be Offloaded

However, if this is OK architecturally, then I would add a fourth option, which is to allow the patcher to set a dirty flag in the model's cache entry (presumably via a method call to the model manager). This would tell the model manager to unconditionally offload the model from VRAM. This would be coupled to an unpatch_lora_by_vram_offload config flag that gives the user control of this behavior.

[Edit] This preserves the performance benefits of the VRAM cache and lazy_offloading without adding too much complexity.
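
A rough sketch of the dirty-flag interaction (all names are hypothetical; the point is only the direction of the call, patcher -> cache entry -> offload behavior):

class CacheEntrySketch:
    # Illustrative stand-in for a model-cache record.
    def __init__(self, size: int) -> None:
        self.size = size
        self.vram_copy_is_dirty = False  # set when weights were patched in place

def mark_patched_without_backup(entry: CacheEntrySketch) -> None:
    # Called by the patcher (via the model manager) after patching without
    # first saving the original weights to RAM.
    entry.vram_copy_is_dirty = True

def should_offload_unconditionally(entry: CacheEntrySketch) -> bool:
    # The cache must discard a dirty VRAM copy rather than lazily reusing it;
    # the clean copy still lives in RAM and can be re-copied on the next load.
    return entry.vram_copy_is_dirty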

lstein avatar May 27 '24 22:05 lstein

  1. In the non-lazy offloading case we call self._cache.offload_unlocked_models(self._cache_entry.size) on offload. The choice of self._cache_entry.size seems arbitrary - we don't know how much VRAM we are going to need for the next model. Am I missing something here? This seems broken, but it's not new to your change.

This is addressed in PR #6450. Most if not all people have lazy_offloading enabled, so this bit of code was rarely called. It would have had the performance impact of clearing the VRAM cache unnecessarily.

lstein avatar May 28 '24 00:05 lstein

Here are some statistics on LoRA patching time. Timings are for an SDXL model with the "add_detail_xl" and "VoxelXL_v1" LoRAs:

Time to patch the weights in VRAM: 0.70s
Time to copy original weights to RAM: 1.27s
Time to restore original weights: 0.43s

So the overhead for weight copying and restoration is 1.7s of the 2.4s total, or about 71%.

Building on this, my intuition tells me that "Time to patch the weights in VRAM" grows linearly with number of LoRAs, while the time to copy/restore is constant regardless of number of LoRAs. Does that match your tests? Most people probably use 0-3 LoRAs, so that doesn't change things too much.
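
As a rough back-of-the-envelope under that assumption (and assuming the 0.70s figure above covers both LoRAs, so roughly 0.35s per LoRA):

$$ t_{\text{total}}(N) \approx \underbrace{1.27\,\mathrm{s} + 0.43\,\mathrm{s}}_{\text{copy + restore}} + N \times 0.35\,\mathrm{s} $$

which gives roughly 2.4s at N = 2 (matching the measurement), with the ~1.7s copy/restore term fixed regardless of N.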

Does "Time to patch the weights in VRAM" include time to load the original LoRA state_dict into VRAM? Is that negligible?

For comparison, on this test system, what is the time to load the TEs and UNet from RAM to VRAM?

Options 1 and 3 will create a dependency between the model patcher and the model manager, which is something I wanted to avoid for no other reason than it requires passing a pointer to the model manager service through several layers of code and makes it difficult to use the model patcher module outside of the web app.

For Option 3, I agree. (For the record, I think Option 3 is a pretty bad idea 😆 )

For Option 1, I don't think it needs to introduce any dependency between the two. For example:

with (
    model_cache.model_on_device(model_info) as (read_only_cpu_sd, model_on_device),
    apply_lora(lora, model_on_device, read_only_cpu_sd)
):
    ...

Should workshop the API, but the point is that any potential coupling can be moved up to the caller.
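
One shape the proposed model_cache.model_on_device(...) API could take (a sketch only; attribute names such as cached_model.cpu_state_dict and cached_model.model are assumptions, not the real model-cache interface):

from contextlib import contextmanager
from typing import Dict, Iterator, Tuple

import torch

@contextmanager
def model_on_device(cached_model) -> Iterator[Tuple[Dict[str, torch.Tensor], torch.nn.Module]]:
    # Yield (read-only CPU state dict, model on the execution device).
    # The CPU state dict is the cache's own RAM copy, so the patcher can
    # restore weights from it without ever copying VRAM weights back to CPU.
    cpu_state_dict = cached_model.cpu_state_dict  # assumed: kept by the RAM cache
    gpu_model = cached_model.model                # assumed: already locked in VRAM
    try:
        yield cpu_state_dict, gpu_model
    finally:
        pass  # cache-entry lock/unlock bookkeeping would go here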

RyanJDick avatar May 28 '24 13:05 RyanJDick

For Option 1, I don't think it needs to introduce any dependency between the two. For example:

with (
    model_cache.model_on_device(model_info) as (read_only_cpu_sd, model_on_device),
    apply_lora(lora, model_on_device, read_only_cpu_sd)
):
    ...

Should workshop the API, but the point is that any potential coupling can be moved up to the caller.

Yeah, I was going to propose something similar, but using a config file setting to decide whether the model should be read only or not, rather than asking the model manager to make the decision. I think that @psychedelicious is right to leave the optimization decision to the user.

Here's my version of the pseudocode:

# in invokeai.yaml
unpatch_lora_weights: false

# in the invocation
read_only_model = not app_config.unpatch_lora_weights
with model_info(read_only_model) as model_on_device:
    apply_lora(lora, model_on_device, read_only_model)

When read_only_model is true, the LoadedModel context manager will set a flag in the cache indicating that the model manager should dispose of the VRAM copy of the model and disable lazy offloading for this model. The model patcher will skip the save/restore steps.

You're quite right that the weight save/restore step is a one-time overhead that doesn't significantly change as you add more LoRAs (it scales with the total number of unique keys that are patched). The patching timing that I reported included the time to move the LoRA weights into VRAM, but I didn't break that out from the time to update the model weights.

What's the next step?

lstein avatar May 28 '24 23:05 lstein

For Option 1, I don't think it needs to introduce any dependency between the two. For example:

with (
    model_cache.model_on_device(model_info) as (read_only_cpu_sd, model_on_device),
    apply_lora(lora, model_on_device, read_only_cpu_sd)
):
    ...

Something that I think may have been missed in this pseudo-code is that read_only_cpu_sd is the full CPU state dict, not a bool. This has the advantage that we could save the 1.27s to copy weights to CPU in all cases, without making the user decide whether they want to enable unpatch_lora_weights based on their HW and usage patterns.

What's the next step?

I think we are converging on a couple of options. A final proposal summarizing those options and the tradeoffs would be helpful.

The one thing that I think is still unknown is whether the current LoRA patching / unpatching is well-optimized. There's a bunch of weight copying intertwined with matmuls in there - it's plausible to me that we could make it 10x faster, but I really have no idea. Optimizing that code might change where the bottlenecks are.

@lstein if you're interested in looking into optimizing the existing LoRA patching code, it could make sense to start there. Otherwise, I think we could proceed with a final proposal based on what we currently know.

RyanJDick avatar May 30 '24 13:05 RyanJDick

For Option 1, I don't think it needs to introduce any dependency between the two. For example:

with (
    model_cache.model_on_device(model_info) as (read_only_cpu_sd, model_on_device),
    apply_lora(lora, model_on_device, read_only_cpu_sd)
):
    ...

Something that I think may have been missed in this pseudo-code is that read_only_cpu_sd is the full CPU state dict, not a bool. This has the advantage that we could save the 1.27s to copy weights to CPU in all cases, without making the user decide whether they want to enable unpatch_lora_weights based on their HW and usage patterns.

What's the next step?

I think we are converging on a couple of options. A final proposal summarizing those options and the tradeoffs would be helpful.

The one thing that I think is still unknown is whether the current LoRA patching / unpatching is well-optimized. There's a bunch of weight copying intertwined with matmuls in there - it's plausible to me that we could make it 10x faster, but I really have no idea. Optimizing that code might change where the bottlenecks are.

@lstein if you're interested in looking into optimizing the existing LoRA patching code, it could make sense to start there. Otherwise, I think we could proceed with a final proposal based on what we currently know.

I understand now. That's very clever and lets us save the step of saving the original weights to CPU. To recap:

  1. Add a model_on_device() method to the model cache that returns a copy of the state dictionary in CPU RAM.
  2. The SD is passed to the model patcher.
  3. During model patching, the patcher records the keys (but not the tensors) of the layers that were patched.
  4. During model unpatching, the patcher copies the tensors from the SD back into the model.

The only drawback I see is that in the event that the model doesn't fit into the VRAM cache, the unpatching step will be redundant because the model will be removed from VRAM.
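
A condensed sketch of steps 2-4 (hand-rolled names; it only illustrates recording the patched keys and restoring them from the CPU state dict handed over by the cache):

from typing import Dict, List

import torch

class LoRAPatcherSketch:
    # Illustrative patcher that restores weights from the cache's CPU copy
    # instead of saving originals out of VRAM.

    def __init__(self, cpu_state_dict: Dict[str, torch.Tensor]) -> None:
        self._cpu_sd = cpu_state_dict
        self._patched_keys: List[str] = []

    @torch.no_grad()
    def patch(self, model: torch.nn.Module, deltas: Dict[str, torch.Tensor]) -> None:
        params = dict(model.named_parameters())
        for key, delta in deltas.items():
            # Patch in place on the GPU; remember the key, not the original tensor.
            params[key].add_(delta.to(device=params[key].device, dtype=params[key].dtype))
            self._patched_keys.append(key)

    @torch.no_grad()
    def unpatch(self, model: torch.nn.Module) -> None:
        params = dict(model.named_parameters())
        for key in self._patched_keys:
            # Restore from the read-only CPU state dict; no VRAM->RAM backup needed.
            params[key].copy_(self._cpu_sd[key])
        self._patched_keys.clear()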

lstein avatar Jun 02 '24 13:06 lstein

@lstein if you're interested in looking into optimizing the existing LoRA patching code it could make sense to start there. Otherwise, I think we could proceed with a final a proposal based on what we currently know.

I've tried a few small optimizations, including downcasting the lora to the same dtype as the model, not upcasting at the end, and not performing the to(torch.device('cpu')) call. I was able to shave off about 0.03s - not very useful!

It looks like most of the action will be in LoRALayer.get_weight(), which is doing a lot of reshaping and summing. I can profile these steps, but it isn't obvious to me how to optimize them. Do you have an intuition where the biggest payoff will be?

lstein avatar Jun 03 '24 02:06 lstein

@lstein if you're interested in looking into optimizing the existing LoRA patching code it could make sense to start there. Otherwise, I think we could proceed with a final a proposal based on what we currently know.

I've tried a few small optimizations, including downcasting the lora to the same dtype as the model, not upcasting at the end, and not performing the to(torch.device('cpu')) call. I was able to shave off about 0.03s - not very useful!

It looks like most of the action will be in LoRALayer.get_weight(), which is doing a lot of reshaping and summing. I can profile these steps, but it isn't obvious to me how to optimize them. Do you have an intuition where the biggest payoff will be?

If you use the Torch profiler (https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html - see the section on viewing a trace in the Chrome trace viewer), you should be able to see whether the process is bottlenecked on CPU operations, weight copies, or GPU operations. Ideally, I think we'd want to maximize GPU utilization. If there are opportunities for big improvements, I'd guess that they would come from eliminating unnecessary CUDA synchronizations (https://pytorch.org/docs/stable/notes/cuda.html#asynchronous-execution). Probably by either changing the sequence of copy and matmul operations, or making use of .to(..., non_blocking=True)
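
For reference, a minimal way to capture such a trace (apply_lora_patches() below is just a placeholder for the patch/unpatch code being measured):

import torch
from torch.profiler import ProfilerActivity, profile

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    apply_lora_patches()  # placeholder for the code under test

# Summary table like the ones further down in this thread:
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
# Timeline for chrome://tracing or https://ui.perfetto.dev:
prof.export_chrome_trace("lora_patching_trace.json")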

RyanJDick avatar Jun 03 '24 13:06 RyanJDick

@RyanJDick I've implemented the model_on_device() idea and it seems to be performing as expected. Please take a look.

If acceptable, I propose merging this optimization and then looking into profiling and optimizing further as you suggest.

lstein avatar Jun 03 '24 21:06 lstein

@RyanJDick I’ve made all the doc fixes you requested, so I think this is ready to launch.

lstein avatar Jun 06 '24 01:06 lstein

@RyanJDick It looks like I can shave off another 0.1-0.2 s by enabling non-blocking transfers. Here are the timings with the profiler spanning the patch/unpatch loops in apply_lora():

NON_BLOCKING=FALSE
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                               aten::to         9.51%      19.967ms        74.39%     156.121ms      30.891us       0.000us         0.00%      25.010ms       4.949us          5054  
                                         aten::_to_copy         7.85%      16.468ms        73.68%     154.636ms      30.597us       0.000us         0.00%      29.555ms       5.848us          5054  
                                            aten::copy_         5.59%      11.735ms        66.62%     139.816ms      27.664us      31.916ms        34.01%      31.916ms       6.315us          5054  
                                        cudaMemcpyAsync        45.75%      96.013ms        45.75%      96.013ms      33.245us       0.000us         0.00%       0.000us       0.000us          2888  
                                  cudaStreamSynchronize        13.88%      29.140ms        13.88%      29.140ms      10.090us       0.000us         0.00%       0.000us       0.000us          2888  
                                           aten::matmul         0.60%       1.255ms         6.90%      14.485ms      20.062us       0.000us         0.00%      12.486ms      17.294us           722  
                                               aten::mm         4.91%      10.299ms         6.68%      14.011ms      19.406us      14.791ms        15.76%      14.791ms      20.486us           722  
                                    aten::empty_strided         4.39%       9.207ms         4.39%       9.207ms       1.822us       0.000us         0.00%       0.000us       0.000us          5054  
                                       cudaLaunchKernel         3.26%       6.851ms         3.26%       6.851ms       1.581us       0.000us         0.00%       0.000us       0.000us          4332  
                                              aten::mul         1.63%       3.423ms         2.15%       4.505ms       6.240us      27.716ms        29.53%      27.716ms      38.388us           722  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 209.872ms
Self CUDA time total: 93.853ms

NON_BLOCKING=TRUE
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                               aten::to        12.37%      16.342ms        73.04%      96.475ms      19.089us       0.000us         0.00%      26.937ms       5.330us          5054  
                                         aten::_to_copy         5.95%       7.853ms        71.63%      94.603ms      18.718us       0.000us         0.00%      29.446ms       5.826us          5054  
                                    aten::empty_strided         7.80%      10.297ms        48.42%      63.958ms      12.655us       0.000us         0.00%       0.000us       0.000us          5054  
                                          cudaHostAlloc        40.66%      53.703ms        40.66%      53.703ms      37.190us       0.000us         0.00%       0.000us       0.000us          1444  
                                            aten::copy_         7.27%       9.606ms        18.76%      24.777ms       4.902us      32.473ms        34.34%      32.473ms       6.425us          5054  
                                        cudaMemcpyAsync         8.51%      11.245ms         8.51%      11.245ms       3.894us       0.000us         0.00%       0.000us       0.000us          2888  
                                           aten::matmul         1.12%       1.476ms         6.71%       8.857ms      12.267us       0.000us         0.00%      12.405ms      17.181us           722  
                                               aten::mm         4.22%       5.572ms         6.47%       8.550ms      11.842us      14.772ms        15.62%      14.772ms      20.460us           722  
                                       cudaLaunchKernel         5.31%       7.016ms         5.31%       7.016ms       1.620us       0.000us         0.00%       0.000us       0.000us          4332  
                                              aten::mul         3.25%       4.299ms         4.08%       5.383ms       7.456us      27.828ms        29.43%      27.828ms      38.543us           722  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 132.080ms
Self CUDA time total: 94.561ms

I also tried adding non_blocking=True to the model locking, IP_adapter, and textual inversion to() calls, and when I do generations with three LoRAs, one TI and an IP-Adapter I am getting a ~0.3s/generation speed improvement. This is not so dramatic, but probably worth it.

After this gets merged I'll make another PR with the to() optimizations.
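
A minimal illustration of that kind of to() change (illustrative names; the gain depends on the source tensor being in pinned host memory, since CPU->GPU copies from pageable memory are effectively synchronous even with non_blocking=True):

import torch

def copy_weight_to_gpu(weight_cpu: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Pinned memory lets the host-to-device copy overlap with other work;
    # the cudaHostAlloc entries in the NON_BLOCKING=TRUE trace above are
    # presumably this pinning cost.
    return weight_cpu.pin_memory().to(device, non_blocking=True)

# Callers must synchronize (or keep dependent ops on the same CUDA stream)
# before relying on the copied values, e.g. torch.cuda.synchronize(device).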

lstein avatar Jun 06 '24 13:06 lstein