
ModelPatcher Overhaul and Hook Support

Open Kosinkadink opened this issue 1 year ago • 14 comments

This PR merges all changes from the improved_memory branch, and expands ModelPatcher + transformer_options to allow different weights and properties to be applied to selected conditioning. This is done by introducing a hook design pattern, where conditioning and CLIP can have hooks attached to change their behavior at sample time; previously, this was hardcoded for specific things like controlnet and gligen.

I did not find any memory or performance regression in my testing, but more testing would be good; I will try to get some folks to test out this branch alongside the corresponding rework_modelpatcher branches in AnimateDiff-Evolved and Advanced-ControlNet that make use of the new functionality.

Remaining TODO:

  • [x] Fix VRAM usage exceeding expectations when applying weight hooks
    • SOLVED: weights needed to be copied in place, and the weight backups should be copied to avoid being overridden
  • [ ] Make sure weights loaded in lowvram mode get modified appropriately by WeightHooks
  • [ ] Figure out why flux LoRAs do not produce the same results when applied via Hook compared to with Load LoRA node (issue does not exist for other model loras as far as I can tell)

Breaking Changes:

  • ControlNet's get_control function now takes transformer_options as a required parameter; if a custom node overrides the built-in calc_cond_batch function with its own, it will now error at execution time. It should be an easy fix for any affected nodes (see the sketch below); the only one I can think of off the top of my head is TiledDiffusion.
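
A minimal sketch of the kind of change an affected custom node would need; the class and argument names here are hypothetical, and only the extra transformer_options argument reflects the change described above.

class MyControlApplier:
    """Hypothetical helper in a custom node that calls get_control()."""
    def __init__(self, control):
        self.control = control

    def apply(self, x_noisy, timestep, cond, batched_number, model_options):
        transformer_options = model_options.get("transformer_options", {})
        # Before this PR: self.control.get_control(x_noisy, timestep, cond, batched_number)
        # After this PR, transformer_options must be passed as well:
        return self.control.get_control(x_noisy, timestep, cond, batched_number, transformer_options)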

Features:

  • ModelPatcher additions
    • Weight Hooks
      • Different lora and model-as-lora weights can be attached and scheduled on CLIP/conditioning.
    • Wrappers
      • Instead of requiring nodes to overwrite or hack into important functions to change their behavior, ModelPatcher and model_options support wrapper functions; an executor is automatically passed into each wrapper so that wrapping happens in a predictable manner (see the sketch after this list). Since there is no limitation on the names of wrapper functions, some custom nodes could decide to expose extending their own functionality to other nodes through the wrapper system. The cost of wrapping is imperceptibly low, so more wrapper support can be added upon need/request.
    • Callbacks
      • Similar to wrappers, but callbacks can be used to extend ModelPatcher functions and avoid the need for hacky ModelPatcher inheritance in cases where wrapping wouldn't make sense. Same as wrappers, more callbacks can be added upon need/request.
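
A rough sketch of the executor-style wrapper pattern described under Wrappers; the function name and the print statements are illustrative, and the exact registration API should be taken from the PR diff rather than from this sketch.

def my_outer_sample_wrapper(executor, *args, **kwargs):
    # The wrapper receives an executor for the next item in the chain (either
    # the wrapped function itself or the next wrapper) and decides what to do
    # before, after, or instead of calling it.
    print("before sampling")
    try:
        return executor(*args, **kwargs)
    finally:
        print("after sampling")

Registration presumably hangs off ModelPatcher/model_options under a wrapper-type key, as described above.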

Kosinkadink avatar Nov 11 '24 18:11 Kosinkadink

I gave this a very quick spin since I'm keen on getting rid of hacks in prompt control; I had some OOM issues, but I think those happened because the backup weights seem to be kept in VRAM and sometimes that's too much for my 12GB GPU.

I have some questions though:

  • What's the intended use case for CreateHookLora and RegisterHookLora? The latter looks like it adds the LoRA directly to the modelpatcher, but there doesn't seem to be a way to set keyframes for it without passing things through CondPairSetProps anyway, and both of them seem to do something at least.
  • The Create Hook Keyframes from Floats node takes a FLOATS input that no node seems to provide. Is that just missing for now?
  • Can we have some debug-level logging for when hooks actually get called so it's easier to tell what is happening?

asagi4 avatar Nov 11 '24 20:11 asagi4

The Register nodes will be removed later today before I take this PR off of draft, I had those early on to double check behavior was consistent with my original implementation of this from AnimateDiff-Evolved. I will also include some example workflows that use these features. I just wanted to make the draft to have a page to edit as I add more details.

As for the VRAM usage for backup, there is code there to see if there is enough VRAM available to cache weights in VRAM vs offloading, I will review it again on my 8GB machine to see if there is something lacking with it. My hunch is the code would need to be more conservative about how much VRAM it should deem acceptable to use before switching to offloading the cached weights.

The FLOATS output is currently only produced by a couple of nodes, including the Value Scheduling node from AnimateDiff-Evolved. I wanted to differentiate it from FLOAT since it is a list of floats, and to slowly transition away from the currently popular approach in the community of outputting a FLOAT type containing a list instead of a single float.
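
For reference, a FLOATS-producing node could be as small as the sketch below; the FLOATS type name matches the discussion here, while the node class itself is purely illustrative.

class FloatsFromString:
    """Illustrative node: parse a comma-separated string into a FLOATS list."""
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"values": ("STRING", {"default": "0.0, 0.5, 1.0"})}}

    RETURN_TYPES = ("FLOATS",)
    FUNCTION = "convert"
    CATEGORY = "utils"

    def convert(self, values: str):
        floats = [float(v.strip()) for v in values.split(",") if v.strip()]
        return (floats,)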

Kosinkadink avatar Nov 11 '24 20:11 Kosinkadink

@Kosinkadink a separate input type sounds reasonable, though in that case the PR should include a node to create one. Otherwise you'll probably have lots of people like me asking what it's about if they don't have AnimateDiff installed.

asagi4 avatar Nov 11 '24 20:11 asagi4

I've been using this branch for a few days without any issues. Not very scientific or anything, but I tend to use complicated workflows with a lot of custom nodes so I think I'd be more likely to run into a problem than the average person. (Nothing that actually touches the new features added by this pull, though.)

Not sure if it's within the scope of this pull, but something like hooks/interceptable events for stuff like sampling start is sorely needed. There are lots of things that care about steps and custom nodes currently have to do some very hacky stuff like monkeypatching the base sampler function if they want to even know when sampling started.

blepping avatar Nov 18 '24 01:11 blepping

@blepping As part of this PR, the CFGGuider class has a couple of wrappers exposed for it, notably the OUTER_SAMPLE wrapper: https://github.com/comfyanonymous/ComfyUI/pull/5583/files#r1846731963

All samplers use the CFGGuider class and will be going through this pathway, and there was previously no way for custom nodes to cleanly modify its behavior as it's a class method. I have corresponding PRs for AnimateDiff-Evolved and Advanced-ControlNet that make use of this wrapper among other features in these PRs: https://github.com/Kosinkadink/ComfyUI-AnimateDiff-Evolved/pull/498 https://github.com/Kosinkadink/ComfyUI-Advanced-ControlNet/pull/198

At that stage of sampling prep, steps are abstracted away into individual sigmas; only the KSampler and KSampler (Advanced) pathways have any knowledge of the steps used to get the sigmas. A simple solution for letting custom nodes modify the behavior of these two step samplers would be to add a new wrapper endpoint, in this PR or the next, for the KSampler.sample function in samplers.py; since the sample.sample function calls it to perform the sampling, this would also extend the behavior to any third-party KSampler node that makes use of the KSampler class internally. Modifying the model_options inside this wrapper function would allow 'sharing' the steps corresponding to the sigmas with the OUTER_SAMPLE wrapper.
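
A rough sketch of what such a wrapper could look like, assuming the endpoint follows the same executor convention; the argument and key names are guesses based on the description above, not the actual PR API.

def ksampler_steps_wrapper(executor, *args, **kwargs):
    # Hypothetical: record the step count in model_options so an OUTER_SAMPLE
    # wrapper further out can map sigmas back to step indices.
    steps = kwargs.get("steps")  # assumed to be reachable here
    model_options = kwargs.get("model_options")
    if steps is not None and model_options is not None:
        model_options.setdefault("transformer_options", {})["shared_steps"] = steps
    return executor(*args, **kwargs)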

Kosinkadink avatar Nov 18 '24 15:11 Kosinkadink

@asagi4 I think I've cracked down on the memory issue you noticed, it should no longer result in OOMs.

Kosinkadink avatar Nov 18 '24 15:11 Kosinkadink

As part of this PR, the CFGGuider class has a couple of wrappers exposed for it, notably the OUTER_SAMPLE wrapper

Nice! Just knowing when sampling starts + getting a list of sigmas will be extremely useful (and even for multistep samplers, it's possible to guess what step a sigma corresponds to even if it's in between the steps).
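
For example, a rough way to map a sigma back to a step index given the schedule (just a sketch, not from the PR):

import torch

def guess_step(sigma: float, sigmas: torch.Tensor) -> int:
    # Return the index of the schedule entry closest to the current sigma;
    # for multistep samplers the sigma can fall between entries, so this is
    # only an approximation of the "current step".
    return int(torch.argmin((sigmas - sigma).abs()).item())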

A simple solution for modifying the behavior of these two step samplers via custom nodes would be to add a new wrapper endpoint in this PR

Not entirely sure I'm understanding correctly, but I don't think it could really be done with a wrapper. Internally, samplers usually just have a for loop over the sigmas. I think it would need to be something more like an event the sampler voluntarily broadcasts to interested listeners.

model.broadcast_event("sampler_step_start", {"x": x, "step": step})

Very simple sketch of the approach, not necessarily the information you'd want to pass to it.
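
Expanding on that sketch, the hypothetical event could be emitted from inside the sampler's own loop, for example:

def sample_with_events(model, x, sigmas, extra_args=None):
    # Purely hypothetical: a sampler loop that notifies listeners at the start
    # of each step instead of needing a wrapper around the whole sample call.
    extra_args = {} if extra_args is None else extra_args
    for step in range(len(sigmas) - 1):
        model.broadcast_event("sampler_step_start", {"x": x, "step": step, "sigma": sigmas[step]})
        x = model(x, sigmas[step], **extra_args)  # stand-in for the real per-step update
    return x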

blepping avatar Nov 18 '24 15:11 blepping

Ah, if you mean getting access to the internals of the sampling functions in comfy/k_diffusion/sampling.py, that would likely be best implemented as a more thorough callback via the new callbacks feature in a future PR. The 'callback' function that currently exists there is mostly for tracking info and is a bit of a pain to repurpose for other things.

Kosinkadink avatar Nov 18 '24 15:11 Kosinkadink

@Kosinkadink Yeah, I did some basic testing and got no OOMs.

Am I interpreting the "Set CLIP hooks" node correctly in that if "schedule_clip" is enabled, then LoRA hooks are applied to the text encoder, and if "apply_to_conds" is enabled, the hooks get applied to the diffusion model at sampling time?

Actually, how does CLIP scheduling work with this system, since when the text encoder node runs, no information is available to decide what hooks should apply? I guess I could go read the code but there's a lot of it. :-)

asagi4 avatar Nov 18 '24 16:11 asagi4

if you mean getting access to the internals of the sampling functions in comfy.k_diffusion.sampling.py, that likely would be best implemented in a more thorough callback function via the new callbacks feature in a future PR.

Well, I meant in general, not necessarily just the ones built in to ComfyUI. I have some node packs that publish custom samplers also, so if a feature like that existed I'd definitely use it in my own stuff as well.

Fair enough about it being something for a future PR though. Thanks for the response!

blepping avatar Nov 18 '24 16:11 blepping

@asagi4 CLIP scheduling works by encoding a version of the conditioning for each value in the schedule, and uses clip_start_percent and clip_end_percent to determine at sample time which conds are irrelevant to a particular step in sampling (and works together with the existing start_percent and end_percent as expected). Code for that is found here: https://github.com/comfyanonymous/ComfyUI/pull/5583/files#r1846935612

For the Set CLIP Hooks node, as long as hooks are connected to it, they will be applied when encoding text.

However, when schedule_clip is set to false, the clip strength will not be modified by any connected hook keyframes. The reason I leave this off by default is that, as you can infer from my description of how the CLIP scheduling works, each value in the schedule requires an additional encoding of the text. 10 different keyframes would mean 10x the time required to encode the conds, since 10 conds would be added with different clip_start_percent/clip_end_percent values. Also, from testing, scheduling the clip strength hasn't produced results that looked significantly better than just applying the unscheduled values, but I know it's something peeps would likely find a use for.

And apply_to_conds adds hooks to the resulting conditioning so they can be applied to the model at sample time. I added this so that in the long term it will not be necessary to connect hooks to both Set Clip Hooks node and Cond/Cond Pair Set Props nodes. Putting them on both is still fine, and can be mixed and matched, but from my own testing it is easy to forget to put the hooks on the conds after putting them on the CLIP, so this is meant as a quality of life feature.

The only caveat is that for existing custom nodes that do text encoding, whether apply_to_conds works without editing any code on their end depends on how they call clip.encode_from_tokens(). If the call has return_dict=True, it will automatically include hooks in the returned dict as expected for sampling. Otherwise, they'd need to modify their code with a clip.add_hooks_to_dict() call that I added for easy handling. And schedule_clip is handled automatically by the clip.encode_from_tokens_scheduled() function, which returns the finalized conds list with the hooks and clip_start/clip_end stuff included.
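
A rough sketch of the two paths from a custom node's perspective; treat the exact signatures as assumptions and check the PR diff for the real ones.

def encode_scheduled(clip, text):
    # Scheduled path: returns the finalized conds list with hooks and the
    # clip_start_percent/clip_end_percent values already included.
    tokens = clip.tokenize(text)
    return clip.encode_from_tokens_scheduled(tokens)

def encode_legacy(clip, text):
    # Tuple-style path: hooks must be attached afterwards via the
    # add_hooks_to_dict() helper mentioned above.
    tokens = clip.tokenize(text)
    cond, pooled = clip.encode_from_tokens(tokens, return_pooled=True)
    extra = {"pooled_output": pooled}
    clip.add_hooks_to_dict(extra)
    return [[cond, extra]]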

Kosinkadink avatar Nov 18 '24 17:11 Kosinkadink

@Kosinkadink ah, I see. So it sounds like it might sometimes end up encoding prompts unnecessarily when prompt editing is in use, e.g. if you want to switch at 50% to another prompt with a different set of hooks, but it sounds like it shouldn't be too difficult to work around that if it ever becomes an issue for performance.

Would it be possible to add optional start_pct and end_pct parameters (or equivalent) to encode_from_tokens_scheduled so that callers can specify what range of hooks they need to apply?

asagi4 avatar Nov 18 '24 18:11 asagi4

Sure, I'll add em as optional parameters to the function.

Kosinkadink avatar Nov 18 '24 18:11 Kosinkadink

@asagi4 I made an addition to the encode_from_tokens_scheduled code to use the start_percent and end_percent in the provided optional add_dict parameter to skip encoding any conds that would fall outside that range. And since these params are in the add_dict, the start_percent and/or end_percent will automatically be included in the final cond dicts; two birds with one stone. There is not much reason to limit which scheduled clip gets encoded without also including the start_percent and end_percent that limited it in the same conds, so I think it makes sense.
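
In practice that presumably looks something like the sketch below; parameter names follow the description above rather than being copied from the code.

def encode_first_half(clip, text):
    tokens = clip.tokenize(text)
    # Only schedule entries overlapping the 0.0-0.5 range get encoded, and
    # start_percent/end_percent are carried into the resulting cond dicts.
    return clip.encode_from_tokens_scheduled(tokens, add_dict={"start_percent": 0.0, "end_percent": 0.5})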

Kosinkadink avatar Nov 18 '24 18:11 Kosinkadink

I'm trying to implement a basic version of prompt control on top of this mechanism and it seems encode_from_tokens_scheduled throws an NPE if there are no hooks set. It would be convenient for it to gracefully fall back to regular encoding in that case.

I'm also wondering how this is going to interact with providing different types of weight interpretations (like advanced clip encode), but adding a better mechanism for those might belong in another PR.

asagi4 avatar Nov 19 '24 20:11 asagi4

@asagi4 Thanks for the feedback, I've made encode_from_tokens_scheduled now work if no hooks are present.

Assuming custom weight interpretations are implemented around modifying the cond_stage_model methods, everything should work. I'm not familiar with the inner workings of any custom nodes that do this, so I can't give any good feedback on that front at this time. Chances are that if it is excessively difficult for those custom nodes to do their thing without hacky solutions currently, there should be a better way to do so via a future PR.

Kosinkadink avatar Nov 19 '24 23:11 Kosinkadink

All of my manual testing is complete - the PR can be merged at any time if everything looks fine to @comfyanonymous

Kosinkadink avatar Nov 24 '24 22:11 Kosinkadink

@Kosinkadink If I create a workflow that sets a LoRA hook with no scheduling, it seems that it reloads the hook on every sampling run even if just the sampler seed changes. It causes pretty significant latency. I think it's because the patches get unloaded immediately after sampling even though there's no need to do so.

Is there a way to use the hook mechanism in a way that avoids this at the moment? My quick testing shows generation time increasing from 13.8s to 16.5s when only changing the seed after warmup (sometimes even up to 18s). --highvram reduces the latency quite a bit but still doesn't remove it.

asagi4 avatar Nov 26 '24 21:11 asagi4

A lot of my current optimization was focused on the worst-case scenarios of different hooks needing to be applied to different conditioning, so to prevent any memory issues from cached weights not being cleared, I currently have the model always purge newly registered (i.e. added at sample time) hook patches and clear cached hooked weight calculations.

In cases where there is only a single hook group to apply, I could make it not revert the model to its unhooked state at the end of sampling, so that if nothing gets changed with the hooks/ModelPatcher, it would not need to redo hooked weight application. However, that adds extra complexity that could introduce bugs I don't want to deal with currently - I've been working on this for 3 months, and in its current state it hasn't even been released to be tested by a wide variety of peeps. Once it gets merged and it appears to be working fine in general, I'd be down to add an optimization for that edge case.

Kosinkadink avatar Nov 26 '24 22:11 Kosinkadink

@Kosinkadink Fair. This PR is definitely big enough already.

asagi4 avatar Nov 26 '24 22:11 asagi4

Thanks for the considerable effort it must have taken to make this so readable. That is tremendously helpful.

wbclark avatar Dec 16 '24 22:12 wbclark

@Kosinkadink I've tried your lorahookmasking workflow, using a single checkpoint and one lora file for each Set Clip Hooks node.

While it did manage to perfectly avoid lora bleeding, the generation time is 10x longer than usual for a 12-step 880x768px image:

got prompt
model weight dtype torch.float16, manual cast: None
model_type EPS
Using xformers attention in VAE
Using xformers attention in VAE
Requested to load SDXLClipModel
loaded completely 9.5367431640625e+25 1560.802734375 True
Requested to load SDXLClipModel
loaded completely 9.5367431640625e+25 1560.802734375 True
Requested to load SDXLClipModel
loaded completely 9.5367431640625e+25 1560.802734375 True
Requested to load SDXL
loaded completely 9.5367431640625e+25 4897.0483474731445 True
100%|█████████████████████████████████| 12/12 [01:43<00:00,  8.65s/it]
Requested to load AutoencoderKL
loaded completely 9.5367431640625e+25 159.55708122253418 True
Prompt executed in 132.59 seconds

Also, RAM usage skyrockets immediately, as shown in the attached RAM usage screenshot.

The model is PrefectPonyV2XL (6.4GB) and the loras have been shrunk to 70 MB each, but this didn't help at all. Is there an explanation for why this is so slow? The workflow is exactly your original one, but loading 2 loras instead of 1 lora and 1 checkpoint as lora.

andreszs avatar Dec 20 '24 14:12 andreszs

#6537 Can anyone look into this? The GGUF version of Flux is not supported.

Jeremy-ttt avatar Jan 31 '25 21:01 Jeremy-ttt

@Kosinkadink do you have any guidance/blogs/emails/angry notes about how PatcherInjection is meant to work? I peeked at your ADE code, but it seems pretty focused on ADE things where this feature is concerned. I'm looking for more generalized guidance on "what should my inject/eject functions do/not do?", especially where one might be putting layers/modules into the model and then removing them after a step or after a whole run...

krahnikblis avatar Jul 13 '25 22:07 krahnikblis