Allow loading the same model into memory twice for the refiner, to avoid the speed penalty when using <base> and <refiner>
Feature Idea
When using a LoRA in a prompt, it's necessary to use <base> and <refiner> to limit the LoRA's effect to the base model only:
woman, solo, sitting, chair, blouse, pants, big smile, feet, v, squinting, foreshortening,
masterpiece, best quality, highres, newest, year 2024, absurdres
<base>
<lora:illustrious/style/-Pinkie-_-_Retro_Anime_⭐-_-ILLUSTRIOUS-_-_ILLUSTRIOUS>
<lora:illustrious/style/vixon/Incase_+_Vixon's_Gothic_Neon_+_Disney_for_Illustrious_Style_-_v2-0>
<refiner>
<segment:face,0.2,0.4>
This allows the refiner to run without applying the LoRA. However, if the same model is used for both (the default behavior), there's a speed penalty: the model must be loaded both with and without the LoRA within a single image generation.
After some testing, it was found that making a copy of the model's .safetensors file on the filesystem and selecting the copy as the "Refiner Model" in the "Refiner Param Overrides" completely avoids this speed penalty.
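For reference, a minimal sketch of the manual workaround; the file paths here are hypothetical and should point at your actual models directory:

```python
import shutil

# Hypothetical paths -- adjust to your SwarmUI models directory.
src = "Models/Stable-Diffusion/hyphoria.safetensors"
dst = "Models/Stable-Diffusion/hyphoria-refiner-copy.safetensors"

# Copy the checkpoint so base and refiner resolve to different files,
# which keeps them from sharing one cached (LoRA-patched) model.
shutil.copy2(src, dst)
```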
Context and details are available on Discord, here.
Example settings:
Note that the same Hyphoria model is used for both base and refiner, but loaded from two different files.
Using the same model with <base> and <refiner>: generation_time: 6.09 sec (speed penalty)
Making a copy of the model's .safetensors file on the filesystem, using the copy as the refiner, with <base> and <refiner>: generation_time: 3.97 sec (speed penalty avoided)
This is a feature request to allow loading the same model file into memory twice, separately: once for the base and once for the refiner, to avoid this speed penalty.
Other
No response
After some testing, it appears the dual-model loading happens within Comfy itself, not in Swarm. Maybe Swarm could handle making a copy of the model's .safetensors file at runtime?
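As a rough illustration of that idea (not Swarm's actual code, which is C#; all names and paths here are hypothetical), the runtime handling could look something like:

```python
import os
import shutil
from contextlib import contextmanager

@contextmanager
def duplicated_refiner(model_path: str):
    """Yield a temporary duplicate of the checkpoint so the refiner
    loads from a distinct file, then clean it up afterwards."""
    root, ext = os.path.splitext(model_path)
    dup_path = f"{root}.refinerdup{ext}"
    shutil.copy2(model_path, dup_path)  # costs disk space and copy time
    try:
        yield dup_path
    finally:
        os.remove(dup_path)

# Hypothetical usage inside a generation run:
# with duplicated_refiner("Models/Stable-Diffusion/hyphoria.safetensors") as refiner_path:
#     run_generation(base_model="hyphoria", refiner_model=refiner_path)
```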
Note that the speed benefit was measured on an RTX 5090, so 2 seconds per image might not seem like a large enough number to justify this, but over large batches the savings add up.
Did some testing: Comfy tries to be smart about caching and avoids loading the model twice, but then has to apply the LoRA twice -- you can't even add a second model loader node; it outsmarts that and still only loads the model once. Duplicating the model file is enough to break the cache. There might be a way to break the cache without a duplicate file, though... or it could be reported to Comfy as a feature request to allow intentional control over the cache rather than hiding it in the internals.
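One untested possibility, assuming Comfy's cache is keyed on the file path: a hardlink (or symlink) gives a second path to the same bytes on disk, which might break the cache without actually duplicating the file:

```python
import os

# Hypothetical paths -- both entries point at the same on-disk data.
src = "Models/Stable-Diffusion/hyphoria.safetensors"
alias = "Models/Stable-Diffusion/hyphoria-alias.safetensors"

# A hardlink is just a second directory entry for the same file: no
# extra disk space, no copy time. If the cache key is the path string,
# this may be enough to force a second in-memory load.
if not os.path.exists(alias):
    os.link(src, alias)  # os.symlink(src, alias) is an alternative
```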