Slow generation times in Flux when using loras ( fixed by using GGUF models or XLabs loras )
Actual Behavior
Using loras with Flux is very slow. This is independent of the lora size.
But performance is good with loras if one of these conditions is met:
- XLabs loras are used
- GGUF models are used ( even the larger Q8_0 )
- --reserve-vram 1.2 is used AND only a single lora is used
More details in the previous thread: https://github.com/comfyanonymous/ComfyUI/issues/4618
Initially, I thought this issue was related to lora size, but that is not the case.
These conditions are independent of lora size:
For instance, --reserve-vram 1.2 will work for one 1.28 GB lora, but not for two 19 MB loras.
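For readers unfamiliar with the flag: --reserve-vram N asks ComfyUI to keep roughly N GB of VRAM free, instead of letting the model loader fill the whole card. A minimal sketch of the idea ( the function below is hypothetical, not ComfyUI's real internals ):

```python
GIB = 1024 ** 3

def loader_budget(total_vram_bytes: int, reserve_gb: float) -> int:
    """VRAM the model loader is allowed to fill; the reserved part stays free
    for spikes during sampling (attention buffers, lora patching, ...)."""
    return max(0, total_vram_bytes - int(reserve_gb * GIB))

# Example: an 8 GB card launched with --reserve-vram 1.2
print(loader_budget(8 * GIB, 1.2) / GIB)  # ~6.8 GiB left for model weights
```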
Other
- GTX 1070 ( 8GB )
- 32 GB RAM
- Windows 10
- pytorch version: 2.3.1+cu121
In case it helps, @city96 made some observations:
- about XLabs loras using one merged qkv key, instead of separate q/k/v keys:
https://github.com/city96/ComfyUI-GGUF/issues/33#issuecomment-2310735260
- about the way GGUF models handle multiple loras:
The weights are kept in CPU memory and are loaded in one by one, instead of being applied on load ( or whatever the default for fp8 is ). Technically it's slower, but I decided to use this method as multiple LoRAs would take up more and more VRAM.
https://github.com/city96/ComfyUI-GGUF/issues/64#issuecomment-2307772571
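To make the difference concrete, here is a minimal sketch of the two strategies described above. This is illustrative only, not the actual code of ComfyUI or ComfyUI-GGUF ( tensor names and shapes are made up ):

```python
import torch
import torch.nn.functional as F

def merge_on_load(weight_gpu, lora_down, lora_up, scale):
    """FP8-style path: bake the LoRA delta into the layer weight once, at load
    time. Inference is then a plain matmul, but every extra lora stacked on top
    permanently costs additional VRAM for its patched weights."""
    delta = scale * (lora_up @ lora_down)              # (out, r) @ (r, in) -> (out, in)
    return weight_gpu + delta.to(weight_gpu.dtype)

def apply_at_runtime(x, weight_gpu, lora_down_cpu, lora_up_cpu, scale):
    """GGUF-style path as described in the quote: the lora matrices stay in
    system RAM; just before the layer runs they are moved to the GPU, added to
    the weight, used once, and discarded. Slower per step, but VRAM use stays
    flat no matter how many loras are stacked."""
    delta = scale * (lora_up_cpu.to(x.device) @ lora_down_cpu.to(x.device))
    patched = weight_gpu + delta.to(weight_gpu.dtype)  # temporary, freed after use
    return F.linear(x, patched)
```

The trade-off mentioned in the quote is visible here: the runtime path pays a small per-layer transfer cost in exchange for VRAM usage that does not grow with the number of loras.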
If this issue only occurs when using gguf models, it would be good to specify 'gguf' in the title.
This way, users experiencing similar issues can clearly find which issue thread they should check for their case.
No, it's the other way round. The issue never happens with GGUF ( and also never happens with XLabs loras ). I changed the title to be more specific.
Here are some of the loras I used to test:
These civitai loras are always fast with GGUF models ( even using large loras and multiple loras ).
They are very slow in Flux FP8/FP16 ( but they are fast if I use --reserve-vram 1.2 and a single lora ).
https://civitai.com/models/672963/phlux-photorealism-with-style-incredible-texture-and-lighting?modelVersionId=753339
https://civitai.com/models/651187?modelVersionId=728513
https://civitai.com/models/639937/boreal-fd-boring-reality-flux-dev-lora
https://civitai.com/models/641309/formcorrector-anatomic?modelVersionId=717317
XLabs loras are always fast ( even using large loras and multiple loras ): https://huggingface.co/XLabs-AI/flux-lora-collection/tree/main
In Summary:
- Flux model + XLabs loras => FAST
- GGUF model + loras => FAST
- Flux FP8 + civitai loras => SLOW
I was able to use the new InstantX ControlNet Canny, with --reserve-vram 2.0 ( although I just tried with the Q4_K_S model ).
So now I tried --reserve-vram 2.0 + Flux FP8 + 2 civitai loras ( 292 + 164 MB ) and it worked, with normal speeds and a normal VRAM usage of 7.4 ~ 7.5 GB ( over 25 steps ).
But is this a permanent solution? If I want to do a more demanding task, do I need to manually test the amount of VRAM I need to reserve? Also, I'm already reserving 25 % of my VRAM. How much margin do I have left?
Maybe keep this issue open, to explore what makes GGUF and XLabs loras work without the need to reserve VRAM. It's not just the model size:
- The GGUF Q8_0 model is larger than FP8, and it works better.
- The XLabs Art lora ( 342 MB ) works better than a 19 MB civitai lora.
If your VRAM is relatively small, around 8 GB, I think it would be better to go with --disable-smart-memory to clear out as much VRAM as possible, rather than using --reserve-vram.
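For context on the difference between the two flags: by default ComfyUI's smart memory keeps models cached in VRAM between runs, while --disable-smart-memory unloads them after every run, so each generation starts with as much free VRAM as possible. A rough sketch of the idea ( not ComfyUI's actual memory manager; names are made up ):

```python
class ModelCacheSketch:
    def __init__(self, smart_memory: bool):
        self.smart_memory = smart_memory
        self.resident = []                 # models currently sitting in VRAM

    def run(self, models, workflow):
        for m in models:
            if m not in self.resident:     # load on demand
                m.to("cuda")
                self.resident.append(m)
        out = workflow(models)
        if not self.smart_memory:          # --disable-smart-memory behaviour
            for m in self.resident:
                m.to("cpu")                # free VRAM after every run
            self.resident.clear()
        return out
```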
--disable-smart-memory by itself does not seem to work ( for the 2-lora workflow ).
Maybe it could work in conjunction with a smaller --reserve-vram amount? ( I did not test that. )
By the way, I made some tests with a more complex workflow ( Flux Q4_K_S + InstantX ControlNet Canny + loras ) ( size 1024 x 1024 and 25 steps ):
- --disable-smart-memory : I didn't try it with this workflow.
- --reserve-vram 2.0 : works fine with an XLabs lora. But with 1 civitai lora ( 292 MB ), performance is very slow, with very high GPU and VRAM usage.
- --reserve-vram 2.4 : works fine with 2 civitai loras ( 292 MB + 164 MB ). VRAM usage is only 6.9 GB. Generation times and GPU activity are normal.
--reserve-vram is like a silver bullet!
So, how much VRAM can be reserved?
EDIT: Apparently we can reserve at least 50% of the VRAM without much slowdown! ( https://github.com/comfyanonymous/ComfyUI/issues/4693#issuecomment-2322631950 )
I'm having the same issue.
It doesn't matter if it's 1 or 2 loras; it depends on the lora. I don't know how to evaluate the type of lora, but this one, for example, causes a super slowdown with the flux dev model: https://civitai.com/models/645425/flux-syntheticanime
I'm using an RTX 4070 Ti Super 16 GB.
I tried --reserve-vram 2.4 and nothing changed.
This lora, for example, works fine: https://civitai.com/models/128568/cyberpunk-anime-style
> It doesn't matter if it's 1 or 2 loras; it depends on the lora. I don't know how to evaluate the type of lora, but this one, for example, causes a super slowdown with the flux dev model: https://civitai.com/models/645425/flux-syntheticanime
I also have that lora and it works for me ( with --reserve-vram ).
This lora is also on Hugging Face, with more info. It says it's a 32-rank lora: https://huggingface.co/dataautogpt3/FLUX-SyntheticAnime
The same creator also has a 16-rank lora. See if that one works for you? https://huggingface.co/dataautogpt3/FLUX-AestheticAnime
There is an issue related to 64-rank loras here: ( https://github.com/comfyanonymous/ComfyUI/issues/4681 )
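For reference, rank is what mostly determines a lora's file size and how much patching work each adapted layer represents: every adapted layer gets two rank-sized matrices. A rough back-of-the-envelope illustration ( the 3072-wide layer is just an assumed example dimension, not taken from these specific files ):

```python
def lora_params_per_layer(in_features: int, out_features: int, rank: int) -> int:
    # A LoRA adds a "down" matrix (rank, in_features) and an "up" matrix
    # (out_features, rank) for each layer it adapts.
    return rank * in_features + out_features * rank

for rank in (16, 32, 64):
    n = lora_params_per_layer(3072, 3072, rank)
    print(f"rank {rank}: {n:,} params per adapted layer (~{n * 2 / 1e6:.2f} MB in fp16)")
```

So, all else equal, doubling the rank roughly doubles the lora's size and the amount of patching work per layer.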
@JorgeR81 Both worked this time, but first I generated a few images without any loras, then enabled one, and they worked ( separately ).
I'll spend more time on testing to be sure it is not random, but I remember seeing you say something about it ( the generation order ). I'm not using any --reserve-vram.
Yes, in a test I did, a single lora worked if I first generated without a lora.
This was without --reserve-vram
I think this works because it allows the models to be loaded one at a time.
So maybe try : no loras => 1 lora => 2 loras
With --reserve-vram ( in the right amount ), I was always able to generate with loras, on the first run.
@JorgeR81 Did not use --reserve-vram, but started with no lora, then 1 lora, then 2, and it worked. I'll try with different loras.
Experiencing the exact same issue.
I'm on a 4090 using
- flux1-dev-Q4_0.gguf
- t5-v1_1-xxl-encoder-Q8_0.gguf
- CUDA 12.4.1, Torch 2.4.1
Without any LoRA I am generating a 720x1280 image at around 1.4 it/s.
However, using a LoRA downloaded from CivitAI (just ~20 MB), the speed drops to about 1.3 s per iteration (≈ 0.77 it/s), which is roughly half the speed.
I don't see that this has been resolved, and it's making ComfyUI completely unusable with any Flux model (even without any loras at all). I have an RTX 4090, the GPU is at 100% and VRAM is at 98%. Nothing else is running?!?!
Use the FP8 models instead of GGUF if your VRAM allows. Using GGUF will lower your inference speed by at least 50%, and even more if you use multiple loras. I learned that the hard way.