Slow generation times in Flux when using loras ( fixed by using GGUF models or XLabs loras )
Actual Behavior
Using loras with Flux is very slow. This is independent of the lora size.
But performance is good with loras if one of these conditions is met:
- XLabs loras are used
- GGUF models are used ( even the larger Q8_0 )
- --reserve-vram 1.2 is used AND only a single lora is used
More details in the previous thread: https://github.com/comfyanonymous/ComfyUI/issues/4618
Initially, I thought this issue was related to lora size, but that is not the case.
These conditions are independent of lora size:
For instance, --reserve-vram 1.2 will work for one 1.28 GB lora, but not for two 19 MB loras.
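For readers unfamiliar with the flag: --reserve-vram N asks ComfyUI to keep roughly N GB of VRAM free, instead of letting the model loader fill the whole card. A minimal sketch of the idea ( the function below is hypothetical, not ComfyUI's real internals ):

```python
GIB = 1024 ** 3

def loader_budget(total_vram_bytes: int, reserve_gb: float) -> int:
    """VRAM the model loader is allowed to fill; the reserved part stays free
    for spikes during sampling (attention buffers, lora patching, ...)."""
    return max(0, total_vram_bytes - int(reserve_gb * GIB))

# Example: an 8 GB card launched with --reserve-vram 1.2
print(loader_budget(8 * GIB, 1.2) / GIB)  # ~6.8 GiB left for model weights
```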
Other
- GTX 1070 ( 8GB )
- 32 GB RAM
- Windows 10
- pytorch version: 2.3.1+cu121
In case it helps, @city96 made some observations:
- about XLabs loras using one merged qkv key, instead of separate q/k/v keys:
https://github.com/city96/ComfyUI-GGUF/issues/33#issuecomment-2310735260
- about the way GGUF models handle multiple loras:
The weights are kept in CPU memory and are loaded in one by one, instead of being applied on load ( or whatever the default for fp8 is ). Technically it's slower, but I decided to use this method as multiple LoRAs would take up more and more VRAM.
https://github.com/city96/ComfyUI-GGUF/issues/64#issuecomment-2307772571
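To make the difference concrete, here is a minimal sketch of the two strategies described above. This is illustrative only, not the actual code of ComfyUI or ComfyUI-GGUF ( tensor names and shapes are made up ):

```python
import torch
import torch.nn.functional as F

def merge_on_load(weight_gpu, lora_down, lora_up, scale):
    """FP8-style path: bake the LoRA delta into the layer weight once, at load
    time. Inference is then a plain matmul, but every extra lora stacked on top
    permanently costs additional VRAM for its patched weights."""
    delta = scale * (lora_up @ lora_down)              # (out, r) @ (r, in) -> (out, in)
    return weight_gpu + delta.to(weight_gpu.dtype)

def apply_at_runtime(x, weight_gpu, lora_down_cpu, lora_up_cpu, scale):
    """GGUF-style path as described in the quote: the lora matrices stay in
    system RAM; just before the layer runs they are moved to the GPU, added to
    the weight, used once, and discarded. Slower per step, but VRAM use stays
    flat no matter how many loras are stacked."""
    delta = scale * (lora_up_cpu.to(x.device) @ lora_down_cpu.to(x.device))
    patched = weight_gpu + delta.to(weight_gpu.dtype)  # temporary, freed after use
    return F.linear(x, patched)
```

The trade-off mentioned in the quote is visible here: the runtime path pays a small per-layer transfer cost in exchange for VRAM usage that does not grow with the number of loras.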
If this issue only occurs when using gguf models, it would be good to specify 'gguf' in the title.
This way, users experiencing similar issues can clearly find which issue thread they should check for their case.
No, it's the other way round. The issue never happens with GGUF ( and also never happens with XLabs loras ). I changed the title to be more specific.
Here are some of the loras I used to test:
These civitai loras are always fast with GGUF models ( even using large loras and multiple loras ).
They are very slow in Flux FP8/FP16 ( but they are fast if I use --reserve-vram 1.2 and a single lora ).
https://civitai.com/models/672963/phlux-photorealism-with-style-incredible-texture-and-lighting?modelVersionId=753339
https://civitai.com/models/651187?modelVersionId=728513
https://civitai.com/models/639937/boreal-fd-boring-reality-flux-dev-lora
https://civitai.com/models/641309/formcorrector-anatomic?modelVersionId=717317
XLabs loras are always fast ( even using large loras and multiple loras ): https://huggingface.co/XLabs-AI/flux-lora-collection/tree/main
In Summary:
- Flux model + XLabs loras => FAST
- GGUF model + loras => FAST
- Flux FP8 + civitai loras => SLOW
I was able to use the new InstantX ControlNet Canny, with --reserve-vram 2.0 ( although I just tried with the Q4_K_S model ).
So now I tried --reserve-vram 2.0 + Flux FP8 + 2 civitai loras ( 292 + 164 MB ) and it worked, with normal speeds and a normal VRAM usage of 7.4 ~ 7.5 GB ( over 25 steps ).
But is this a permanent solution? If I want to do a more demanding task, do I need to manually test the amount of VRAM I need to reserve? Also, I'm already reserving 25 % of my VRAM. How much margin do I have left?
Maybe keep this issue open, to explore what makes GGUF and XLabs loras work without the need to reserve VRAM. It's not just the model size:
- The GGUF Q8_0 model is larger than FP8, and it works better.
- The XLabs Art lora ( 342 MB ) works better than a 19 MB civitai lora.
If your VRAM is relatively small, around 8 GB, I think it would be better to go with --disable-smart-memory to clear out as much VRAM as possible, rather than using --reserve-vram.
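For context on the difference between the two flags: by default ComfyUI's smart memory keeps models cached in VRAM between runs, while --disable-smart-memory unloads them after every run, so each generation starts with as much free VRAM as possible. A rough sketch of the idea ( not ComfyUI's actual memory manager; names are made up ):

```python
class ModelCacheSketch:
    def __init__(self, smart_memory: bool):
        self.smart_memory = smart_memory
        self.resident = []                 # models currently sitting in VRAM

    def run(self, models, workflow):
        for m in models:
            if m not in self.resident:     # load on demand
                m.to("cuda")
                self.resident.append(m)
        out = workflow(models)
        if not self.smart_memory:          # --disable-smart-memory behaviour
            for m in self.resident:
                m.to("cpu")                # free VRAM after every run
            self.resident.clear()
        return out
```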
--disable-smart-memory by itself does not seem to work ( for the 2-lora workflow ).
Maybe it could work in conjunction with a smaller --reserve-vram amount? ( I did not test that. )
By the way, I made some tests with a more complex workflow ( Flux Q4_K_S + InstantX ControlNet Canny + loras ) ( size 1024 x 1024 and 25 steps ):
- --disable-smart-memory : I didn't try it with this workflow.
- --reserve-vram 2.0 : works fine with an XLabs lora. But with 1 civitai lora ( 292 MB ), performance is very slow, with very high GPU and VRAM usage.
- --reserve-vram 2.4 : works fine with 2 civitai loras ( 292 MB + 164 MB ). VRAM usage is only 6.9 GB. Generation times and GPU activity are normal.
--reserve-vram is like a silver bullet!
So, how much VRAM can be reserved?
EDIT: Apparently we can reserve at least 50% of the VRAM without much slowdown! ( https://github.com/comfyanonymous/ComfyUI/issues/4693#issuecomment-2322631950 )
I'm having the same issue.
It doesn't matter if it's 1 or 2 loras; it depends on the lora. I don't know how to evaluate the type of lora, but this one, for example, causes a super slowdown with the flux dev model: https://civitai.com/models/645425/flux-syntheticanime
I'm using an RTX 4070 Ti Super 16 GB.
I tried --reserve-vram 2.4 and nothing changed.
This lora, for example, works fine: https://civitai.com/models/128568/cyberpunk-anime-style
> It doesn't matter if it's 1 or 2 loras; it depends on the lora. I don't know how to evaluate the type of lora, but this one, for example, causes a super slowdown with the flux dev model: https://civitai.com/models/645425/flux-syntheticanime
I also have that lora and it works for me ( with --reserve-vram ).
This lora is also on Hugging Face, with more info. It says it's a 32-rank lora: https://huggingface.co/dataautogpt3/FLUX-SyntheticAnime
The same creator also has a 16-rank lora. See if that one works for you? https://huggingface.co/dataautogpt3/FLUX-AestheticAnime
There is an issue related to 64-rank loras here: ( https://github.com/comfyanonymous/ComfyUI/issues/4681 )
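For reference, rank is what mostly determines a lora's file size and how much patching work each adapted layer represents: every adapted layer gets two rank-sized matrices. A rough back-of-the-envelope illustration ( the 3072-wide layer is just an assumed example dimension, not taken from these specific files ):

```python
def lora_params_per_layer(in_features: int, out_features: int, rank: int) -> int:
    # A LoRA adds a "down" matrix (rank, in_features) and an "up" matrix
    # (out_features, rank) for each layer it adapts.
    return rank * in_features + out_features * rank

for rank in (16, 32, 64):
    n = lora_params_per_layer(3072, 3072, rank)
    print(f"rank {rank}: {n:,} params per adapted layer (~{n * 2 / 1e6:.2f} MB in fp16)")
```

So, all else equal, doubling the rank roughly doubles the lora's size and the amount of patching work per layer.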
@JorgeR81 Both worked this time, but first I generated a few images without any loras, then enabled one, and they worked ( separately ).
I'll spend more time on testing to be sure it is not random, but I remember seeing you say something about it ( the generation order ). I'm not using any --reserve-vram.
Yes, in a test I did, a single lora worked if I first generated without a lora.
This was without --reserve-vram
I think this works because it allows the models to be loaded one at a time.
So maybe try : no loras => 1 lora => 2 loras
With --reserve-vram ( in the right amount ), I was always able to generate with loras, on the first run.
@JorgeR81 Did not use --reserve-vram, but started with no lora, then 1 lora, then 2, and it worked. I'll try with different loras.
Experiencing the exact same issue.
I'm on a 4090 using
- flux1-dev-Q4_0.gguf
- t5-v1_1-xxl-encoder-Q8_0.gguf
- CUDA 12.4.1, Torch 2.4.1
Without any LoRA I am generating a 720x1280 image at around 1.4 it/s.
However, using a LoRA downloaded from CivitAI (just ~20 MB), the speed drops to about 1.3 s per iteration (≈ 0.77 it/s), which is roughly half the speed.
I don't see that this has been resolved, and it's making ComfyUI completely unusable with any Flux model (even without any loras at all). I have an RTX 4090, the GPU is at 100% and VRAM is at 98%. Nothing else is running?!?!
Use the FP8 models instead of GGUF if your VRAM allows. Using GGUF will lower your inference speed by at least 50%, and even more if you use multiple loras. I learned that the hard way.