Will a model be released for FLUX.2?
Checklist
- [ ] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/mit-han-lab/nunchaku/discussions/new/choose. Otherwise, it will be closed.
- [ ] 2. I will do my best to describe the issue in English.
Motivation
As the title says.
Related resources
No response
+1
+1, waiting for Flux.2 and Z-Image support.
Z-Image is much better than Flux.2, so we should prioritize adapting to it and enjoy sub-second generation.
> Z-Image is much better than Flux.2, so we should prioritize adapting to it and enjoy sub-second generation.
I don't believe one is better than the other, but Flux.2 has a much better understanding of subjects and the different ways people and objects can look. People made in Z-Image look too perfect. Both look great, but getting Flux.2 smaller makes more sense, as Z-Image is already small and fast.
I agree. Also, focusing on a turbo-distilled model with relatively few parameters seems odd; it's already incredibly fast and runs on mid-range consumer cards.
Flux.2 FP4/INT4 would theoretically allow for zero/minimal offloading on 24 GB cards like the 3090/4090.
> Z-Image is much better than Flux.2, so we should prioritize adapting to it and enjoy sub-second generation.
> I don't believe one is better than the other, but Flux.2 has a much better understanding of subjects and the different ways people and objects can look. People made in Z-Image look too perfect. Both look great, but getting Flux.2 smaller makes more sense, as Z-Image is already small and fast.
Can't agree more.
> I agree. Also, focusing on a turbo-distilled model with relatively few parameters seems odd; it's already incredibly fast and runs on mid-range consumer cards.
> Flux.2 FP4/INT4 would theoretically allow for zero/minimal offloading on 24 GB cards like the 3090/4090.
You still need to offload even at 4-bit with both the text encoder and the transformer; it takes over 32 GB in my testing, so you still have to call pipe.enable_model_cpu_offload(). This is just with bitsandbytes, though, which is not as good as Nunchaku.
You could compute the prompt embeds separately, though, as the transformer alone comes in at around 16 GB at 4-bit.
IMO the text_encoder is a bit oversized for FLUX.2-dev given the results it delivers.
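For context, here is roughly what that bitsandbytes 4-bit plus CPU-offload setup looks like in diffusers. This is a minimal sketch, not a tested recipe: it assumes a recent diffusers release with FLUX.2 pipeline support and the `PipelineQuantizationConfig` API, and the model id and generation kwargs are illustrative.

```python
# Minimal sketch, not a tested recipe: assumes a recent diffusers release with
# FLUX.2 pipeline support and bitsandbytes installed; names/kwargs may differ.
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

model_id = "black-forest-labs/FLUX.2-dev"  # assumed repo id

# Quantize the two heavy components (transformer + text encoder) to NF4.
quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16,
    },
    components_to_quantize=["transformer", "text_encoder"],
)

pipe = DiffusionPipeline.from_pretrained(
    model_id,
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
# Even at 4-bit, transformer + text encoder together exceed 24 GB,
# so CPU offload of whole sub-models is still needed.
pipe.enable_model_cpu_offload()

image = pipe(
    "a red fox standing in fresh snow, golden hour",
    num_inference_steps=28,
    guidance_scale=4.0,
).images[0]
image.save("fox.png")
```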
> I agree. Also, focusing on a turbo-distilled model with relatively few parameters seems odd; it's already incredibly fast and runs on mid-range consumer cards. Flux.2 FP4/INT4 would theoretically allow for zero/minimal offloading on 24 GB cards like the 3090/4090.
> You still need to offload even at 4-bit with both the text encoder and the transformer; it takes over 32 GB in my testing, so you still have to call pipe.enable_model_cpu_offload(). This is just with bitsandbytes, though, which is not as good as Nunchaku.
> You could compute the prompt embeds separately, though, as the transformer alone comes in at around 16 GB at 4-bit.
> IMO the text_encoder is a bit oversized for FLUX.2-dev given the results it delivers.
Is this with the FP8 or FP16 text encoder?
> I agree. Also, focusing on a turbo-distilled model with relatively few parameters seems odd; it's already incredibly fast and runs on mid-range consumer cards. Flux.2 FP4/INT4 would theoretically allow for zero/minimal offloading on 24 GB cards like the 3090/4090.
> You still need to offload even at 4-bit with both the text encoder and the transformer; it takes over 32 GB in my testing, so you still have to call pipe.enable_model_cpu_offload(). This is just with bitsandbytes, though, which is not as good as Nunchaku. You could compute the prompt embeds separately, though, as the transformer alone comes in at around 16 GB at 4-bit. IMO the text_encoder is a bit oversized for FLUX.2-dev given the results it delivers.
> Is this with the FP8 or FP16 text encoder?
I've tried 4-bit and 8-bit on the GPU and bfloat16 on the CPU. It's more that it's comparable to the Qwen 2.5 encoder used in Qwen-Image, just much more massive without much noticeable difference. And as people are mentioning, Z-Image also uses Qwen 3 for the text encoder to great effect, and that's a much smaller model. So I'm not sure why they are using such a heavy Mistral LLM.
I'm not an expert on all these details, just going by how things seem from my minimal testing.
> Z-Image is much better than Flux.2, so we should prioritize adapting to it and enjoy sub-second generation.
But Z-image is already pretty fast and pretty easy to run even on mid-range hardware. I've seen people on laptops run it well. In my opinion, Nunchaku for Flux.2 makes much more sense since it's almost impossible to run it on mid-range hardware without very solid quantization.
Considering that the purpose of the Nunchaku project is to enable models that are difficult to run on consumer GPUs, and that Z-Image can already run on most consumer GPUs, Flux.2 seems more urgent.
I'll go for Z-Image accelerated via Nunchaku. It would be awesome.
+1 for Z-Image! I've made a feature request: #814
> Considering that the purpose of the Nunchaku project is to enable models that are difficult to run on consumer GPUs, and that Z-Image can already run on most consumer GPUs, Flux.2 seems more urgent.
That's not the only goal. You can quantize to 4-bit with many other methods (to get a similar size), but Nunchaku's method preserves a look very close to 16-bit and gives roughly 3x faster inference!
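For reference, this is roughly how the existing Flux.1-dev SVDQuant checkpoints are used with the nunchaku package today; a Flux.2 release would presumably follow the same pattern. The sketch below is based on older nunchaku examples; the repo ids may have moved to a different Hugging Face organization, and any Flux.2 names would be pure assumptions, so only the Flux.1 path is shown.

```python
# Existing Flux.1-dev + Nunchaku (SVDQuant INT4) usage, roughly as in the
# nunchaku examples; repo ids may have moved to a different HF organization.
import torch
from diffusers import FluxPipeline
from nunchaku import NunchakuFluxTransformer2dModel

# Load the pre-quantized INT4 (SVDQuant) transformer.
transformer = NunchakuFluxTransformer2dModel.from_pretrained(
    "mit-han-lab/svdq-int4-flux.1-dev"
)

# Drop it into the standard diffusers pipeline in place of the bf16 transformer.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    "a red fox standing in fresh snow, golden hour",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("fox.png")
```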
We are waiting for Flux.2 Nunchaku!