Will a model be released for FLUX.2?
Checklist
- [ ] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/mit-han-lab/nunchaku/discussions/new/choose. Otherwise, it will be closed.
- [ ] 2. I will do my best to describe the issue in English.
Motivation
As the title says.
Related resources
No response
+1
+1, waiting for Flux.2 and Z-Image support.
Z-Image is much better than Flux.2, so we should prioritize adapting to it and enjoy sub-second generation.
> Z-Image is much better than Flux.2, so we should prioritize adapting to it and enjoy sub-second generation.
I don't believe one is better than the other, but Flux.2 has a much better understanding of subjects and the different ways people and objects can look. People made in Z-Image look too perfect. Both look great, but getting Flux.2 smaller makes more sense, as Z-Image is already small and fast.
I agree. Also, focusing on a turbo-distilled model with relatively few parameters seems odd; it's already incredibly fast and runs on mid-range consumer cards.
Flux.2 FP4/INT4 would theoretically allow for zero/minimal offloading on 24 GB cards like the 3090/4090.
> Z-Image is much better than Flux.2, so we should prioritize adapting to it and enjoy sub-second generation.
> I don't believe one is better than the other, but Flux.2 has a much better understanding of subjects and the different ways people and objects can look. People made in Z-Image look too perfect. Both look great, but getting Flux.2 smaller makes more sense, as Z-Image is already small and fast.
Can't agree more.
> I agree. Also, focusing on a turbo-distilled model with relatively few parameters seems odd; it's already incredibly fast and runs on mid-range consumer cards.
> Flux.2 FP4/INT4 would theoretically allow for zero/minimal offloading on 24 GB cards like the 3090/4090.
You still need to offload even at 4-bit with both the text encoder and the transformer; it takes over 32 GB in my testing, so you still have to call pipe.enable_model_cpu_offload(). This is just with bitsandbytes, though, which is not as good as Nunchaku.
You could compute the prompt embeds separately, though, as the transformer alone comes in at around 16 GB at 4-bit.
IMO the text_encoder is a bit oversized for FLUX.2-dev given the results it delivers.
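For context, here is roughly what that bitsandbytes 4-bit plus CPU-offload setup looks like in diffusers. This is a minimal sketch, not a tested recipe: it assumes a recent diffusers release with FLUX.2 pipeline support and the `PipelineQuantizationConfig` API, and the model id and generation kwargs are illustrative.

```python
# Minimal sketch, not a tested recipe: assumes a recent diffusers release with
# FLUX.2 pipeline support and bitsandbytes installed; names/kwargs may differ.
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

model_id = "black-forest-labs/FLUX.2-dev"  # assumed repo id

# Quantize the two heavy components (transformer + text encoder) to NF4.
quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16,
    },
    components_to_quantize=["transformer", "text_encoder"],
)

pipe = DiffusionPipeline.from_pretrained(
    model_id,
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
# Even at 4-bit, transformer + text encoder together exceed 24 GB,
# so CPU offload of whole sub-models is still needed.
pipe.enable_model_cpu_offload()

image = pipe(
    "a red fox standing in fresh snow, golden hour",
    num_inference_steps=28,
    guidance_scale=4.0,
).images[0]
image.save("fox.png")
```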
> I agree. Also, focusing on a turbo-distilled model with relatively few parameters seems odd; it's already incredibly fast and runs on mid-range consumer cards. Flux.2 FP4/INT4 would theoretically allow for zero/minimal offloading on 24 GB cards like the 3090/4090.
> You still need to offload even at 4-bit with both the text encoder and the transformer; it takes over 32 GB in my testing, so you still have to call pipe.enable_model_cpu_offload(). This is just with bitsandbytes, though, which is not as good as Nunchaku.
> You could compute the prompt embeds separately, though, as the transformer alone comes in at around 16 GB at 4-bit.
> IMO the text_encoder is a bit oversized for FLUX.2-dev given the results it delivers.
Is this with the FP8 or FP16 text encoder?
> I agree. Also, focusing on a turbo-distilled model with relatively few parameters seems odd; it's already incredibly fast and runs on mid-range consumer cards. Flux.2 FP4/INT4 would theoretically allow for zero/minimal offloading on 24 GB cards like the 3090/4090.
> You still need to offload even at 4-bit with both the text encoder and the transformer; it takes over 32 GB in my testing, so you still have to call pipe.enable_model_cpu_offload(). This is just with bitsandbytes, though, which is not as good as Nunchaku. You could compute the prompt embeds separately, though, as the transformer alone comes in at around 16 GB at 4-bit. IMO the text_encoder is a bit oversized for FLUX.2-dev given the results it delivers.
> Is this with the FP8 or FP16 text encoder?
I've tried 4-bit and 8-bit on the GPU and bfloat16 on the CPU. It's more that it's comparable to the Qwen 2.5 encoder used in Qwen-Image, just much more massive without much noticeable difference. And as people are mentioning, Z-Image also uses Qwen 3 for the text encoder to great effect, and that's a much smaller model. So I'm not sure why they are using such a heavy Mistral LLM.
I'm not an expert on all these details, just going by how things seem from my minimal testing.
> Z-Image is much better than Flux.2, so we should prioritize adapting to it and enjoy sub-second generation.
But Z-image is already pretty fast and pretty easy to run even on mid-range hardware. I've seen people on laptops run it well. In my opinion, Nunchaku for Flux.2 makes much more sense since it's almost impossible to run it on mid-range hardware without very solid quantization.
Considering that the purpose of the Nunchaku project is to enable models that are difficult to run on consumer GPUs, and that Z-Image can already run on most consumer GPUs, Flux.2 seems more urgent.
I'll go for Z-Image accelerated via Nunchaku. It would be awesome.
+1 for Z-Image! I've made a feature request: #814
> Considering that the purpose of the Nunchaku project is to enable models that are difficult to run on consumer GPUs, and that Z-Image can already run on most consumer GPUs, Flux.2 seems more urgent.
That's not the only goal. You can quantize to 4-bit with many other methods (to get a similar size), but Nunchaku's method preserves a look very close to 16-bit and gives roughly 3x faster inference!
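For reference, this is roughly how the existing Flux.1-dev SVDQuant checkpoints are used with the nunchaku package today; a Flux.2 release would presumably follow the same pattern. The sketch below is based on older nunchaku examples; the repo ids may have moved to a different Hugging Face organization, and any Flux.2 names would be pure assumptions, so only the Flux.1 path is shown.

```python
# Existing Flux.1-dev + Nunchaku (SVDQuant INT4) usage, roughly as in the
# nunchaku examples; repo ids may have moved to a different HF organization.
import torch
from diffusers import FluxPipeline
from nunchaku import NunchakuFluxTransformer2dModel

# Load the pre-quantized INT4 (SVDQuant) transformer.
transformer = NunchakuFluxTransformer2dModel.from_pretrained(
    "mit-han-lab/svdq-int4-flux.1-dev"
)

# Drop it into the standard diffusers pipeline in place of the bf16 transformer.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    "a red fox standing in fresh snow, golden hour",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("fox.png")
```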
We are waiting for Flux.2 Nunchaku!