HunyuanImage-3.0 Release
Feature Idea
This is the URL: https://huggingface.co/tencent/HunyuanImage-3.0
Existing Solutions
No response
Other
No response
We need MultiGPU, desperately
@comfyanonymous will you support this model? It would be great to see how you will handle it.
I would say, even if it's supported in ComfyUI, it will still be very hard for most users to run it locally.
80B and 170GB, OMG
Technically, it's an ~~autoregressive~~ LLM ~~(next token prediction)~~* and it's a MoE (mixture of experts), meaning you can keep all the attention in VRAM and offload as few experts as necessary to CPU. The bottleneck becomes RAM size and VRAM-RAM bandwidth, and that's not too bad on, for example, a 5090, which supports PCIe 5.0 x16.
If one has multiple GPUs, it's even better. People run dense models like LLaMA 3.3 70B just fine on 2x24 GB VRAM setups at 4-bit quantization. Lacking an RTX 6000 PRO 96GB, LocalLLaMA people build multi-GPU setups like 4x3090, which is quite doable if you have been collecting GPUs since the start of the AI craze a few years ago.
A large share of ComfyUI consumer users can run 80B models (especially when they are MoEs); all we need is MultiGPU support.
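For what it's worth, here is a minimal sketch of what that kind of split could look like today with plain transformers + accelerate. The memory budgets are illustrative assumptions, and this is naive per-layer placement, not a real ComfyUI MultiGPU integration:

```python
# Minimal sketch: spread an ~80B MoE across two 24 GB GPUs plus system RAM
# using accelerate's device_map. Memory budgets are illustrative, not tuned.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tencent/HunyuanImage-3.0",
    trust_remote_code=True,                 # the repo ships custom modeling code
    torch_dtype=torch.bfloat16,
    device_map="auto",                      # accelerate decides per-layer placement
    max_memory={0: "22GiB", 1: "22GiB", "cpu": "160GiB"},  # leave headroom on each GPU
    offload_folder="offload",               # spill anything that still doesn't fit to disk
)
```

Note that `device_map="auto"` does not implement the attention-in-VRAM / experts-in-RAM split described above, so a proper MoE-aware offloader or MultiGPU node would still help a lot.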
- They updated the page description, and it's in fact a BAGEL-like autoregressive LLM + diffusion hybrid in a single model, making it even faster for image generation!
Yeah, with GGUFs (I will try to quant it, I don't know how yet, but I will get it to work 😅) this is 100% possible to run with CPU offload (with 64 GB RAM) even on a single GPU, but the more GPUs and the more VRAM the better (;
They will publish a smaller distilled version later, no need to hurry.
This looks like a model that people are going to try once, wait more than 10 minutes for a 1MP image and then never use again so it's probably not worth implementing.
@zwukong Having the bigger model is still better; distills usually destroy image quality and especially prompt following.
@comfyanonymous Have you tried it in the cloud?
Anyway, it seems like we need quants in any case. Hopefully, people like the llama.cpp or Nunchaku folks will do FP4. (Then it would be an external wrapper.)
haha, so big
> This looks like a model that people are going to try once, wait more than 10 minutes for a 1MP image and then never use again so it's probably not worth implementing.
Being a 13B active MoE, shouldn't it be as fast as Flux with CFG?
> This looks like a model that people are going to try once, wait more than 10 minutes for a 1MP image and then never use again so it's probably not worth implementing.
You raised 17 million fucking dollars and can't be assed to actually use the money to support the latest oss models?
The Qwen-image model still seems to maintain reasonably high quality even when quantized to Q3.
I speculate that even a large 80B model, if it has strong performance and the environment is set up with proper optimization and acceleration LoRAs, could achieve generation within a realistic timeframe, given sufficient RAM and 32GB or 24GB+ VRAM. Hopefully.
> even when quantized to Q3.
How did you check this? 👀
I don't really understand this excuse of "too large, probably not worth implementing". If we can run a 21 GB FP8 model just fine on a single GPU with 16 GB of VRAM, why can't we run this? Anyway, here are the estimated sizes for each quant:
- FP16: 170 GB
- FP8: 85 GB
- Q6: 63 GB
- Q5: 53 GB
- Q4: 42 GB
- Q3: 32 GB
So at the very least we will be able to run Q4 and Q5. But I can run FP8, because waiting is not an issue for me and quality is everything: if Qwen Image takes approximately 2 minutes at FP8 on 16 GB of VRAM, this may take approximately 5 to 8 minutes, and I can wait for that. Also, since this is an LLM, it has a higher understanding of the world than diffusion-based models; the point here is its superior understanding, so I think it's useful to the community.
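As a rough sanity check on those estimates, here is the back-of-the-envelope arithmetic, assuming ~80B total parameters and ignoring embeddings/VAE overhead (GGUF quants carry extra scale bits, hence the fractional bit widths):

```python
# Back-of-the-envelope weight sizes for an ~80B-parameter model at common bit widths.
PARAMS = 80e9
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "Q6": 6.5, "Q5": 5.5, "Q4": 4.5, "Q3": 3.5}

for name, bits in BITS_PER_WEIGHT.items():
    size_gb = PARAMS * bits / 8 / 1e9
    print(f"{name}: ~{size_gb:.0f} GB")
# -> BF16 ~160, FP8 ~80, Q6 ~65, Q5 ~55, Q4 ~45, Q3 ~35 -- the same ballpark as the list above
```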
170 GB is BF16, you need to shift the right column down by one position...
Even with this correction, the 42 GB of Q4 weights fully fit inside the combined 48 GB of VRAM of two 3090s/4090s. The context for single-turn image generation should also fit, even more so with a 5090 plus a second previous-generation GPU.
> 170 GB is BF16, you need to shift the right column down by one position...
Thank you for pointing that out, I fixed it. It still needs to be implemented; due to its MoE architecture it will be more efficient than if all parameters were activated.
Tonight I’m going to fix the FSDP model patcher so it won’t cause a memory leak. If you release the model’s forward and nn module class, I might convert it to use USP and FSDP.
https://github.com/komikndr/raylight/tree/rebuilding_fsdp_again
It might become a chicken-and-egg problem, since I’m basically downstream of ComfyUI. If ComfyUI produces the model class and its forward function, I’ll convert it to be Raylight-compatible. But since ComfyUI hasn’t released it yet, it’s still a 50/50 situation.
But then again, that model is too big; you'd need four 16 GB or 24 GB cards.
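For context, the kind of conversion Raylight would need looks roughly like this. This is a generic PyTorch FSDP sketch for inference sharding, not the actual Raylight patcher, and `block_cls` is exactly the piece that would come from the ComfyUI model class once it exists:

```python
# Generic sketch: shard a model's transformer blocks across GPUs with PyTorch FSDP
# so each rank holds only a slice of the weights during inference.
import functools
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

def shard_for_inference(model: nn.Module, block_cls: type) -> FSDP:
    """Wrap a model so its transformer blocks are sharded across ranks.

    Assumes torch.distributed is already initialized (e.g. via torchrun) and that
    `block_cls` is the transformer block class exposed by the (future) ComfyUI model.
    """
    wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={block_cls},
    )
    return FSDP(
        model,
        auto_wrap_policy=wrap_policy,
        device_id=torch.cuda.current_device(),
        limit_all_gathers=True,  # keep peak memory down during the forward pass
    )
```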
@comfyanonymous @comfyui-wiki I managed to get the model running locally via transformers with CPU offload on a single 5090 + 170 GB of RAM.
I get around 15 s/it with the full model, but since the transformers CPU offload implementation is horrendous, it could be a lot faster with proper blockswap.
I get good images at 25 steps. The output quality is much, much higher than Qwen Image: more aesthetic, and with better prompt adherence.
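For anyone who wants to reproduce this, the setup is roughly the following. This is based on the model card's loading pattern; `generate_image` and its arguments come from the repo's custom code, so treat the exact signature as an assumption, and the memory budgets are illustrative:

```python
# Rough sketch of the single-GPU + CPU-offload setup described above.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tencent/HunyuanImage-3.0",
    trust_remote_code=True,                    # image generation lives in the repo's custom code
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "30GiB", "cpu": "165GiB"},  # one 32 GB 5090 + ~170 GB of system RAM
)

# generate_image() is the repo's custom entry point -- its name and arguments are
# taken from the model card and may change.
image = model.generate_image(prompt="A cinematic photo of a red fox at dawn")
image.save("fox.png")
```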
What are your RAM frequency/channels, CPU, and PCIe bandwidth?
Fully consumer system: dual-channel DDR5 running at 6000 MHz (79 GB/s read bandwidth), a 9950X, and the GPU running at PCIe Gen5 x16 on an X870E motherboard.
I wonder if it would be possible to add 8-/4-bit out-of-the-box quantization with bitsandbytes and accelerate, just like ByteDance's Bagel does successfully in https://github.com/ByteDance-Seed/Bagel/blob/7026cfa0a4df274460d0b0b990117398a4ec6fca/app.py#L115-L122 (also see ComfyUI-Bagel).
On top of that, NF4 quantization will create a folder with the quantized 4-bit model, and it is simple to upload it to Hugging Face for people to download quickly!
I tried load_and_quantize_model to NF4 with bitsandbytes, but the saved model size is the same as the original for some reason. Maybe it has too much custom code.
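If the custom modeling code cooperates, the transformers-native path would look something like this. This is a sketch under the assumption that the custom layers are visible to the quantizer; whether the 4-bit serialization actually shrinks the saved checkpoint here is exactly the open question above:

```python
# Sketch: on-the-fly NF4 quantization via transformers + bitsandbytes, then saving
# the quantized weights so others can download a smaller checkpoint.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "tencent/HunyuanImage-3.0",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto",
)

# Saving 4-bit bitsandbytes weights needs a recent transformers/bitsandbytes; if the
# custom modules get skipped by the quantizer, the saved size will not shrink.
model.save_pretrained("HunyuanImage-3.0-nf4")
```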
> This looks like a model that people are going to try once, wait more than 10 minutes for a 1MP image and then never use again so it's probably not worth implementing.
@comfyanonymous So you will not support this model?
It's a very inefficient model that doesn't really do anything special and the license on the code sucks.
The only way this is getting implemented is if they do it themselves, change the license on the code, or release a better model.
Your comments have raised some concerns for me regarding the future direction of ComfyUI.
It seems likely that future Sora2-class models, and others that enable advanced text-image interaction, will be what we might call "world models," built on vision LLMs with tens or hundreds of billions of parameters. I feel that Hunyuanimage-3.0 is also aiming to be such a world model, rather than just a simple image generation and editing tool. The images posted on X, which suggest an understanding of mathematical concepts, seem to be evidence of this.
If the reason for not supporting it is simply that it's "huge and inefficient," it leads to a concern that future, more capable (and likely equally huge) world models might not be supported either.
However, you also stated you would consider implementing it if "they change the license on the code or they release a better model." The roadmap for Hunyuanimage-3.0 includes "Image-to-Image Generation" and "Multi-turn Interaction." If the development team implements these features and their performance proves it to be a "better model," can I assume that you would then be willing to consider supporting it in ComfyUI?
Go implement it yourself and you'll see how little people actually care about this model.
If you want "an understanding of mathematical concepts" you can just stick an LLM to enhance the prompt of any of the popular diffusion models and you will get better results than hunyuan image 3.0
If someone releases a big model that is good for its size I will implement it.
> Your comments have raised some concerns for me regarding the future direction of ComfyUI.
> It seems likely that future Sora2-class models, and others that enable advanced text-image interaction, will be what we might call "world models," built on vision LLMs with tens or hundreds of billions of parameters. I feel that Hunyuanimage-3.0 is also aiming to be such a world model, rather than just a simple image generation and editing tool. The images posted on X, which suggest an understanding of mathematical concepts, seem to be evidence of this.
> If the reason for not supporting it is simply that it's "huge and inefficient," it leads to a concern that future, more capable (and likely equally huge) world models might not be supported either.
> However, you also stated you would consider implementing it if "they change the license on the code or they release a better model." The roadmap for Hunyuanimage-3.0 includes "Image-to-Image Generation" and "Multi-turn Interaction." If the development team implements these features and their performance proves it to be a "better model," can I assume that you would then be willing to consider supporting it in ComfyUI?
You are absolutely correct. I am not sure if Comfy has even used this model yet. I asked for an image explaining how to take cinematic fashion shots, and all of the text was self-generated by the model. It's an open-source GPT Image 1. I also think that future models will be like this. They will release vLLM support this month.