HunyuanImage-3.0 Release

Open thaivb opened this issue 3 months ago • 78 comments

Feature Idea

Here is the URL: https://huggingface.co/tencent/HunyuanImage-3.0

Existing Solutions

No response

Other

No response

thaivb avatar Sep 28 '25 04:09 thaivb

We need MultiGPU, desperately

kabachuha avatar Sep 28 '25 09:09 kabachuha

@comfyanonymous will you support this model? It would be great to see how you will handle it.

nitinh12 avatar Sep 28 '25 11:09 nitinh12

I would say, even if it's supported in ComfyUI, it will still be very hard for most users to run it locally.

80B and 170GB, OMG

comfyui-wiki avatar Sep 28 '25 13:09 comfyui-wiki

Technically, it's an ~~autoregressive~~ LLM ~~(next token prediction)~~* and it's a MoE (mixture of experts), meaning you can keep all the attention weights in VRAM and offload only as many experts to CPU as you have to. The bottleneck then becomes RAM size and VRAM-RAM communication, which isn't too bad on, for example, a 5090, which supports PCIe 5.0 x16 (a rough estimate follows below).

If you have multiple GPUs, it's even better. People run dense models like LLaMA 3.3 70B just fine on 2x24 GB VRAM setups at 4-bit quantization. Lacking an RTX 6000 PRO with 96 GB, the LocalLLaMA crowd runs multi-GPU setups like 4x3090, which is quite doable if you've been collecting GPUs since the start of the AI craze a few years ago.

A large share of ComfyUI consumer users can run 80B models (especially when they are MoEs); all we need is MultiGPU support.

  • They updated the page description, and it's in fact a BAGEL-like autoregressive-LLM + diffusion single-model hybrid, making it even faster for image generation!
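
As a rough sanity check on that offloading claim, here is a back-of-envelope estimate. The ~13B active-parameter figure and the ~63 GB/s usable bandwidth for PCIe 5.0 x16 are assumptions, and this is the worst case where every active expert has to be streamed from RAM on every step:

```python
# Back-of-envelope, worst case: every active expert streamed over PCIe each step.
# Assumes ~13B active parameters and ~63 GB/s usable PCIe 5.0 x16 bandwidth.
active_params = 13e9
pcie_gb_per_s = 63
for name, bytes_per_param in [("BF16", 2), ("FP8/Q8", 1), ("Q4", 0.5)]:
    gb_moved = active_params * bytes_per_param / 1e9
    print(f"{name}: ~{gb_moved:.1f} GB moved, ~{gb_moved / pcie_gb_per_s:.2f} s/step of transfer time")
```

At 4-bit that is roughly a tenth of a second of bus time per step, so the transfer overhead is tolerable as long as the weights fit in system RAM.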

kabachuha avatar Sep 28 '25 14:09 kabachuha

Yeah, with GGUFs (I will try to quant it, I don't know how yet but I will get it to work 😅) this is 100% possible to run with CPU offload (with 64 GB RAM) even on a single GPU, but the more GPUs and the more VRAM the better (;

wsbagnsv1 avatar Sep 28 '25 16:09 wsbagnsv1

They will publish a smaller distilled version later, no need to hurry.

zwukong avatar Sep 28 '25 16:09 zwukong

This looks like a model that people are going to try once, wait more than 10 minutes for a 1MP image and then never use again so it's probably not worth implementing.

comfyanonymous avatar Sep 28 '25 16:09 comfyanonymous

@zwukong Having the bigger model is still better, distills usually destroy image quality and especially prompt following

kabachuha avatar Sep 28 '25 16:09 kabachuha

@comfyanonymous Have you tried it in the cloud?

kabachuha avatar Sep 28 '25 16:09 kabachuha

Anyway, it seems like we need quants in any case. Hopefully, projects like llama.cpp or Nunchaku will do FP4. (Then it would be an external wrapper.)

kabachuha avatar Sep 28 '25 16:09 kabachuha

haha, so big

zwukong avatar Sep 28 '25 16:09 zwukong

> This looks like a model that people are going to try once, wait more than 10 minutes for a 1MP image and then never use again so it's probably not worth implementing.

Being a 13B active MoE, shouldn't it be as fast as Flux with CFG?

Ulexer avatar Sep 28 '25 19:09 Ulexer

> This looks like a model that people are going to try once, wait more than 10 minutes for a 1MP image and then never use again so it's probably not worth implementing.

You raised 17 million fucking dollars and can't be assed to actually use the money to support the latest oss models?

Jongulo avatar Sep 28 '25 20:09 Jongulo

The Qwen-image model still seems to maintain reasonably high quality even when quantized to Q3.

I speculate that even a large 80B model, if it has strong performance and the environment is set up with proper optimization and acceleration LoRAs, could achieve generation within a realistic timeframe, given sufficient RAM and 32GB or 24GB+ VRAM. Hopefully.

souki202 avatar Sep 29 '25 06:09 souki202

> even when quantized to Q3.

How did you check this? 👀

kabachuha avatar Sep 29 '25 09:09 kabachuha

I don't really understand this excuse of "too large, probably not worth implementing". If we can run a 21 GB FP8 model just fine on a single GPU with 16 GB VRAM, why can't we run this? Anyway, here are the estimated sizes for each quant:

FP16 - 170 GB, FP8 - 85 GB, Q6 - 63 GB, Q5 - 53 GB, Q4 - 42 GB, Q3 - 32 GB

So at least we will be able to run Q4 and Q5. I can even run FP8, because waiting is not an issue for me; quality is everything. If Qwen Image takes roughly 2 minutes at FP8 on 16 GB VRAM, this may take roughly 5 to 8 minutes, and I can wait for that. Also, since this is an LLM, it has a better understanding of the world than diffusion-based models; the point here is its superior understanding, so I think it's useful to the community.
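
As a sanity check, these estimates follow from scaling the ~170 GB 16-bit checkpoint by bits per weight; real GGUF quants carry a little extra overhead, so actual files land slightly higher:

```python
# Rough quant-size estimate: scale the ~170 GB 16-bit checkpoint by the target
# bits per weight. GGUF "Q" formats use slightly more bits than their names
# suggest, so treat these as lower bounds.
bf16_gb = 170
for name, bits in [("FP8", 8), ("Q6", 6), ("Q5", 5), ("Q4", 4), ("Q3", 3)]:
    print(f"{name}: ~{bf16_gb * bits / 16:.0f} GB")
```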

OrangeUnknownCat avatar Sep 29 '25 09:09 OrangeUnknownCat

170 GB is BF16, you need to shift the right column down by one position...

rkfg avatar Sep 29 '25 10:09 rkfg

Even so, the 42 GB of Q4 weights fully fit inside the combined 48 GB of VRAM of two 3090s/4090s. The context for single-turn image generation should also fit, even more so with a 5090 plus a previous-gen second GPU.

kabachuha avatar Sep 29 '25 11:09 kabachuha

> 170 GB is BF16, you need to shift the right column down by one position...

Thank you for pointing that out, fixed it. Still, it needs to be implemented; thanks to its MoE architecture it will be more efficient than a dense model with all parameters activated.

OrangeUnknownCat avatar Sep 29 '25 12:09 OrangeUnknownCat

Tonight I'm going to fix the FSDP model patcher so it won't cause a memory leak. If you release the model's forward and nn.Module class, I might convert it to use USP and FSDP.

https://github.com/komikndr/raylight/tree/rebuilding_fsdp_again

It might become a chicken-and-egg problem, since I’m basically downstream of ComfyUI. If ComfyUI produces the model class and its forward function, I’ll convert it to be Raylight-compatible. But since ComfyUI hasn’t released it yet, it’s still a 50/50 situation.

But then again, that model is too big; you'd need quad 16 GB or 24 GB cards.
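
For context, here is a generic sketch of what wrapping a model in FSDP looks like in plain PyTorch. This is not Raylight's actual code; the wrap policy and threshold are placeholders:

```python
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

def shard_model(model: torch.nn.Module) -> FSDP:
    # One process per GPU, launched via torchrun. Each rank keeps only a shard
    # of the parameters and gathers one wrapped block at a time during forward.
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())
    wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=int(1e8))
    return FSDP(model, auto_wrap_policy=wrap_policy, device_id=torch.cuda.current_device())
```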

komikndr avatar Sep 29 '25 12:09 komikndr

@comfyanonymous @comfyui-wiki I managed to get the model running locally via transformers with CPU offload on a single 5090 + 170 GB of RAM.

I get around 15 s/it with the full model, but since the transformers CPU offload implementation is horrendous, with proper blockswap it could be a lot faster.

I get good images at 25 steps. The output quality is much, much higher than Qwen Image: more aesthetic, and it also has better prompt adherence.
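
For reference, a minimal sketch of this kind of single-GPU + CPU-offload setup with plain transformers/accelerate; the memory limits and dtype here are illustrative assumptions, not the exact script used:

```python
import torch
from transformers import AutoModelForCausalLM

# device_map="auto" lets accelerate keep whatever fits under max_memory in VRAM
# and spill the remaining layers to system RAM. Tune the limits to your machine.
model = AutoModelForCausalLM.from_pretrained(
    "tencent/HunyuanImage-3.0",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "28GiB", "cpu": "160GiB"},
    trust_remote_code=True,  # the repo ships custom modeling code
)
# See the model card for the image-generation entry point exposed by that custom code.
```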

timkhronos avatar Sep 29 '25 13:09 timkhronos

What are your RAM frequency/channels, CPU, and PCIe bandwidth?

kabachuha avatar Sep 29 '25 14:09 kabachuha

Fully consumer system: dual-channel DDR5 running at 6000 MHz with 79 GB/s read bandwidth, a 9950X, and the GPU running at PCIe Gen5 x16 on an X870E motherboard.

timkhronos avatar Sep 29 '25 15:09 timkhronos

I wonder if it would be possible to add 8/4-bit out-of-the-box quantization with bitsandbytes and accelerate, just like ByteDance's BAGEL does successfully in https://github.com/ByteDance-Seed/Bagel/blob/7026cfa0a4df274460d0b0b990117398a4ec6fca/app.py#L115-L122 (also see ComfyUI-Bagel).

On top of that, NF4 quantization will create a folder with the quantized 4-bit model, and it is simple to upload it to Hugging Face for people to download quickly!
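
A minimal sketch of what that could look like with the standard transformers + bitsandbytes path; whether it plays nicely with this model's custom code is untested:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Same idea as the Bagel snippet linked above: quantize to NF4 on load.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "tencent/HunyuanImage-3.0",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# 4-bit checkpoints can be serialized and re-uploaded, assuming the custom
# modeling code doesn't interfere (see the follow-up comment below).
model.save_pretrained("HunyuanImage-3.0-nf4")
```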

kabachuha avatar Sep 30 '25 08:09 kabachuha

I tried load_and_quantize_model to nf4 with bitsandbytes, but the saved model size is the same as the original one for some reason. Maybe it has too much custom code.

kabachuha avatar Oct 01 '25 09:10 kabachuha

> This looks like a model that people are going to try once, wait more than 10 minutes for a 1MP image and then never use again so it's probably not worth implementing.

@comfyanonymous So you will not support this model?

nitinh12 avatar Oct 02 '25 08:10 nitinh12

It's a very inefficient model that doesn't really do anything special and the license on the code sucks.

The only way this is getting implemented is if they do it themselves, change the license on the code, or release a better model.

comfyanonymous avatar Oct 04 '25 02:10 comfyanonymous

Your comments have raised some concerns for me regarding the future direction of ComfyUI.

It seems likely that future Sora2-class models, and others that enable advanced text-image interaction, will be what we might call "world models," built on vision LLMs with tens or hundreds of billions of parameters. I feel that HunyuanImage-3.0 is also aiming to be such a world model, rather than just a simple image generation and editing tool. The images posted on X, which suggest an understanding of mathematical concepts, seem to be evidence of this.

If the reason for not supporting it is simply that it's "huge and inefficient," it leads to a concern that future, more capable (and likely equally huge) world models might not be supported either.

However, you also stated you would consider implementing it if "they change the license on the code or they release a better model." The roadmap for HunyuanImage-3.0 includes "Image-to-Image Generation" and "Multi-turn Interaction." If the development team implements these features and their performance proves it to be a "better model," can I assume that you would then be willing to consider supporting it in ComfyUI?

souki202 avatar Oct 04 '25 03:10 souki202

Go implement it yourself and you'll see how little people actually care about this model.

If you want "an understanding of mathematical concepts" you can just stick an LLM in front to enhance the prompt of any of the popular diffusion models and you will get better results than HunyuanImage 3.0.

If someone releases a big model that is good for its size I will implement it.

comfyanonymous avatar Oct 04 '25 04:10 comfyanonymous

> Your comments have raised some concerns for me regarding the future direction of ComfyUI.
>
> It seems likely that future Sora2-class models, and others that enable advanced text-image interaction, will be what we might call "world models," built on vision LLMs with tens or hundreds of billions of parameters. I feel that HunyuanImage-3.0 is also aiming to be such a world model, rather than just a simple image generation and editing tool. The images posted on X, which suggest an understanding of mathematical concepts, seem to be evidence of this.
>
> If the reason for not supporting it is simply that it's "huge and inefficient," it leads to a concern that future, more capable (and likely equally huge) world models might not be supported either.
>
> However, you also stated you would consider implementing it if "they change the license on the code or they release a better model." The roadmap for HunyuanImage-3.0 includes "Image-to-Image Generation" and "Multi-turn Interaction." If the development team implements these features and their performance proves it to be a "better model," can I assume that you would then be willing to consider supporting it in ComfyUI?

You are absolutely correct. I am not sure whether Comfy has even used this model yet. I asked for an image explaining how to take cinematic fashion shots; all of the text is self-generated by the model. It's an open-source GPT Image 1. I also think that future models will be like this as well. They will release vLLM support this month.

[Image attachment]

nitinh12 avatar Oct 04 '25 11:10 nitinh12