feat: Longcat-Image / Longcat-Image-Edit support
for https://github.com/leejet/stable-diffusion.cpp/issues/1052
sd.exe --diffusion-model ..\ComfyUI\models\unet\LongCat-Image-Q8_0.gguf --vae ..\ComfyUI\models\vae\flux\ae.safetensors --cfg-scale 4.0 --sampling-method euler -v --clip-on-cpu -p "A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night. The city lights are a chaotic blur of neon orange and cool blue, reflecting on the wet asphalt. The scene evokes a sense of being a single component in a vast machine. Superimposed over the image in a sleek, modern, slightly glitched font is the philosophical quote: \"THE CITY IS A CIRCUIT BOARD, AND I AM A LONG CAT.\" -- moody, atmospheric, profound, dark academic" --preview proj --steps 20 --qwen2vl ..\ComfyUI\models\clip\Qwen2.5-VL-7B-Instruct.Q4_K_M.gguf --diffusion-fa --color -W 1024 -H 1024
Test models (converted to bfl format) can be found here:
- https://huggingface.co/stduhpf/LongCat-Image-gguf/tree/main
- https://huggingface.co/stduhpf/LongCat-Image-Edit-gguf/tree/main
- https://huggingface.co/stduhpf/LongCat-Image-Dev-gguf/tree/main
Inference for models in diffusers format still seems to be broken.
That does look a bit like a circuit board...
TODO for when image generation works
I can't figure out what I'm doing wrong. I think it's supposed to work just like Flux.1, only with different PE indices and the Qwen text encoder... Maybe I'm missing an important detail, but I can't find it.
I tried using my SplitAttention thing on a Flux model converted to diffusers format, and I think I found what isn't working. I will try converting LongCat to Flux format and see if that fixes it.
I think I got it?
With the padding fixed, but with diffusers format:
With the character-level tokenization trick:
This might need testing to make sure the current implementation supports languages that don't use the Latin alphabet. Also, for now it is only applied to text wrapped in single quotes (').
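For illustration, the quote-detection idea could be sketched roughly like this. This is a minimal Python sketch of the concept only, not the actual C++ implementation in this PR; the function name and the regex are my own:

```python
import re

def split_for_char_tokenization(prompt):
    """Split a prompt into (text, is_literal) spans. Spans wrapped in
    single quotes are flagged so they can later be tokenized one
    character at a time instead of with the regular tokenizer."""
    spans = []
    last = 0
    for m in re.finditer(r"'([^']*)'", prompt):
        if m.start() > last:
            spans.append((prompt[last:m.start()], False))  # normal text
        spans.append((m.group(1), True))                   # literal text
        last = m.end()
    if last < len(prompt):
        spans.append((prompt[last:], False))
    return spans

spans = split_for_char_tokenization("a sign that says 'LONG CAT' at night")
# literal spans would then be fed to the tokenizer per character:
char_tokens = [c for text, lit in spans if lit for c in text]
```

As noted above, using `'` as the delimiter collides with apostrophes in ordinary prose, which is why the delimiter later changes.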
Oh no, why are there so many conflicts now?
Using ' as the quote delimiter was a bad idea because it's the same symbol used for apostrophes. I will change it to detect " instead.
It's somehow not fully working yet, but the model can definitely tell it's supposed to be a cat holding a sign, maybe thanks to the vision model.
sd.exe --diffusion-model ..\ComfyUI\models\unet\longcat_edit_bfl_format-Q8_0.gguf --vae ..\ComfyUI\models\vae\flux\ae.safetensors --cfg-scale 4.5 --sampling-method euler -v --offload-to-cpu --preview proj --steps 50 --vae-tile-size 128 --qwen2vl ..\ComfyUI\models\clip\Qwen2.5-VL-7B-Instruct.Q4_K_M.gguf --color --seed 0 -r .\assets\flux\flux1-dev-q8_0.png --llm_vision ..\ComfyUI\models\clip_vision\Qwen2.5-VL-7B-Instruct.mmproj-f16.gguf -p "Change the text to say \"I'm a long one\""
| ref | out |
|---|---|
(Also, I made the change so it now needs double quotes around literal text.)
I somehow couldn't get it to remove the original text, but there it is.
May I ask which ComfyUI node is used to load this GGUF model?
Quoted text now supports UTF-8 encoding properly. (Also, quote characters are no longer stripped from the prompt after being parsed; that seems to help a bit, especially with longer text.)
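The UTF-8 fix matters because splitting the quoted text byte-by-byte shatters multi-byte characters into fragments that no tokenizer vocabulary will match; splitting on decoded codepoints keeps each character whole. A small Python illustration of the difference (my own sketch, not the project code):

```python
text = "ねこ"  # "cat" in Japanese: 2 codepoints, 6 UTF-8 bytes

# Naive byte-level split: each 3-byte character breaks into
# fragments that are not valid characters on their own.
byte_chunks = [bytes([b]) for b in text.encode("utf-8")]

# Codepoint-level split: each element is a complete character
# that can be tokenized individually.
char_chunks = [c for c in text]
```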
@Rocky-Lee-001 I don't think LongCat-Image is natively supported by ComfyUI yet. You could give https://github.com/sooxt98/comfyui_longcat_image a try; maybe it works well with the GGUF node for ComfyUI?
I’m not sure whether I did something wrong on my end, but I got a strange image.
.\bin\Release\sd-cli.exe --diffusion-model ..\models\longcat_bfl_format-Q4_K_M.gguf --vae ..\..\ComfyUI\models\vae\ae.sft --llm ..\..\ComfyUI\models\text_encoders\Qwen2.5-VL-7B-Instruct-Q8_0.gguf -p 'a lovely cat' --cfg-scale 5.0 -v --offload-to-cpu --diffusion-fa
@leejet that's strange. I can reproduce it with the same prompt, though (even with the Q8_0 model), but I hadn't gotten anything like this in my earlier testing. Maybe there's a linear layer that could use scaling?
Does not seem related to seed.
It's a combination of short prompts + low resolution that seems to cause it.