krita-ai-diffusion
Request: update for FLUX NF4 model support - lighter and faster
Request for an update for FLUX NF4 model support - lighter and faster. Yes, I'm a pain in the backside, but my old RTX 3060 8GB will love you forever. I'm shocked how fast it is; SD3 is faster than some SDXL models I use. Loving it. Dad passed away, but as soon as I'm up to it I'll do videos on the CLIP setup and whatnot to get others going and supporting you. Love your work; it means a lot to many people, and to me.
https://openart.ai/workflows/cgtips/comfyui---flux-nf4-model---lighter-and-faster/xgXUBq2E14uoHdyx2LTe
https://www.youtube.com/watch?v=L5BjuPgF1Ds
My fork can use NF4-dev and HYDit, but it only works by checking for the flux1-dev-bnb-nf4.safetensors file name, and HYDit doesn't work for upscaling. It's not a real solution, so I won't open a pull request; it's just for testing. You can try it if you have ComfyUI_bitsandbytes_NF4 installed and working in ComfyUI.
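For reference, here is a minimal sketch of that filename-check approach (illustrative only, not the fork's actual code; `loader_for_checkpoint` is a hypothetical helper, while `CheckpointLoaderNF4` is the node provided by ComfyUI_bitsandbytes_NF4):

```python
# Minimal sketch of filename-based loader selection (illustrative, not the
# fork's actual code). Assumes the ComfyUI_bitsandbytes_NF4 custom node is
# installed, which provides the CheckpointLoaderNF4 node.
def loader_for_checkpoint(filename: str) -> str:
    if "bnb-nf4" in filename.lower():
        return "CheckpointLoaderNF4"  # bitsandbytes NF4-quantized checkpoint
    return "CheckpointLoaderSimple"   # ComfyUI's standard checkpoint loader

print(loader_for_checkpoint("flux1-dev-bnb-nf4.safetensors"))  # CheckpointLoaderNF4
```

The obvious limitation, as noted above, is that any NF4 checkpoint not matching the expected file name falls through to the standard loader and fails.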
lllyasviel's flux1-dev-bnb-nf4 v2 is now available:
https://civitai.com/models/645429 https://huggingface.co/lllyasviel/flux1-dev-bnb-nf4
It seems the NF4 results (improvements) really depend on what GPU and VRAM you have. For those who can run schnell just fine, NF4 may show only a slight decrease in processing time, with no benefit to quality or detail (see: https://www.reddit.com/r/StableDiffusion/comments/1erv8x0/comparison_nf4v2_against_fp8/).
I point this out because NF4 support should come with a warning that it is intended primarily for those who have cards like an RTX 3060. My 3070 runs schnell at around 6 min render time for 1024x1024. I'm going to do some tests with NF4 in ComfyUI soon to see if I get anything better.
Yes please, NF4 is significantly faster on cards with less than 12 GB of VRAM.
I have an RTX 3080 10 GB; 1024x1024 at 20 steps finishes within 26 seconds, including model loading.
RTX 3070 8 GB VRAM with 24 GB RAM, text-to-image @ CFG 3 & 20 steps:
schnell: ~6 min in ComfyUI, ~3 min in Krita, ~10 min with a realism LoRA added
NF4 v2: ~30 min in ComfyUI (CPU at 100%)
NF4 v1: ~15 min in ComfyUI
My CPU is 8 years old, so I'm thinking the optimizations in NF4 are all CPU-side, which isn't my PC's strength, unless I'm doing something terribly wrong. I was using workflows provided by Sebastian Kamph. For now, I'm sticking with schnell.
At 20 steps, my 3070 graphics card takes 1050 seconds. It's too slow!!!
Ouch, you have something going on. My RTX 3060 8 GB is 5 minutes tops at 1024x1024 with the schnell model in Krita, and I watch YouTube while it's working.
I've read somewhere that ComfyUI doesn't seem to play nice with NF4, but Forge is super fast. I'm not sure why the interface would have much to do with render time, but I've seen this happen on my own machine: when using ComfyUI to run schnell I get 8-minute render times, but with Krita it runs faster. I'm really hopeful that when NF4 v2 support comes to Krita I will see amazing run times.
OMG, what magic is in Forge? 54 seconds while using YouTube.
On my system NF4 v1 is significantly faster than any other implementation of FLUX (even GGUF Q4, or NF4 v2 on Forge), but I have a strange bug where NF4 v2 is much, much slower than v1. Give it a try if you get bad performance with NF4. My system: 3070 Ti with 8 GB VRAM & 32 GB RAM.
That being said, the readme for ComfyUI_bitsandbytes_NF4 seems to indicate that GGUF will be the preferred way to load quantized models.
68 seconds today on NF4 v1, and NF4 v2 stays at the same time after changing settings etc., on my RTX 3060 8 GB in Forge; in Krita, schnell is 5 minutes. Note I'm watching YouTube at 1080p, downloading, and have 60 tabs open, as I like to do stuff while it's running. I do have Visual Studio installed with the C++ and Python development workloads, plus the CUDA toolkit, as one YouTuber recommended them, and I do see better speeds with them installed.
NF4 is still faster for me, but GGUF is getting LoRA support (still some minor bugs to iron out, but right now it's working with Q4). Cf. https://github.com/city96/ComfyUI-GGUF
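For anyone curious what that looks like, here is a rough sketch of the loader fragment in a ComfyUI API prompt, assuming ComfyUI-GGUF is installed (the UnetLoaderGGUF node name comes from that repo; the file name is a placeholder for whichever quantization you downloaded into models/unet):

```python
# Sketch of a ComfyUI API prompt fragment that loads a GGUF-quantized FLUX
# UNet via city96's ComfyUI-GGUF custom node. The file name is a placeholder.
gguf_loader = {
    "10": {
        "class_type": "UnetLoaderGGUF",
        "inputs": {"unet_name": "flux1-dev-Q4_K_S.gguf"},
    },
    # The MODEL output of node "10" then feeds the usual sampler chain,
    # with CLIP and VAE loaded separately as for any FLUX UNet-only file.
}
```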
The generation speed of NF4 is very satisfactory on my 2060 Super 8GB. Times using a ComfyUI workflow:
flux_schnell_bnb_nf4 4 steps, 1024x1024 : 23 seconds
flux_dev_bnb_nf4 v1 20 steps, 1024x1024 : 78 seconds
Same. I'm waiting for the plugin to support it too.
Added a temporary solution in #1104 if you want to test things out, and are not afraid of altering the plugin code.
I'm a little bit confused. I can use the default schnell FLUX model that comes with krita-ai-diffusion, but it doesn't work with the NF4 model. I get this error:
File "C:\Users\rainc\github\ComfyUI\ComfyUI\comfy\model_base.py", line 222, in load_model_weights
m, u = self.diffusion_model.load_state_dict(to_load, strict=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\rainc\github\ComfyUI\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 2189, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Flux:
size mismatch for img_in.weight: copying a param with shape torch.Size([98304, 1]) from checkpoint, the shape in current model is torch.Size([3072, 64]).
size mismatch for time_in.in_layer.weight: copying a param with shape torch.Size([393216, 1]) from checkpoint, the shape in current model is torch.Size([3072, 256]).
...(a lot of similar errors)...
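That error is expected when an NF4 checkpoint goes through the standard loader: bitsandbytes stores each quantized weight as a flat packed blob, with two 4-bit values per byte, so a 3072x64 layer (196,608 elements) arrives as torch.Size([98304, 1]) and cannot be mapped onto the regular Flux layout. A quick way to confirm what's in the file (a diagnostic sketch; the path is a placeholder):

```python
# Diagnostic sketch: list the packed NF4 tensors in the checkpoint.
# bitsandbytes stores quantized weights as flat [N, 1] blobs, which is why
# the standard checkpoint loader reports shape mismatches against Flux.
from safetensors import safe_open

with safe_open("flux1-dev-bnb-nf4.safetensors", framework="pt") as f:
    for key in f.keys():
        shape = f.get_slice(key).get_shape()
        if len(shape) == 2 and shape[1] == 1:  # packed NF4 blob
            print(key, shape)
```

Loading the file through the CheckpointLoaderNF4 node from ComfyUI_bitsandbytes_NF4 (or switching to a GGUF build) avoids the mismatch.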
See https://github.com/Acly/krita-ai-diffusion/discussions/1176 for the current state. You can use GGUF models now for 4/5/6/7/8 bit depending on how much VRAM you have.