krita-ai-diffusion
Request: update for FLUX NF4 model support - lighter and faster
Request for an update for FLUX NF4 model support - lighter and faster. Yes, I'm a pain in the backside, but my old RTX 3060 8GB will love you forever. I'm shocked how fast it is; SD3 is faster than some SDXL models I use. Loving it. Dad passed away, but as soon as I'm up to it I'll do videos on the CLIP setup and whatnot to get others going and supporting you. Love your work; it means a lot to many people, and to me.
https://openart.ai/workflows/cgtips/comfyui---flux-nf4-model---lighter-and-faster/xgXUBq2E14uoHdyx2LTe
https://www.youtube.com/watch?v=L5BjuPgF1Ds
My fork can use NF4-dev and HYDit, but it only works by checking for the flux1-dev-bnb-nf4.safetensors file name, and HYDit doesn't work for upscaling. It's not a real solution, so I won't open a pull request; it's just for testing. You can try it if you have ComfyUI_bitsandbytes_NF4 installed and working in ComfyUI.
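For reference, here is a minimal sketch of that filename-check approach (illustrative only, not the fork's actual code; `loader_for_checkpoint` is a hypothetical helper, while `CheckpointLoaderNF4` is the node provided by ComfyUI_bitsandbytes_NF4):

```python
# Minimal sketch of filename-based loader selection (illustrative, not the
# fork's actual code). Assumes the ComfyUI_bitsandbytes_NF4 custom node is
# installed, which provides the CheckpointLoaderNF4 node.
def loader_for_checkpoint(filename: str) -> str:
    if "bnb-nf4" in filename.lower():
        return "CheckpointLoaderNF4"  # bitsandbytes NF4-quantized checkpoint
    return "CheckpointLoaderSimple"   # ComfyUI's standard checkpoint loader

print(loader_for_checkpoint("flux1-dev-bnb-nf4.safetensors"))  # CheckpointLoaderNF4
```

The obvious limitation, as noted above, is that any NF4 checkpoint not matching the expected file name falls through to the standard loader and fails.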
lllyasviel's flux1-dev-bnb-nf4 v2 is now available:
https://civitai.com/models/645429 https://huggingface.co/lllyasviel/flux1-dev-bnb-nf4
It seems the NF4 results (improvements) really depend on what GPU and VRAM you have. For those who can run schnell just fine, NF4 may show only a slight decrease in processing time, with no benefit to quality or detail (see: https://www.reddit.com/r/StableDiffusion/comments/1erv8x0/comparison_nf4v2_against_fp8/).
I point this out because NF4 support should come with a warning that it is intended primarily for those who have cards like an RTX 3060. My 3070 runs schnell at around 6 min render time for 1024x1024. I'm going to do some tests with NF4 in ComfyUI soon to see if I get anything better.
Yes please, NF4 is significantly faster on cards with less than 12 GB of VRAM.
I have an RTX 3080 10 GB; 1024x1024 at 20 steps finishes within 26 seconds, including model loading.
RTX 3070 8 GB VRAM with 24 GB RAM, text-to-image @ CFG 3 & 20 steps:
schnell: ~6 min in ComfyUI, ~3 min in Krita, ~10 min with a realism LoRA added
NF4 v2: ~30 min in ComfyUI (CPU at 100%)
NF4 v1: ~15 min in ComfyUI
My CPU is 8 years old, so I'm thinking the optimizations in NF4 are all CPU-side, which isn't my PC's strength, unless I'm doing something terribly wrong. I was using workflows provided by Sebastian Kamph. For now, I'm sticking with schnell.
At 20 steps, my 3070 graphics card takes 1050 seconds. It's too slow!!!
Ouch, you have something going on. My RTX 3060 8 GB is 5 minutes tops at 1024x1024 with the schnell model in Krita, and I watch YouTube while it's working.
I've read somewhere that ComfyUI doesn't seem to play nice with NF4, but Forge is super fast. I'm not sure why the interface would have much to do with render time, but I've seen this happen on my own machine: when using ComfyUI to run schnell I get 8-minute render times, but with Krita it runs faster. I'm really hopeful that when NF4 v2 support comes to Krita I will see amazing run times.
OMG, what magic is in Forge? 54 seconds while using YouTube.
On my system NF4 v1 is significantly faster than any other implementation of FLUX (even GGUF Q4, or NF4 v2 on Forge), but I have a strange bug where NF4 v2 is much, much slower than v1. Give it a try if you get bad performance with NF4. My system: 3070 Ti with 8 GB VRAM & 32 GB RAM.
That being said, the readme for ComfyUI_bitsandbytes_NF4 seems to indicate that GGUF will be the preferred way to load quantized models.
68 seconds today on NF4 v1, and NF4 v2 stays at the same time after changing settings etc., on my RTX 3060 8 GB in Forge; in Krita, schnell is 5 minutes. Note I'm watching YouTube at 1080p, downloading, and have 60 tabs open, as I like to do stuff while it's running. I do have Visual Studio installed with the C++ and Python development workloads, plus the CUDA toolkit, as one YouTuber recommended them, and I do see better speeds with them installed.
NF4 is still faster for me, but GGUF is getting LoRA support (still some minor bugs to iron out, but right now it's working with Q4). Cf. https://github.com/city96/ComfyUI-GGUF
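For anyone curious what that looks like, here is a rough sketch of the loader fragment in a ComfyUI API prompt, assuming ComfyUI-GGUF is installed (the UnetLoaderGGUF node name comes from that repo; the file name is a placeholder for whichever quantization you downloaded into models/unet):

```python
# Sketch of a ComfyUI API prompt fragment that loads a GGUF-quantized FLUX
# UNet via city96's ComfyUI-GGUF custom node. The file name is a placeholder.
gguf_loader = {
    "10": {
        "class_type": "UnetLoaderGGUF",
        "inputs": {"unet_name": "flux1-dev-Q4_K_S.gguf"},
    },
    # The MODEL output of node "10" then feeds the usual sampler chain,
    # with CLIP and VAE loaded separately as for any FLUX UNet-only file.
}
```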
The generation speed of NF4 is very satisfactory on my 2060 Super 8GB. Times using a ComfyUI workflow:
flux_schnell_bnb_nf4 4 steps, 1024x1024 : 23 seconds
flux_dev_bnb_nf4 v1 20 steps, 1024x1024 : 78 seconds
Same. I'm waiting for the plugin to support it too.
Added a temporary solution in #1104 if you want to test things out, and are not afraid of altering the plugin code.
I'm a little bit confused. I can use the default schnell FLUX model that comes with krita-ai-diffusion, but it doesn't work with the NF4 model. I get this error:
File "C:\Users\rainc\github\ComfyUI\ComfyUI\comfy\model_base.py", line 222, in load_model_weights
m, u = self.diffusion_model.load_state_dict(to_load, strict=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\rainc\github\ComfyUI\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 2189, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Flux:
size mismatch for img_in.weight: copying a param with shape torch.Size([98304, 1]) from checkpoint, the shape in current model is torch.Size([3072, 64]).
size mismatch for time_in.in_layer.weight: copying a param with shape torch.Size([393216, 1]) from checkpoint, the shape in current model is torch.Size([3072, 256]).
...(a lot of similar errors)...
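That error is expected when an NF4 checkpoint goes through the standard loader: bitsandbytes stores each quantized weight as a flat packed blob, with two 4-bit values per byte, so a 3072x64 layer (196,608 elements) arrives as torch.Size([98304, 1]) and cannot be mapped onto the regular Flux layout. A quick way to confirm what's in the file (a diagnostic sketch; the path is a placeholder):

```python
# Diagnostic sketch: list the packed NF4 tensors in the checkpoint.
# bitsandbytes stores quantized weights as flat [N, 1] blobs, which is why
# the standard checkpoint loader reports shape mismatches against Flux.
from safetensors import safe_open

with safe_open("flux1-dev-bnb-nf4.safetensors", framework="pt") as f:
    for key in f.keys():
        shape = f.get_slice(key).get_shape()
        if len(shape) == 2 and shape[1] == 1:  # packed NF4 blob
            print(key, shape)
```

Loading the file through the CheckpointLoaderNF4 node from ComfyUI_bitsandbytes_NF4 (or switching to a GGUF build) avoids the mismatch.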
See https://github.com/Acly/krita-ai-diffusion/discussions/1176 for the current state. You can use GGUF models now for 4/5/6/7/8 bit depending on how much VRAM you have.