stable-diffusion-webui-forge
Stable Diffusion inference speed drops drastically when generating FLUX images with a LoRA (flux1-dev-nf4-v2)
System specs: RTX 3080 10 GB with 32 GB RAM
This is potentially an issue with low VRAM and the system not freeing enough memory before the next generation, but I can't say for sure.
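One way to check the "memory not freed" hypothesis is to compare free VRAM before and after clearing PyTorch's CUDA cache between generations. This is a minimal diagnostic sketch, not part of Forge itself; it assumes PyTorch is installed and falls back to reporting nothing when no CUDA device is available.

```python
def report_free_vram():
    """Return (free_mb, total_mb) on the current CUDA device, or None.

    Clears PyTorch's cached-but-unused allocator blocks first, so the
    reading reflects memory that is genuinely stuck, not just cached.
    """
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()               # release cached, unused blocks
            free_b, total_b = torch.cuda.mem_get_info()
            return free_b / 1024**2, total_b / 1024**2  # MB, like Forge's logs
    except ImportError:
        pass
    return None  # no torch / no CUDA: nothing to measure

print(report_free_vram())
```

If the free figure keeps shrinking across generations even after `empty_cache()`, something is holding references to tensors (e.g. a stale LoRA-patched model) rather than the allocator merely caching.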
Logs for debugging:
Startup time: 95.0s (prepare environment: 27.6s, launcher: 2.2s, import torch: 10.5s, initialize shared: 0.2s, other imports: 7.6s, setup gfpgan: 0.1s, list SD models: 0.2s, load scripts: 40.8s, create ui: 3.8s, gradio launch: 1.9s).
Environment vars changed: {'stream': False, 'inference_memory': 1024.0, 'pin_shared_memory': False}
Loading Model: {'checkpoint_info': {'filename': 'B:\\Games\\Forge\\webui\\models\\Stable-diffusion\\flux1-dev-bnb-nf4-v2.safetensors', 'hash': 'f0770152'}, 'additional_modules': [], 'unet_storage_dtype': None}
StateDict Keys: {'transformer': 1722, 'vae': 244, 'text_encoder': 198, 'text_encoder_2': 220, 'ignore': 0}
Using Detected T5 Data Type: torch.float8_e4m3fn
Using Detected UNet Type: nf4
Using pre-quant state dict!
Working with z of shape (1, 16, 32, 32) = 16384 dimensions.
K-Model Created: {'storage_dtype': 'nf4', 'computation_dtype': torch.bfloat16}
Model loaded in 2.3s (unload existing model: 0.2s, forge model load: 2.1s).
[LORA] Loaded B:\Games\Forge\webui\models\Lora\scg-anatomy-female-5000.safetensors for KModel-UNet with 324 keys at weight 1.0 (skipped 0 keys)
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
To load target model JointTextEncoder
Begin to load 1 model
[Memory Management] Current Free GPU Memory: 8901.92 MB
[Memory Management] Required Model Memory: 5154.62 MB
[Memory Management] Required Inference Memory: 1024.00 MB
[Memory Management] Estimated Remaining GPU Memory: 2723.30 MB
LoRA patching has taken 7.99 seconds
Moving model(s) has taken 8.00 seconds
Distilled CFG Scale: 7
To load target model KModel
Begin to load 1 model
[Memory Management] Current Free GPU Memory: 8814.70 MB
[Memory Management] Required Model Memory: 6246.84 MB
[Memory Management] Required Inference Memory: 1024.00 MB
[Memory Management] Estimated Remaining GPU Memory: 1543.86 MB
Patching LoRAs: 100%|████████████████████████████████████████████████████████████████| 134/134 [00:05<00:00, 26.32it/s]
LoRA patching has taken 7.26 seconds
Moving model(s) has taken 9.30 seconds
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:32<00:00, 1.61s/it]
To load target model IntegratedAutoencoderKL
Begin to load 1 model
[Memory Management] Current Free GPU Memory: 5455.67 MB
[Memory Management] Required Model Memory: 159.87 MB
[Memory Management] Required Inference Memory: 1024.00 MB
[Memory Management] Estimated Remaining GPU Memory: 4271.80 MB
Moving model(s) has taken 0.06 seconds
Total progress: 100%|███████████████████████████████████████████████████████████████████████████████| 20/20 [00:39<00:00, 1.99s/it]
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
To load target model JointTextEncoder
Begin to load 1 model
[Memory Management] Current Free GPU Memory: 8223.38 MB
[Memory Management] Required Model Memory: 5227.11 MB
[Memory Management] Required Inference Memory: 1024.00 MB
[Memory Management] Estimated Remaining GPU Memory: 1972.27 MB
LoRA patching has taken 0.88 seconds
Moving model(s) has taken 4.11 seconds
Distilled CFG Scale: 7
To load target model KModel
Begin to load 1 model
[Memory Management] Current Free GPU Memory: 7209.18 MB
[Memory Management] Required Model Memory: 6246.84 MB
[Memory Management] Required Inference Memory: 1024.00 MB
[Memory Management] Estimated Remaining GPU Memory: -61.66 MB
Patching LoRAs: 100%|█████████████████████████████████████████████████████████████████████████████| 134/134 [01:14<00:00, 1.80it/s]
LoRA patching has taken 74.40 seconds
[Memory Management] Loaded to CPU Swap: 1486.29 MB (blocked method)
[Memory Management] Loaded to GPU: 4760.48 MB
Moving model(s) has taken 77.67 seconds
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:35<00:00, 1.77s/it]
To load target model IntegratedAutoencoderKL
Begin to load 1 model
[Memory Management] Current Free GPU Memory: 7285.02 MB
[Memory Management] Required Model Memory: 159.87 MB
[Memory Management] Required Inference Memory: 1024.00 MB
[Memory Management] Estimated Remaining GPU Memory: 6101.15 MB
Moving model(s) has taken 0.76 seconds
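The `[Memory Management]` lines above appear to follow a simple budget: estimated remaining = free − model − inference reserve. A minimal sketch of that arithmetic (assumed from the log numbers, not Forge's actual implementation) shows why the second KModel load is so much slower: the estimate goes negative, part of the model is "Loaded to CPU Swap", and every sampling step then pays PCIe transfer cost.

```python
def remaining_after_load(free_mb, model_mb, inference_mb=1024.0):
    """Estimated GPU memory left after loading, per Forge's log lines."""
    return free_mb - model_mb - inference_mb

# First KModel load: positive remainder, model fits entirely on the GPU.
r1 = remaining_after_load(8814.70, 6246.84)   # -> 1543.86 MB, as logged

# Second load after re-patching the LoRA: the estimate goes negative,
# so ~1.5 GB of weights end up in CPU swap and steps slow down.
r2 = remaining_after_load(7209.18, 6246.84)   # -> -61.66 MB, as logged
print(r1, r2)
```

Note that only ~87 MB less was free at the second load, but that was enough to cross the threshold, which is why the slowdown looks all-or-nothing on a 10 GB card.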
I'm running into this on a 3090 as well. Wiping everything by relaunching works 90% of the time. It happens mostly after switching checkpoints or adding a LoRA; even Q4 can bump into the ceiling. I've even seen VRAM load increase by 200 MB after removing the LoRA from the prompt. For now I make sure to set everything before generating for the first time, to avoid switching or adding anything later.
I occasionally have this issue with LoRAs made in ai-toolkit, but I can't discern a pattern for reproducing it. Could you look into that as well, please, @lllyasviel?
Can confirm: with an ai-toolkit LoRA, inference is roughly 10x slower on a 3090 Ti.