Last update tanked speed
No settings changed: flux-dev on an RTX 3090 with async/shared swap and 21500 MB GPU weights. This used to give me 1.97 s/it; it has now dropped to 14.4 s/it, fluctuating between that and even slower speeds. What happened?
3090 here, same settings, plus Queue and Shared, and Diffusion in Low Bits set to Automatic (LoRA in fp16). After updating to the latest commit today, speed indeed went down, and generation now crashes every time I use a LoRA, at the point it prints "Unloading KModel".
Yes, lots of crashes here at that point as well.
Here too. With GGUF Q4 the worst case used to be about 3 s/step; now the best case is 25-28 s/step. It's as if it had locked up.
It will definitely be fixed and back to normal soon....
So, what is the last commit that works as it should?
- The responses in this thread describe completely different problems. Do not use this issue to pile up unrelated problems.
- For crashes during inference/LoRA, this link solves the problem 100% of the time: https://github.com/lllyasviel/stable-diffusion-webui-forge/discussions/1474 Here, the 100% is a real number. We have been fighting this exact problem for 2 years, and it has worked 100% of the time.
- For inference slowing down: if you do not provide a full console log, your report will likely be ignored. It is even better if you can identify a commit that is not slow and provide two console logs for a before/after comparison (see the sketch after this list).
- Please note that, unlike the original Automatic1111 webui, Forge does solve performance problems, and in most cases users are very happy after the problem is solved. A typical procedure for how we resolve user reports: https://github.com/lllyasviel/stable-diffusion-webui-forge/issues/1502 This requires your cooperation in following the instructions in (3). If you do not follow (3), your report is likely to be ignored, because we will have no idea what happened on your machine.

Please read the above and paste the full console log.
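For the before/after comparison, here is a minimal sketch of one way to capture both logs from a Windows console. It assumes a standard git clone of Forge and that your launch script is webui.bat (yours may be webui-user.bat); the commit hash is only an illustration, taken from a log later in this thread:

:: list recent commits and pick one from before the slowdown
git log --oneline -30
:: check out the old commit, run once, and save the console output
git checkout 4c9380c46ab9e046404ed2d068c6132e90661fbe
webui.bat > before.log 2>&1
:: return to the current code ('main' assumed; use the branch you normally run) and capture the slow run
git checkout main
webui.bat > after.log 2>&1

Attach before.log and after.log to the report.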
Same here. I'm using bnb nf4 and I noticed it too!
Sorry about that, here's the info:
[Unload] Trying to free 4495.77 MB for cuda:0 with 0 models keep loaded ... Current free memory is 30867.09 MB ... Done.
[Memory Management] Target: IntegratedAutoencoderKL, Free GPU: 21322.09 MB, Model Require: 159.87 MB, Inference Require: 4034.00 MB, Remaining: 17128.22 MB, All loaded to GPU.
Moving model(s) has taken 1.19 seconds
[LORA] Loaded C:\Users\Forge\Forge_Flux\webui\models\Lora\Lora_People\Real People\wednesday_addams_flux_lora_v1_000001600.safetensors for KModel-UNet with 494 keys at weight 1.0 (skipped 0 keys)
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
[Unload] Trying to free 16568.57 MB for cuda:0 with 0 models keep loaded ... Current free memory is 21148.98 MB ... Done.
[Memory Management] Target: JointTextEncoder, Free GPU: 21148.98 MB, Model Require: 9641.98 MB, Inference Require: 4034.00 MB, Remaining: 7473.00 MB, All loaded to GPU.
Moving model(s) has taken 4.00 seconds
Distilled CFG Scale: 3
Reuse 1 loaded models
[Unload] Trying to free 33544.18 MB for cuda:0 with 0 models keep loaded ... Current free memory is 449.58 MB ... Unload model IntegratedAutoencoderKL Current free memory is 613.69 MB ... Unload model JointTextEncoder Done.
[Memory Management] Target: KModel, Free GPU: 8637.29 MB, Model Require: 20808.00 MB, Inference Require: 4034.00 MB, Remaining: -16204.71 MB, Shared Swap Loaded (asynchronous method): 19188.00 MB, GPU Loaded: 3512.13 MB
Patched LoRAs by precomputing model weights; Moving model(s) has taken 41.20 seconds
68%|███████████████████████████████████████████████████████▊ | 34/50 [02:20<00:55, 3.44s/it]
Total progress: 1%|▌ | 43/5000 [07:22<4:44:14, 3.44s/it]
Last night's update did increase the speed, but it is still about double what I was getting before (around 1.97 s/it). I don't have a before console log, sorry about that.
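For anyone trying to read these memory-management lines, the arithmetic appears to be (inferred from the numbers in the log itself, not from the Forge source): Remaining = Free GPU - Model Require - Inference Require. In the JointTextEncoder line above, 21148.98 - 9641.98 - 4034.00 = 7473.00 MB, which is positive, so the model is fully loaded to GPU. In the KModel line, 8637.29 - 20808.00 - 4034.00 = -16204.71 MB, which is negative, so most of the model is kept in shared swap instead; note that Shared Swap Loaded plus GPU Loaded (19188.00 + 3512.13 = 22700.13 MB) adds back up to the full bf16 KModel footprint seen elsewhere in these logs.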
@PixelClassicist This log shows a severe memory leak problem. Do you have the full log from the beginning?
Yes, here you go. Thanks for helping out and taking a look.
venv "C:\Users\Forge\Forge_Flux\webui\venv\Scripts\Python.exe"
Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
Version: f2.0.1v1.10.1-previous-469-g4c9380c4
Commit hash: 4c9380c46ab9e046404ed2d068c6132e90661fbe
Launching Web UI with arguments:
Total VRAM 24575 MB, total RAM 130749 MB
pytorch version: 2.3.1+cu121
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3090 : native
Hint: your device supports --cuda-malloc for potential speed improvements.
VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16
CUDA Using Stream: False
Using pytorch cross attention
Using pytorch attention for VAE
ControlNet preprocessor location: C:\Users\Forge\Forge_Flux\webui\models\ControlNetPreprocessor
CivitAI Browser+: Aria2 RPC started
Using sqlite file: C:\Users\Forge\Forge_Flux\webui\extensions\sd-webui-agent-scheduler\task_scheduler.sqlite3
*** Error loading script: task_scheduler.py
Traceback (most recent call last):
File "C:\Users\Forge\Forge_Flux\webui\modules\scripts.py", line 525, in load_scripts
script_module = script_loading.load_module(scriptfile.path)
File "C:\Users\Forge\Forge_Flux\webui\modules\script_loading.py", line 13, in load_module
module_spec.loader.exec_module(module)
File "
2024-08-30 10:27:43,852 - ControlNet - INFO - ControlNet UI callback registered.
C:\Users\Forge\Forge_Flux\webui\extensions\sd-civitai-browser-plus\scripts\civitai_gui.py:204: GradioDeprecationWarning: unexpected argument for Button: label
refresh = gr.Button(label="", value="", elem_id=refreshbtn, icon="placeholder")
Model selected: {'checkpoint_info': {'filename': 'C:\Users\Forge\Forge_Flux\webui\models\Stable-diffusion\Flux\flux1-dev.safetensors', 'hash': 'b04b3ba1'}, 'additional_modules': ['C:\Users\Forge\Forge_Flux\webui\models\VAE\ae.safetensors', 'C:\Users\Forge\Forge_Flux\webui\models\text_encoder\clip_l.safetensors', 'C:\Users\Forge\Forge_Flux\webui\models\text_encoder\t5xxl_fp16.safetensors'], 'unet_storage_dtype': None}
Using online LoRAs in FP16: False
Running on local URL: http://127.0.0.1:7860
To create a public link, set share=True in launch().
Startup time: 23.6s (prepare environment: 4.7s, import torch: 9.4s, initialize shared: 0.2s, other imports: 0.7s, load scripts: 3.2s, create ui: 3.3s, gradio launch: 2.0s).
Environment vars changed: {'stream': True, 'inference_memory': 4034.0, 'pin_shared_memory': True}
[GPU Setting] You will use 83.58% GPU memory (20541.00 MB) to load weights, and use 16.42% GPU memory (4034.00 MB) to do matrix computation.
Environment vars changed: {'stream': False, 'inference_memory': 1024.0, 'pin_shared_memory': False}
Environment vars changed: {'stream': False, 'inference_memory': 1024.0, 'pin_shared_memory': False}
[Low VRAM Warning] You just set Forge to use 100% GPU memory (23551.00 MB) to load model weights.
[Low VRAM Warning] This means you will have 0% GPU memory (0.00 MB) to do matrix computation. Computations may fallback to CPU or go Out of Memory.
[Low VRAM Warning] In many cases, image generation will be 10x slower.
[Low VRAM Warning] Make sure that you know what you are testing.
Environment vars changed: {'stream': True, 'inference_memory': 4034.0, 'pin_shared_memory': True}
[GPU Setting] You will use 83.58% GPU memory (20541.00 MB) to load weights, and use 16.42% GPU memory (4034.00 MB) to do matrix computation.
Loading Model: {'checkpoint_info': {'filename': 'C:\Users\Forge\Forge_Flux\webui\models\Stable-diffusion\Flux\flux1-dev.safetensors', 'hash': 'b04b3ba1'}, 'additional_modules': ['C:\Users\Forge\Forge_Flux\webui\models\VAE\ae.safetensors', 'C:\Users\Forge\Forge_Flux\webui\models\text_encoder\clip_l.safetensors', 'C:\Users\Forge\Forge_Flux\webui\models\text_encoder\t5xxl_fp16.safetensors'], 'unet_storage_dtype': None}
[Unload] Trying to free all memory for cuda:0 with 0 models keep loaded ... Done.
StateDict Keys: {'transformer': 780, 'vae': 244, 'text_encoder': 196, 'text_encoder_2': 220, 'ignore': 0}
Using Default T5 Data Type: torch.float16
Working with z of shape (1, 16, 32, 32) = 16384 dimensions.
K-Model Created: {'storage_dtype': torch.bfloat16, 'computation_dtype': torch.bfloat16}
Model loaded in 1.8s (unload existing model: 0.3s, forge model load: 1.5s).
INFO:sd_dynamic_prompts.dynamic_prompting:Prompt matrix will create 100 images in a total of 100 batches.
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
[Unload] Trying to free 16474.34 MB for cuda:0 with 0 models keep loaded ... Done.
[Memory Management] Target: JointTextEncoder, Free GPU: 23293.00 MB, Model Require: 9569.49 MB, Inference Require: 4034.00 MB, Remaining: 9689.51 MB, All loaded to GPU.
Moving model(s) has taken 13.84 seconds
Distilled CFG Scale: 3
[Unload] Trying to free 33544.18 MB for cuda:0 with 0 models keep loaded ... Current free memory is 12346.93 MB ... Unload model JointTextEncoder Done.
[Memory Management] Target: KModel, Free GPU: 21430.53 MB, Model Require: 22700.13 MB, Inference Require: 4034.00 MB, Remaining: -5303.60 MB, Shared Swap Loaded (asynchronous method): 6624.00 MB, GPU Loaded: 16076.13 MB
Moving model(s) has taken 21.69 seconds
12%|█████████▉ | 6/50 [00:56<07:34, 10.33s/it]
Environment vars changed: {'stream': True, 'inference_memory': 4034.0, 'pin_shared_memory': True}
[GPU Setting] You will use 83.58% GPU memory (20541.00 MB) to load weights, and use 16.42% GPU memory (4034.00 MB) to do matrix computation.
14%|███████████▌ | 7/50 [01:07<07:41, 10.73s/it]
Environment vars changed: {'stream': True, 'inference_memory': 3075.0, 'pin_shared_memory': True}
[GPU Setting] You will use 87.49% GPU memory (21500.00 MB) to load weights, and use 12.51% GPU memory (3075.00 MB) to do matrix computation.
28%|██████████████████████▉ | 14/50 [02:37<06:44, 11.23s/it]
[Unload] Trying to free 6668.46 MB for cuda:0 with 0 models keep loaded ... Current free memory is 9345.08 MB ... Done.
[Memory Management] Target: IntegratedAutoencoderKL, Free GPU: 7122.08 MB, Model Require: 159.87 MB, Inference Require: 3075.00 MB, Remaining: 3887.21 MB, All loaded to GPU.
Moving model(s) has taken 0.28 seconds
[LORA] Loaded C:\Users\Forge\Forge_Flux\webui\models\Lora\Lora_People\Real People\Halle_Berry.safetensors for KModel-UNet with 304 keys at weight 1.0 (skipped 0 keys)
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
[Unload] Trying to free 15609.57 MB for cuda:0 with 0 models keep loaded ... Current free memory is 6950.19 MB ... Unload model KModel Current free memory is 23026.37 MB ... Done.
[Memory Management] Target: JointTextEncoder, Free GPU: 23026.37 MB, Model Require: 9641.98 MB, Inference Require: 3075.00 MB, Remaining: 10309.39 MB, All loaded to GPU.
Moving model(s) has taken 13.14 seconds
Distilled CFG Scale: 3
[Unload] Trying to free 32585.18 MB for cuda:0 with 0 models keep loaded ... Current free memory is 1061.21 MB ... Unload model IntegratedAutoencoderKL Current free memory is 1226.96 MB ... Unload model JointTextEncoder Done.
[Memory Management] Target: KModel, Free GPU: 11086.53 MB, Model Require: 22700.13 MB, Inference Require: 3075.00 MB, Remaining: -14688.61 MB, Shared Swap Loaded (asynchronous method): 15984.00 MB, GPU Loaded: 6716.13 MB
Patched LoRAs by precomputing model weights; Moving model(s) has taken 47.72 seconds
0%| | 0/50 [00:00<?, ?it/s]
[Unload] Trying to free 6668.46 MB for cuda:0 with 0 models keep loaded ... Current free memory is 16072.89 MB ... Done.
[Memory Management] Target: IntegratedAutoencoderKL, Free GPU: 16464.87 MB, Model Require: 159.87 MB, Inference Require: 3075.00 MB, Remaining: 13230.00 MB, All loaded to GPU.
Moving model(s) has taken 0.29 seconds
Total progress: 0%|▏ | 14/5000 [03:31<20:53:47, 15.09s/it]
[Unload] Trying to free all memory for cuda:0 with 0 models keep loaded ... Current free memory is 16304.25 MB ... Unload model KModel Current free memory is 23020.43 MB ... Unload model IntegratedAutoencoderKL Done.
INFO:sd_dynamic_prompts.dynamic_prompting:Prompt matrix will create 100 images in a total of 100 batches.
[LORA] Loaded C:\Users\Forge\Forge_Flux\webui\models\Lora\Lora_People\Real People\eva_green_flux_lora_v1_000001200.safetensors for KModel-UNet with 494 keys at weight 1.0 (skipped 0 keys)
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
[Unload] Trying to free 15609.57 MB for cuda:0 with 0 models keep loaded ... Done.
[Memory Management] Target: JointTextEncoder, Free GPU: 23182.53 MB, Model Require: 9641.98 MB, Inference Require: 3075.00 MB, Remaining: 10465.55 MB, All loaded to GPU.
Moving model(s) has taken 3.83 seconds
Distilled CFG Scale: 3
[Unload] Trying to free 32585.18 MB for cuda:0 with 0 models keep loaded ... Current free memory is 637.90 MB ... Unload model JointTextEncoder Done.
[Memory Management] Target: KModel, Free GPU: 9984.96 MB, Model Require: 22700.13 MB, Inference Require: 3075.00 MB, Remaining: -15790.17 MB, Shared Swap Loaded (asynchronous method): 17136.00 MB, GPU Loaded: 5564.13 MB
Patched LoRAs by precomputing model weights; Moving model(s) has taken 55.88 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [02:37<00:00, 3.15s/it]
[Unload] Trying to free 6668.46 MB for cuda:0 with 0 models keep loaded ... Current free memory is 17361.30 MB ... Done.
[Memory Management] Target: IntegratedAutoencoderKL, Free GPU: 17601.39 MB, Model Require: 159.87 MB, Inference Require: 3075.00 MB, Remaining: 14366.51 MB, All loaded to GPU.
Moving model(s) has taken 0.20 seconds
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
[Unload] Trying to free 15609.57 MB for cuda:0 with 0 models keep loaded ... Current free memory is 17441.19 MB ... Done.
[Memory Management] Target: JointTextEncoder, Free GPU: 17441.19 MB, Model Require: 9641.98 MB, Inference Require: 3075.00 MB, Remaining: 4724.21 MB, All loaded to GPU.
Moving model(s) has taken 3.73 seconds
Distilled CFG Scale: 3
Reuse 1 loaded models
[Unload] Trying to free 32585.18 MB for cuda:0 with 0 models keep loaded ... Current free memory is 13782.38 MB ... Unload model IntegratedAutoencoderKL Current free memory is 13944.50 MB ... Unload model JointTextEncoder Done.
[Memory Management] Target: KModel, Free GPU: 17279.90 MB, Model Require: 17136.00 MB, Inference Require: 3075.00 MB, Remaining: -2931.10 MB, Shared Swap Loaded (asynchronous method): 9792.00 MB, GPU Loaded: 12908.13 MB
Moving model(s) has taken 15.29 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [05:38<00:00, 6.77s/it]
[Unload] Trying to free 6668.46 MB for cuda:0 with 0 models keep loaded ... Current free memory is 11578.93 MB ... Done.
[Memory Management] Target: IntegratedAutoencoderKL, Free GPU: 10251.93 MB, Model Require: 159.87 MB, Inference Require: 3075.00 MB, Remaining: 7017.06 MB, All loaded to GPU.
Moving model(s) has taken 0.18 seconds
activating extra network lora with arguments [<modules.extra_networks.ExtraNetworkParams object at 0x000001CC09684BE0>, <modules.extra_networks.ExtraNetworkParams object at 0x000001CC081B5A20>]: AttributeError
Traceback (most recent call last):
File "C:\Users\Forge\Forge_Flux\webui\extensions-builtin\sd_forge_lora\networks.py", line 92, in load_networks
net = load_network(name, network_on_disk)
File "C:\Users\Forge\Forge_Flux\webui\extensions-builtin\sd_forge_lora\networks.py", line 61, in load_network
net.mtime = os.path.getmtime(network_on_disk.filename)
AttributeError: 'NoneType' object has no attribute 'filename'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Forge\Forge_Flux\webui\modules\extra_networks.py", line 135, in activate
extra_network.activate(p, extra_network_args)
File "C:\Users\Forge\Forge_Flux\webui\extensions-builtin\sd_forge_lora\extra_networks_lora.py", line 45, in activate
networks.load_networks(names, te_multipliers, unet_multipliers, dyn_dims)
File "C:\Users\Forge\Forge_Flux\webui\extensions-builtin\sd_forge_lora\networks.py", line 94, in load_networks
errors.display(e, f"loading network {network_on_disk.filename}")
AttributeError: 'NoneType' object has no attribute 'filename'
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
[Unload] Trying to free 15609.57 MB for cuda:0 with 0 models keep loaded ... Current free memory is 10091.31 MB ... Unload model KModel Current free memory is 22999.48 MB ... Done.
[Memory Management] Target: JointTextEncoder, Free GPU: 22999.48 MB, Model Require: 9641.98 MB, Inference Require: 3075.00 MB, Remaining: 10282.50 MB, All loaded to GPU.
Moving model(s) has taken 12.04 seconds
Distilled CFG Scale: 3
[Unload] Trying to free 32585.18 MB for cuda:0 with 0 models keep loaded ... Current free memory is 352.86 MB ... Unload model IntegratedAutoencoderKL Current free memory is 514.96 MB ... Unload model JointTextEncoder Done.
[Memory Management] Target: KModel, Free GPU: 9589.18 MB, Model Require: 22700.13 MB, Inference Require: 3075.00 MB, Remaining: -16185.96 MB, Shared Swap Loaded (asynchronous method): 17496.00 MB, GPU Loaded: 5204.13 MB
Moving model(s) has taken 8.97 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [17:39<00:00, 21.18s/it]
[Unload] Trying to free 6668.46 MB for cuda:0 with 0 models keep loaded ... Current free memory is 28730.08 MB ... Done.
[Memory Management] Target: IntegratedAutoencoderKL, Free GPU: 17949.08 MB, Model Require: 159.87 MB, Inference Require: 3075.00 MB, Remaining: 14714.21 MB, All loaded to GPU.
Moving model(s) has taken 1.00 seconds
[LORA] Loaded C:\Users\Forge\Forge_Flux\webui\models\Lora\Lora_People\Real People\SashaGrey.safetensors for KModel-UNet with 304 keys at weight 1.0 (skipped 0 keys)
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
[Unload] Trying to free 15609.57 MB for cuda:0 with 0 models keep loaded ... Current free memory is 17787.70 MB ... Done.
[Memory Management] Target: JointTextEncoder, Free GPU: 17787.70 MB, Model Require: 9641.98 MB, Inference Require: 3075.00 MB, Remaining: 5070.72 MB, All loaded to GPU.
Moving model(s) has taken 3.28 seconds
Distilled CFG Scale: 3
Reuse 1 loaded models
[Unload] Trying to free 32585.18 MB for cuda:0 with 0 models keep loaded ... Current free memory is 666.56 MB ... Unload model IntegratedAutoencoderKL Current free memory is 830.38 MB ... Unload model JointTextEncoder Done.
[Memory Management] Target: KModel, Free GPU: 4883.74 MB, Model Require: 17496.00 MB, Inference Require: 3075.00 MB, Remaining: -15687.26 MB, Shared Swap Loaded (asynchronous method): 21348.00 MB, GPU Loaded: 1352.13 MB
Patched LoRAs by precomputing model weights; Moving model(s) has taken 42.14 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [03:14<00:00, 3.89s/it]
[Unload] Trying to free 6668.46 MB for cuda:0 with 0 models keep loaded ... Current free memory is 21786.94 MB ... Done.
[Memory Management] Target: IntegratedAutoencoderKL, Free GPU: 21786.94 MB, Model Require: 159.87 MB, Inference Require: 3075.00 MB, Remaining: 18552.06 MB, All loaded to GPU.
Moving model(s) has taken 0.18 seconds
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
[Unload] Trying to free 15609.57 MB for cuda:0 with 0 models keep loaded ... Current free memory is 21626.16 MB ... Done.
[Memory Management] Target: JointTextEncoder, Free GPU: 21626.16 MB, Model Require: 9641.98 MB, Inference Require: 3075.00 MB, Remaining: 8909.18 MB, All loaded to GPU.
Moving model(s) has taken 2.89 seconds
Distilled CFG Scale: 3
Reuse 1 loaded models
[Unload] Trying to free 32585.18 MB for cuda:0 with 0 models keep loaded ... Current free memory is 463.69 MB ... Unload model IntegratedAutoencoderKL Current free memory is 626.48 MB ... Unload model JointTextEncoder Done.
[Memory Management] Target: KModel, Free GPU: 9845.37 MB, Model Require: 21348.00 MB, Inference Require: 3075.00 MB, Remaining: -14577.63 MB, Shared Swap Loaded (asynchronous method): 17280.00 MB, GPU Loaded: 5420.13 MB
Moving model(s) has taken 14.99 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [17:43<00:00, 21.27s/it]
[Unload] Trying to free 6668.46 MB for cuda:0 with 0 models keep loaded ... Current free memory is 23706.08 MB ... Done.
[Memory Management] Target: IntegratedAutoencoderKL, Free GPU: 17719.08 MB, Model Require: 159.87 MB, Inference Require: 3075.00 MB, Remaining: 14484.21 MB, All loaded to GPU.
Moving model(s) has taken 0.81 seconds
@PixelClassicist Update and try again
Just wanted to say thank you; that last push had a significant impact on some of the performance issues I was experiencing.
@PixelClassicist Update and try again
Sadly, there doesn't seem to be much of a difference. At first I thought LoRAs might have been tanking the speed, but this is without LoRAs on the latest update:
venv "C:\Users\Forge\Forge_Flux\webui\venv\Scripts\Python.exe"
Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
Version: f2.0.1v1.10.1-previous-501-g668e87f9
Commit hash: 668e87f920be30001bb87214d9001bf59f2da275
Launching Web UI with arguments:
Total VRAM 24575 MB, total RAM 130749 MB
pytorch version: 2.3.1+cu121
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3090 : native
Hint: your device supports --cuda-malloc for potential speed improvements.
VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16
CUDA Using Stream: False
Using pytorch cross attention
Using pytorch attention for VAE
ControlNet preprocessor location: C:\Users\Forge\Forge_Flux\webui\models\ControlNetPreprocessor
CivitAI Browser+: Aria2 RPC started
Using sqlite file: C:\Users\Forge\Forge_Flux\webui\extensions\sd-webui-agent-scheduler\task_scheduler.sqlite3
*** Error loading script: task_scheduler.py
Traceback (most recent call last):
File "C:\Users\Forge\Forge_Flux\webui\modules\scripts.py", line 525, in load_scripts
script_module = script_loading.load_module(scriptfile.path)
File "C:\Users\Forge\Forge_Flux\webui\modules\script_loading.py", line 13, in load_module
module_spec.loader.exec_module(module)
File "
2024-09-03 19:22:15,065 - ControlNet - INFO - ControlNet UI callback registered.
C:\Users\Forge\Forge_Flux\webui\extensions\sd-civitai-browser-plus\scripts\civitai_gui.py:204: GradioDeprecationWarning: unexpected argument for Button: label
refresh = gr.Button(label="", value="", elem_id=refreshbtn, icon="placeholder")
Model selected: {'checkpoint_info': {'filename': 'C:\Users\Forge\Forge_Flux\webui\models\Stable-diffusion\Flux\flux1-dev.safetensors', 'hash': 'b04b3ba1'}, 'additional_modules': ['C:\Users\Forge\Forge_Flux\webui\models\VAE\ae.safetensors', 'C:\Users\Forge\Forge_Flux\webui\models\text_encoder\clip_l.safetensors', 'C:\Users\Forge\Forge_Flux\webui\models\text_encoder\t5xxl_fp16.safetensors'], 'unet_storage_dtype': None}
Using online LoRAs in FP16: False
Running on local URL: http://127.0.0.1:7860
To create a public link, set share=True in launch().
Startup time: 22.2s (prepare environment: 4.4s, import torch: 8.7s, initialize shared: 0.2s, other imports: 0.6s, load scripts: 3.1s, create ui: 3.2s, gradio launch: 2.0s).
Environment vars changed: {'stream': True, 'inference_memory': 4075.0, 'pin_shared_memory': True}
[GPU Setting] You will use 83.42% GPU memory (20500.00 MB) to load weights, and use 16.58% GPU memory (4075.00 MB) to do matrix computation.
Environment vars changed: {'stream': False, 'inference_memory': 1024.0, 'pin_shared_memory': False}
Environment vars changed: {'stream': False, 'inference_memory': 1024.0, 'pin_shared_memory': False}
[Low VRAM Warning] You just set Forge to use 100% GPU memory (23551.00 MB) to load model weights.
[Low VRAM Warning] This means you will have 0% GPU memory (0.00 MB) to do matrix computation. Computations may fallback to CPU or go Out of Memory.
[Low VRAM Warning] In many cases, image generation will be 10x slower.
[Low VRAM Warning] To solve the problem, you can set the 'GPU Weights' (on the top of page) to a lower value.
[Low VRAM Warning] If you cannot find 'GPU Weights', you can click the 'all' option in the 'UI' area on the left-top corner of the webpage.
[Low VRAM Warning] Make sure that you know what you are testing.
Environment vars changed: {'stream': True, 'inference_memory': 4075.0, 'pin_shared_memory': True}
[GPU Setting] You will use 83.42% GPU memory (20500.00 MB) to load weights, and use 16.58% GPU memory (4075.00 MB) to do matrix computation.
Loading Model: {'checkpoint_info': {'filename': 'C:\Users\Forge\Forge_Flux\webui\models\Stable-diffusion\Flux\flux1-dev.safetensors', 'hash': 'b04b3ba1'}, 'additional_modules': ['C:\Users\Forge\Forge_Flux\webui\models\VAE\ae.safetensors', 'C:\Users\Forge\Forge_Flux\webui\models\text_encoder\clip_l.safetensors', 'C:\Users\Forge\Forge_Flux\webui\models\text_encoder\t5xxl_fp16.safetensors'], 'unet_storage_dtype': None}
[Unload] Trying to free all memory for cuda:0 with 0 models keep loaded ... Done.
StateDict Keys: {'transformer': 780, 'vae': 244, 'text_encoder': 196, 'text_encoder_2': 220, 'ignore': 0}
Using Default T5 Data Type: torch.float16
Working with z of shape (1, 16, 32, 32) = 16384 dimensions.
K-Model Created: {'storage_dtype': torch.bfloat16, 'computation_dtype': torch.bfloat16}
Model loaded in 1.0s (unload existing model: 0.3s, forge model load: 0.7s).
INFO:sd_dynamic_prompts.dynamic_prompting:Prompt matrix will create 100 images in a total of 100 batches.
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
[Unload] Trying to free 16515.34 MB for cuda:0 with 0 models keep loaded ... Done.
[Memory Management] Target: JointTextEncoder, Free GPU: 23293.00 MB, Model Require: 9569.49 MB, Previously Loaded: 0.00 MB, Inference Require: 4075.00 MB, Remaining: 9648.51 MB, All loaded to GPU.
Moving model(s) has taken 7.08 seconds
Distilled CFG Scale: 3
[Unload] Trying to free 33585.18 MB for cuda:0 with 0 models keep loaded ... Current free memory is 350.43 MB ... Unload model JointTextEncoder Done.
[Memory Management] Target: KModel, Free GPU: 6870.03 MB, Model Require: 22700.13 MB, Previously Loaded: 0.00 MB, Inference Require: 4075.00 MB, Remaining: -19905.10 MB, Shared Swap Loaded (asynchronous method): 20592.00 MB, GPU Loaded: 2108.13 MB
Moving model(s) has taken 19.42 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 30/30 [13:13<00:00, 26.46s/it]
[Unload] Trying to free 6668.46 MB for cuda:0 with 0 models keep loaded ... Current free memory is 31269.28 MB ... Done.
[Memory Management] Target: IntegratedAutoencoderKL, Free GPU: 21088.28 MB, Model Require: 159.87 MB, Previously Loaded: 0.00 MB, Inference Require: 4075.00 MB, Remaining: 16853.40 MB, All loaded to GPU.
Moving model(s) has taken 1.00 seconds
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
[Unload] Trying to free 16609.57 MB for cuda:0 with 0 models keep loaded ... Current free memory is 20916.32 MB ... Done.
[Memory Management] Target: JointTextEncoder, Free GPU: 20916.32 MB, Model Require: 9641.98 MB, Previously Loaded: 0.00 MB, Inference Require: 4075.00 MB, Remaining: 7199.34 MB, All loaded to GPU.
Moving model(s) has taken 4.23 seconds
Distilled CFG Scale: 3
[Unload] Trying to free 4075.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 387.21 MB ... Unload model IntegratedAutoencoderKL Current free memory is 551.03 MB ... Unload model JointTextEncoder Current free memory is 10203.50 MB ... Done.
Memory cleanup has taken 5.25 seconds
10%|████████▎ | 3/30 [01:03<10:51, 24.13s/it]
Total progress: 1%|▋ | 33/3000 [14:37<26:29:59, 32.15s/it]
This is the reason I DON'T update until waaaay after releases... I did update, and now I regret it... bummer...
@PixelClassicist Update and try again
@lllyasviel Sadly it's still happening to many of us today, after all the updates: loading is slower and inference takes longer since the update. Also, the LoRA patching progress bar is missing, which makes model loading look like it's hanging instead of loading LoRAs. Is there a way to revert, then?
This problem still persists, btw; logs are provided above. The only way Forge works for me now is set to Queue and Shared, at which point only 13.4 GB of the available 24 GB VRAM is used, giving about 2.65 s/it. That is not slow, but it is too slow for a 3090. Whenever I set it to Async (no matter whether Shared or CPU), usage spikes to 100% VRAM; whatever GPU Weights value I set is totally ignored, and generation drops to about 20 s/it.
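A low-effort way to make reports like this more actionable is to record VRAM usage over a whole generation and compare runs. A minimal sketch using the stock nvidia-smi tool (nothing Forge-specific is assumed; run it in a second console while generating):

:: log used/total VRAM once per second to a CSV file
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 1 > vram.log

Comparing vram.log from a Queue/Shared run against an Async run would show directly whether changing the GPU Weights value moves the memory plateau at all.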