Generation slowed down and hi-res images can no longer be generated
After a forced reinstallation of Windows, I reinstalled this fork and found that generation speed had dropped almost three times, and any attempt at hi-res fix completely "kills" the video card, although there were no such problems before. The same thing happened on the previous system install a few days ago, when I needed to create a clone of the fork: generation was incredibly slow, and any attempt to enable ADetailer or hi-res fix would crash the system.
GPU: Radeon RX 7800 XT
CPU: AMD Ryzen 5 7500F
RAM: 64 GB
HIP: 6.1
Drivers: 24.8.1
The key issue is that the video card's memory becomes overloaded, causing a critical failure in its operation. This problem never occurred in previous versions, but my previous install was on Torch 2.3.1, not 2.6.0; could that be the cause? I'm not very knowledgeable about programming and related topics, so I'm asking for advice.
fatal: No names found, cannot describe anything.
Python 3.11.9 (tags/v3.11.9:de54cf5, Apr 2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
Version: f2.0.1v1.10.1-1.10.1
Commit hash: e07be6a48fc0ae1840b78d5e55ee36ab78396b30
ROCm: agents=['gfx1101']
ROCm: version=6.2, using agent gfx1101
ZLUDA support: experimental
ZLUDA load: path='H:\stable-diffusion-webui-amdgpu-forge.zluda' nightly=False
Skipping onnxruntime installation.
Legacy Preprocessor init warning: Unable to install insightface automatically. Please try run pip install insightface manually.
Launching Web UI with arguments: --skip-ort --use-zluda
Total VRAM 16368 MB, total RAM 65112 MB
pytorch version: 2.6.0+cu118
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon RX 7800 XT [ZLUDA] : native
VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16
CUDA Using Stream: False
Using pytorch cross attention
Using pytorch attention for VAE
ControlNet preprocessor location: H:\stable-diffusion-webui-amdgpu-forge\models\ControlNetPreprocessor
2025-04-27 01:06:08,751 - ControlNet - INFO - ControlNet UI callback registered.
Model selected: {'checkpoint_info': {'filename': 'H:\stable-diffusion-webui-amdgpu-forge\models\Stable-diffusion\noobaiXLNAIXL_epsilonPred11Version.safetensors', 'hash': '1ce6b882'}, 'additional_modules': [], 'unet_storage_dtype': None}
Using online LoRAs in FP16: False
Running on local URL: http://127.0.0.1:7860
To create a public link, set share=True in launch().
Startup time: 9.0s (prepare environment: 0.8s, launcher: 0.2s, import torch: 4.7s, initialize shared: 0.2s, other imports: 0.1s, load scripts: 1.0s, create ui: 1.3s, gradio launch: 0.6s).
Environment vars changed: {'stream': False, 'inference_memory': 1024.0, 'pin_shared_memory': False}
[GPU Setting] You will use 93.74% GPU memory (15344.00 MB) to load weights, and use 6.26% GPU memory (1024.00 MB) to do matrix computation.
Loading Model: {'checkpoint_info': {'filename': 'H:\stable-diffusion-webui-amdgpu-forge\models\Stable-diffusion\noobaiXLNAIXL_epsilonPred11Version.safetensors', 'hash': '1ce6b882'}, 'additional_modules': [], 'unet_storage_dtype': None}
[Unload] Trying to free all memory for cuda:0 with 0 models keep loaded ... Done.
StateDict Keys: {'unet': 1680, 'vae': 248, 'text_encoder': 196, 'text_encoder_2': 518, 'ignore': 0}
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
K-Model Created: {'storage_dtype': torch.float16, 'computation_dtype': torch.float16}
Model loaded in 2.3s (unload existing model: 0.2s, forge model load: 2.0s).
[Unload] Trying to free 3051.58 MB for cuda:0 with 0 models keep loaded ... Done.
[Memory Management] Target: JointTextEncoder, Free GPU: 13891.45 MB, Model Require: 1559.68 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: 11307.78 MB, All loaded to GPU.
Moving model(s) has taken 0.79 seconds
[Unload] Trying to free 1024.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 12126.80 MB ... Done.
[Unload] Trying to free 7656.40 MB for cuda:0 with 0 models keep loaded ... Current free memory is 12127.94 MB ... Done.
[Memory Management] Target: KModel, Free GPU: 12127.94 MB, Model Require: 4897.05 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: 6206.89 MB, All loaded to GPU.
Moving model(s) has taken 2.46 seconds
Compilation is in progress. Please wait...
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:18<00:00, 1.09it/s]
[Unload] Trying to free 4495.36 MB for cuda:0 with 0 models keep loaded ... Current free memory is 7036.98 MB ... Done.
[Memory Management] Target: IntegratedAutoencoderKL, Free GPU: 7028.98 MB, Model Require: 159.56 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: 5845.43 MB, All loaded to GPU.
Moving model(s) has taken 0.22 seconds
Total progress: 100%|██████████████████████████████████████████████████████████████████| 20/20 [00:24<00:00, 1.23s/it]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 20/20 [00:24<00:00, 1.12it/s]
Hey, you have Python 3.11 installed, but 3.10.11 64-bit is the recommended version.
Uninstall 3.11, install 3.10.11, then delete the venv folder and relaunch webui-user.bat.
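For reference, a minimal sketch of those steps from a Windows command prompt, assuming the default folder layout shown in the log above (the drive letter and folder name are taken from that log; adjust to your install):

```bat
:: From the webui folder, after uninstalling 3.11 and installing Python 3.10.11:
cd /d H:\stable-diffusion-webui-amdgpu-forge

:: Delete the old virtual environment so it is rebuilt against 3.10.11
rmdir /s /q venv

:: Relaunch; the venv (and Torch) will be reinstalled on first start
webui-user.bat
```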
Also make sure that Wallpaper Engine is disabled if you have it, and enable "Never OOM" ("For Tiled VAE only") at the bottom of txt2img when using hires fix.
You can also add --cuda-stream --attention-quad to the commandline_args to get better performance.
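Putting that together, a sketch of what webui-user.bat could look like. The first two flags are the ones already used in the launch log above; the last two are the suggested additions (this is an illustrative fragment, not a verified config for this setup):

```bat
@echo off

set PYTHON=
set GIT=
set VENV_DIR=
:: --skip-ort --use-zluda are from the original launch; the rest are the suggested additions
set COMMANDLINE_ARGS=--skip-ort --use-zluda --cuda-stream --attention-quad

call webui.bat
```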
Hi, could this be related to the recent issues on latest? Normal generation works fine with batches of 4, but hires fix fails with an out-of-memory error saying PyTorch tried to allocate ~32 GB of VRAM. In the past I could easily and quickly do 4x batches with hires fix and ADetailer; now it's completely broken. Please help!
https://github.com/lshqqytiger/stable-diffusion-webui-amdgpu-forge/issues/105