Generation slowed down and hi-res images can no longer be generated
After a forced reinstallation of Windows, I reinstalled this fork and found that generation speed had dropped almost three times, and any attempt at hi-res fix completely "kills" the video card, although there were no such problems before. The same thing happened on the previous system install a few days ago, when I needed to create a clone of the fork: generation was incredibly slow, and any attempt to enable ADetailer or hi-res fix would crash the system.
GPU: Radeon RX 7800 XT
CPU: AMD Ryzen 5 7500F
RAM: 64 GB
HIP: 6.1
Drivers: 24.8.1
The key issue is that the video card's memory becomes overloaded, causing a critical failure in its operation. This problem never occurred in previous versions, but my previous install was on Torch 2.3.1, not 2.6.0; could that be the cause? I'm not very knowledgeable about programming and related topics, so I'm asking for advice.
fatal: No names found, cannot describe anything.
Python 3.11.9 (tags/v3.11.9:de54cf5, Apr 2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
Version: f2.0.1v1.10.1-1.10.1
Commit hash: e07be6a48fc0ae1840b78d5e55ee36ab78396b30
ROCm: agents=['gfx1101']
ROCm: version=6.2, using agent gfx1101
ZLUDA support: experimental
ZLUDA load: path='H:\stable-diffusion-webui-amdgpu-forge.zluda' nightly=False
Skipping onnxruntime installation.
Legacy Preprocessor init warning: Unable to install insightface automatically. Please try run pip install insightface manually.
Launching Web UI with arguments: --skip-ort --use-zluda
Total VRAM 16368 MB, total RAM 65112 MB
pytorch version: 2.6.0+cu118
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon RX 7800 XT [ZLUDA] : native
VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16
CUDA Using Stream: False
Using pytorch cross attention
Using pytorch attention for VAE
ControlNet preprocessor location: H:\stable-diffusion-webui-amdgpu-forge\models\ControlNetPreprocessor
2025-04-27 01:06:08,751 - ControlNet - INFO - ControlNet UI callback registered.
Model selected: {'checkpoint_info': {'filename': 'H:\stable-diffusion-webui-amdgpu-forge\models\Stable-diffusion\noobaiXLNAIXL_epsilonPred11Version.safetensors', 'hash': '1ce6b882'}, 'additional_modules': [], 'unet_storage_dtype': None}
Using online LoRAs in FP16: False
Running on local URL: http://127.0.0.1:7860
To create a public link, set share=True in launch().
Startup time: 9.0s (prepare environment: 0.8s, launcher: 0.2s, import torch: 4.7s, initialize shared: 0.2s, other imports: 0.1s, load scripts: 1.0s, create ui: 1.3s, gradio launch: 0.6s).
Environment vars changed: {'stream': False, 'inference_memory': 1024.0, 'pin_shared_memory': False}
[GPU Setting] You will use 93.74% GPU memory (15344.00 MB) to load weights, and use 6.26% GPU memory (1024.00 MB) to do matrix computation.
Loading Model: {'checkpoint_info': {'filename': 'H:\stable-diffusion-webui-amdgpu-forge\models\Stable-diffusion\noobaiXLNAIXL_epsilonPred11Version.safetensors', 'hash': '1ce6b882'}, 'additional_modules': [], 'unet_storage_dtype': None}
[Unload] Trying to free all memory for cuda:0 with 0 models keep loaded ... Done.
StateDict Keys: {'unet': 1680, 'vae': 248, 'text_encoder': 196, 'text_encoder_2': 518, 'ignore': 0}
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
K-Model Created: {'storage_dtype': torch.float16, 'computation_dtype': torch.float16}
Model loaded in 2.3s (unload existing model: 0.2s, forge model load: 2.0s).
[Unload] Trying to free 3051.58 MB for cuda:0 with 0 models keep loaded ... Done.
[Memory Management] Target: JointTextEncoder, Free GPU: 13891.45 MB, Model Require: 1559.68 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: 11307.78 MB, All loaded to GPU.
Moving model(s) has taken 0.79 seconds
[Unload] Trying to free 1024.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 12126.80 MB ... Done.
[Unload] Trying to free 7656.40 MB for cuda:0 with 0 models keep loaded ... Current free memory is 12127.94 MB ... Done.
[Memory Management] Target: KModel, Free GPU: 12127.94 MB, Model Require: 4897.05 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: 6206.89 MB, All loaded to GPU.
Moving model(s) has taken 2.46 seconds
Compilation is in progress. Please wait...
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:18<00:00, 1.09it/s]
[Unload] Trying to free 4495.36 MB for cuda:0 with 0 models keep loaded ... Current free memory is 7036.98 MB ... Done.
[Memory Management] Target: IntegratedAutoencoderKL, Free GPU: 7028.98 MB, Model Require: 159.56 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: 5845.43 MB, All loaded to GPU.
Moving model(s) has taken 0.22 seconds
Total progress: 100%|██████████████████████████████████████████████████████████████████| 20/20 [00:24<00:00, 1.23s/it]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 20/20 [00:24<00:00, 1.12it/s]
Hey, you have Python 3.11 installed, but 3.10.11 64-bit is the recommended version.
Uninstall 3.11, install 3.10.11, then delete the venv folder and relaunch webui-user.bat.
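For reference, a minimal sketch of those steps from a Windows command prompt, assuming the default folder layout shown in the log above (the drive letter and folder name are taken from that log; adjust to your install):

```bat
:: From the webui folder, after uninstalling 3.11 and installing Python 3.10.11:
cd /d H:\stable-diffusion-webui-amdgpu-forge

:: Delete the old virtual environment so it is rebuilt against 3.10.11
rmdir /s /q venv

:: Relaunch; the venv (and Torch) will be reinstalled on first start
webui-user.bat
```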
Also make sure that Wallpaper Engine is disabled if you have it, and enable "Never OOM" ("For Tiled VAE only") at the bottom of txt2img when using hires fix.
You can also add --cuda-stream --attention-quad to the commandline_args to get better performance.
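Putting that together, a sketch of what webui-user.bat could look like. The first two flags are the ones already used in the launch log above; the last two are the suggested additions (this is an illustrative fragment, not a verified config for this setup):

```bat
@echo off

set PYTHON=
set GIT=
set VENV_DIR=
:: --skip-ort --use-zluda are from the original launch; the rest are the suggested additions
set COMMANDLINE_ARGS=--skip-ort --use-zluda --cuda-stream --attention-quad

call webui.bat
```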
Hi, could this be related to the recent issues on latest? Normal generation works fine with batches of 4, but hires fix fails with an out-of-memory error saying PyTorch tried to allocate ~32 GB of VRAM. In the past I could easily and quickly do 4x batches with hires fix and ADetailer; now it's completely broken. Please help!
https://github.com/lshqqytiger/stable-diffusion-webui-amdgpu-forge/issues/105