amdgpu-forge: Memory leaks and horribly slow performance on latest version (HIP SDK 6.2, AMD 7900 XTX)
Hi, I used to be able to do batches of 4 images with hires fix and ADetailer, with no problems, in only a few minutes, using the older version of amdgpu-forge with HIP SDK 5.7.
Since upgrading to the latest version of amdgpu-forge (along with the HIP SDK, to 6.2), generation is horribly slow even for a single image.
- The initial generation and the "tiled upscale" steps run fairly fast (21 seconds in this example), but the step after the tiled upscale takes forever (see output).
- When I try a batch of 4, it fails every time with an out-of-memory exception.
[Troubleshooting]
- I tried reverting to HIP SDK 5.7, but amdgpu-forge no longer works with that version (it now requires amdhip64.dll to be located in the SDK folder, etc.).
- I tried reverting to an earlier version of amdgpu-forge along with HIP SDK 5.7, but PyTorch started breaking (maybe the latest AMD drivers also need to be reverted? Who knows.)
- This is a fully supported AMD GPU, a 7900 XTX with 24 GB of VRAM; these issues should not be happening.
- Basically, trying to regress just opens up a hell hole of broken dependencies; reverting is not really an option anymore, so the only way is forward.
Can you please fix these out-of-memory issues and restore the performance you had when you supported HIP 5.7? Otherwise, what's the point of all these updates?
Please let me know if you have other workarounds. Thanks again!
[Output example for single generation:]
(As you can see, txt2img took 16 seconds, ADetailer took 6 seconds, and the tiled upscale took 21 seconds, but the next step after that took about four and a half minutes.)
28/28 [00:16<00:00, 1.75it/s]
[Unload] Trying to free 4202.86 MB for cuda:0 with 1 models keep loaded ... Current free memory is 15746.38 MB ... Done.
[Unload] Trying to free 1024.00 MB for cuda:0 with 0 models keep loaded ... Current free memory is 15746.62 MB ... Done.
Cleanup minimal inference memory.
tiled upscale: 100%|███████████████████████████████████████████████████████████████| 35/35 [00:21<00:00, 1.62it/s]
[Unload] Trying to free 7671.96 MB for cuda:0 with 1 models keep loaded ... Current free memory is 15728.52 MB ... Done.
[Unload] Trying to free 2845.44 MB for cuda:0 with 1 models keep loaded ... Current free memory is 15719.75 MB ... Done.
100%|██████████████████████████████████████████████████████████████████████████████| 15/15 [04:31<00:00, 18.07s/it]
[Unload] Trying to free 9456.43 MB for cuda:0 with 1 models keep loaded ... Current free memory is 15718.39 MB ... Done.
0: 640x448 1 face, 8.9ms
Speed: 9.2ms preprocess, 8.9ms inference, 2.0ms postprocess per image at shape (1, 3, 640, 448)
[Unload] Trying to free 3409.76 MB for cuda:0 with 1 models keep loaded ... Current free memory is 15723.54 MB ... Done.
[Unload] Trying to free 1024.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 15746.76 MB ... Done.
[Unload] Trying to free 1024.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 15758.14 MB ... Done.
[Unload] Trying to free 1264.64 MB for cuda:0 with 1 models keep loaded ... Current free memory is 15758.02 MB ... Done.
100%|██████████████████████████████████████████████████████████████████████████████| 12/12 [00:06<00:00, 1.72it/s]
[Output example for batch of 4:]
[Unload] Trying to free 1024.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 15747.60 MB ... Done.
[Unload] Trying to free 1024.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 15756.99 MB ... Done.
[Unload] Trying to free 1024.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 15757.10 MB ... Done.
[Unload] Trying to free 1024.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 15756.49 MB ... Done.
[Unload] Trying to free 5058.56 MB for cuda:0 with 1 models keep loaded ... Current free memory is 15753.64 MB ... Done.
100%|██████████████████████████████████████████████████████████████████████████████| 28/28 [01:03<00:00, 2.28s/it]
[Unload] Trying to free 4202.86 MB for cuda:0 with 1 models keep loaded ... Current free memory is 15743.71 MB ... Done.
[Unload] Trying to free 1024.00 MB for cuda:0 with 0 models keep loaded ... Current free memory is 15744.67 MB ... Done.
Cleanup minimal inference memory.
tiled upscale: 100%|███████████████████████████████████████████████████████████████| 35/35 [00:21<00:00, 1.63it/s]
[Unload] Trying to free 1024.00 MB for cuda:0 with 0 models keep loaded ... Current free memory is 15778.67 MB ... Done.
Cleanup minimal inference memory.
tiled upscale: 100%|███████████████████████████████████████████████████████████████| 35/35 [00:21<00:00, 1.64it/s]
[Unload] Trying to free 1024.00 MB for cuda:0 with 0 models keep loaded ... Current free memory is 15778.67 MB ... Done.
Cleanup minimal inference memory.
tiled upscale: 100%|███████████████████████████████████████████████████████████████| 35/35 [00:21<00:00, 1.63it/s]
[Unload] Trying to free 1024.00 MB for cuda:0 with 0 models keep loaded ... Current free memory is 15778.67 MB ... Done.
Cleanup minimal inference memory.
tiled upscale: 100%|███████████████████████████████████████████████████████████████| 35/35 [00:21<00:00, 1.64it/s]
[Unload] Trying to free 7671.96 MB for cuda:0 with 1 models keep loaded ... Current free memory is 15609.34 MB ... Done.
[Unload] Trying to free 7671.96 MB for cuda:0 with 1 models keep loaded ... Current free memory is 15609.07 MB ... Done.
Memory cleanup has taken 0.15 seconds
[Unload] Trying to free 7671.96 MB for cuda:0 with 1 models keep loaded ... Current free memory is 15572.80 MB ... Done.
Memory cleanup has taken 0.12 seconds
[Unload] Trying to free 7671.96 MB for cuda:0 with 1 models keep loaded ... Current free memory is 15572.53 MB ... Done.
Memory cleanup has taken 0.13 seconds
[Unload] Trying to free 11381.76 MB for cuda:0 with 1 models keep loaded ... Current free memory is 15634.72 MB ... Done.
0%| | 0/15 [00:02<?, ?it/s]
Traceback (most recent call last):
File "F:\stable-diffusion-webui-amdgpu-forge\modules_forge\main_thread.py", line 30, in work
self.result = self.func(*self.args, **self.kwargs)
File "F:\stable-diffusion-webui-amdgpu-forge\modules\txt2img.py", line 131, in txt2img_function
processed = processing.process_images(p)
File "F:\stable-diffusion-webui-amdgpu-forge\modules\processing.py", line 843, in process_images
res = process_images_inner(p)
File "F:\stable-diffusion-webui-amdgpu-forge\modules\processing.py", line 1083, in process_images_inner
samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)
File "F:\stable-diffusion-webui-amdgpu-forge\modules\processing.py", line 1521, in sample
return self.sample_hr_pass(samples, decoded_samples, seeds, subseeds, subseed_strength, prompts)
File "F:\stable-diffusion-webui-amdgpu-forge\modules\processing.py", line 1625, in sample_hr_pass
samples = self.sampler.sample_img2img(self, samples, noise, self.hr_c, self.hr_uc, steps=self.hr_second_pass_steps or self.steps, image_conditioning=image_conditioning)
File "F:\stable-diffusion-webui-amdgpu-forge\modules\sd_samplers_kdiffusion.py", line 190, in sample_img2img
samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
File "F:\stable-diffusion-webui-amdgpu-forge\modules\sd_samplers_common.py", line 281, in launch_sampling
return func()
File "F:\stable-diffusion-webui-amdgpu-forge\modules\sd_samplers_kdiffusion.py", line 190, in <lambda>
samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
File "F:\stable-diffusion-webui-amdgpu-forge\venv\lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "F:\stable-diffusion-webui-amdgpu-forge\k_diffusion\sampling.py", line 149, in sample_euler_ancestral
denoised = model(x, sigmas[i] * s_in, **extra_args)
File "F:\stable-diffusion-webui-amdgpu-forge\venv\lib\site-packages\torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "F:\stable-diffusion-webui-amdgpu-forge\venv\lib\site-packages\torch\nn\modules\module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "F:\stable-diffusion-webui-amdgpu-forge\modules\sd_samplers_cfg_denoiser.py", line 199, in forward
denoised, cond_pred, uncond_pred = sampling_function(self, denoiser_params=denoiser_params, cond_scale=cond_scale, cond_composition=cond_composition)
File "F:\stable-diffusion-webui-amdgpu-forge\backend\sampling\sampling_function.py", line 362, in sampling_function
denoised, cond_pred, uncond_pred = sampling_function_inner(model, x, timestep, uncond, cond, cond_scale, model_options, seed, return_full=True)
File "F:\stable-diffusion-webui-amdgpu-forge\backend\sampling\sampling_function.py", line 303, in sampling_function_inner
cond_pred, uncond_pred = calc_cond_uncond_batch(model, cond, uncond_, x, timestep, model_options)
File "F:\stable-diffusion-webui-amdgpu-forge\backend\sampling\sampling_function.py", line 273, in calc_cond_uncond_batch
output = model.apply_model(input_x, timestep_, **c).chunk(batch_chunks)
File "F:\stable-diffusion-webui-amdgpu-forge\backend\modules\k_model.py", line 45, in apply_model
model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds).float()
File "F:\stable-diffusion-webui-amdgpu-forge\venv\lib\site-packages\torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "F:\stable-diffusion-webui-amdgpu-forge\venv\lib\site-packages\torch\nn\modules\module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "F:\stable-diffusion-webui-amdgpu-forge\backend\nn\unet.py", line 713, in forward
h = module(h, emb, context, transformer_options)
File "F:\stable-diffusion-webui-amdgpu-forge\venv\lib\site-packages\torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "F:\stable-diffusion-webui-amdgpu-forge\venv\lib\site-packages\torch\nn\modules\module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "F:\stable-diffusion-webui-amdgpu-forge\backend\nn\unet.py", line 83, in forward
x = layer(x, context, transformer_options)
File "F:\stable-diffusion-webui-amdgpu-forge\venv\lib\site-packages\torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "F:\stable-diffusion-webui-amdgpu-forge\venv\lib\site-packages\torch\nn\modules\module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "F:\stable-diffusion-webui-amdgpu-forge\backend\nn\unet.py", line 321, in forward
x = block(x, context=context[i], transformer_options=transformer_options)
File "F:\stable-diffusion-webui-amdgpu-forge\venv\lib\site-packages\torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "F:\stable-diffusion-webui-amdgpu-forge\venv\lib\site-packages\torch\nn\modules\module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "F:\stable-diffusion-webui-amdgpu-forge\backend\nn\unet.py", line 181, in forward
return checkpoint(self._forward, (x, context, transformer_options), None, self.checkpoint)
File "F:\stable-diffusion-webui-amdgpu-forge\backend\nn\unet.py", line 12, in checkpoint
return f(*args)
File "F:\stable-diffusion-webui-amdgpu-forge\backend\nn\unet.py", line 235, in _forward
n = self.attn1(n, context=context_attn1, value=value_attn1, transformer_options=extra_options)
File "F:\stable-diffusion-webui-amdgpu-forge\venv\lib\site-packages\torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "F:\stable-diffusion-webui-amdgpu-forge\venv\lib\site-packages\torch\nn\modules\module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "F:\stable-diffusion-webui-amdgpu-forge\backend\nn\unet.py", line 154, in forward
out = attention_function(q, k, v, self.heads, mask)
File "F:\stable-diffusion-webui-amdgpu-forge\backend\attention.py", line 335, in attention_pytorch
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 23.56 GiB. GPU 0 has a total capacity of 23.98 GiB of which 0 bytes is free. Of the allocated memory 32.27 GiB is allocated by PyTorch, and 463.83 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
CUDA out of memory. Tried to allocate 23.56 GiB. GPU 0 has a total capacity of 23.98 GiB of which 0 bytes is free. Of the allocated memory 32.27 GiB is allocated by PyTorch, and 463.83 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
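The allocation that fails is the attention-score buffer inside `scaled_dot_product_attention` (last frame of the traceback), which grows linearly with batch size and quadratically with the latent token count, so a hires pass at batch size 4 can blow past 24 GB even when batch size 1 fits. A rough back-of-the-envelope sketch (the head count and token count below are hypothetical illustrations, not values from this run, and real attention backends may chunk this buffer):

```python
def attn_scores_bytes(batch: int, heads: int, tokens: int, dtype_bytes: int = 2) -> int:
    """Bytes for a naive (batch, heads, tokens, tokens) fp16 attention-score matrix."""
    return batch * heads * tokens * tokens * dtype_bytes

GIB = 1024 ** 3

# Hypothetical hires-fix latent: doubling image resolution quadruples the token count.
base_tokens = 4096            # e.g. a 64x64 latent grid
hires_tokens = base_tokens * 4

for b in (1, 4):
    gib = attn_scores_bytes(b, heads=8, tokens=hires_tokens) / GIB
    print(f"batch {b}: ~{gib:.1f} GiB just for raw attention scores")
```

Linear in batch, quadratic in tokens: batch size 4 needs 4x the scores memory of batch size 1, which is consistent with a single image squeaking by while a batch of 4 hits OOM.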
Hey, you didn't show which settings you used.
(Resolution, steps, hires-fix settings, checkpoint, sampler, etc.)
Also, did you use Batch Size or Batch Count? There is a massive difference between them and their VRAM usage.
Also, you have to use these launch args: --use-zluda --cuda-stream --attention-quad --skip-ort, and not --zluda.
After changing that, delete the venv folder and relaunch webui-user.bat.
If you get a numpy lib import error while it's installing, just relaunch again; it will then proceed and open up.
Also make sure Wallpaper Engine isn't running, as it causes massive VRAM issues when running together with the WebUI, even if Wallpaper Engine is paused in the background. It also leads to stutters and PC freezes. Since I have the same GPU, I would recommend downgrading to the Adrenalin 25.3.1 driver. It's currently the most stable version for 7000-series cards.
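For anyone following along, putting those flags into webui-user.bat would look roughly like this (the PYTORCH_CUDA_ALLOC_CONF line is an extra suggestion taken from the OOM message itself, not part of the recommended flags, so treat it as optional):

```bat
@echo off
REM Recommended launch args for AMD cards on Forge with ZLUDA (per the comment above)
set COMMANDLINE_ARGS=--use-zluda --cuda-stream --attention-quad --skip-ort --theme dark

REM Optional: the OOM error itself suggests this allocator setting to reduce fragmentation
set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

call webui.bat
```

After editing, delete the venv folder and relaunch webui-user.bat as described above so the dependencies reinstall against the new flags.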
Oh interesting, I had no idea I needed those settings. Is that a recent change? Because in the old version that worked fine, I used just --zluda and --theme dark; that's all.
Yes, it is batch size, not count, i.e. four images in parallel. And yes, I would run something like 32 consecutive batches of 4 images each with hires fix and ADetailer, and it would churn through them no problem.
Now, as I said, it cannot get through even a single batch of 4 without an OOM, and it's horribly slow even with a batch size of 1. Will let you know when I try those new settings. Thank you, kind sir!
Also, I do not own Wallpaper Engine; that's the app with the animated desktop backgrounds you get on Steam, right? I never installed or owned it.
Will report back soon, thanks so much!
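Since the batch size vs. batch count distinction keeps coming up: batch size multiplies peak VRAM (one pass over an N-image latent), while batch count only multiplies wall-clock time (N sequential single-image passes). A minimal sketch of that cost model (the latent shape below is illustrative, not Forge's actual internals):

```python
def peak_latent_elems(batch_size: int, channels: int = 4, h: int = 128, w: int = 128) -> int:
    """Elements resident in VRAM during one denoising pass: a (batch, C, H, W) latent."""
    return batch_size * channels * h * w

# Batch SIZE 4: one pass over a 4-image latent -> 4x the peak memory.
parallel_peak = peak_latent_elems(batch_size=4)

# Batch COUNT 4: four sequential passes over 1-image latents -> same peak as one image.
sequential_peak = max(peak_latent_elems(batch_size=1) for _ in range(4))

print(parallel_peak // sequential_peak)  # peak-memory ratio between the two modes
```

This is why "32 consecutive batches" (batch count) is gentle on VRAM while "4 images in parallel" (batch size) is the expensive one.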
Hey, your recommended args WORKED, and I can finally run a batch of 4 with hires fix again. Thank you SO much!
Mind if I ask, how on earth did you know those args would fix it? Just trial and error?
FYI, here's the launch-args line that did NOT work, if you're curious:
@REM set COMMANDLINE_ARGS=--zluda --theme dark
Nice, glad it worked. I knew these args would work, as they are the ones recommended in my setup guide for Forge with ZLUDA. Testing and sharing information with other AMD and WebUI users since 2023 has given me good knowledge of this stuff. Don't forget to close this issue and the other one: https://github.com/lshqqytiger/stable-diffusion-webui-amdgpu-forge/issues/106