stable-diffusion-webui-amdgpu [Bug]: Total Progress is much faster than usual. Shows 10x speed, but it's not. AMD ZLUDA

Checklist

[x] The issue exists after disabling all extensions
[x] The issue exists on a clean installation of webui
[ ] The issue is caused by an extension, but I believe it is caused by a bug in the webui
[x] The issue exists in the current version of the webui
[x] The issue has not been reported before recently
[ ] The issue has been reported before but has not been fixed yet

What happened?

The progress display runs very fast, but generation really doesn't. It runs at the usual speed. For example, txt2img took 7 seconds on old commits. Now it shows 23 it/s instead of 2-3 it/s, but VAE decode takes the same time as before. I did a fresh install to check, and sampling appears to run 10x faster, but VAE decode takes the same time for some reason. A 1576x1576 image shows as done in 20 sec., but it stops at the last step while the VAE decodes. Also, Live Preview can't show anything due to the "speed".

Steps to reproduce the problem

Use any prompt and look at the console.

What should have happened?

Show actual speed.

What browsers do you use to access the UI ?

Firefox.

Sysinfo

sysinfo-2024-07-27-00-00.json

Console logs

No errors

Additional information

No response

Jul 27 '24 19:07 VeteranXT

I recently got it working with ZLUDA and have the same problem.

Jul 30 '24 08:07 nicodem09

Me too. The progress bar rushes to 100% in a couple of seconds. But after that it takes its normal time for decoding.

Jul 30 '24 10:07 Morpheus-79

I temporarily fixed it by enabling refiner or adetailer

Jul 30 '24 10:07 nicodem09

I already had adetailer enabled by default when the problem occurred.

Jul 30 '24 12:07 Morpheus-79

The problem persists in my case.

Jul 31 '24 23:07 nicodem09

I came here for this. I was shocked: 20+ it/s, but the VAE process takes 3 times as long as it used to.

Aug 01 '24 02:08 DJHanceNL

"I came here for this. I was shocked: 20+ it/s, but the VAE process takes 3 times as long as it used to."

The VAE is fine; it's a bug that shows an unreal progress speed.

Aug 01 '24 04:08 VeteranXT

Confirming the problem. Conducted a series of experiments.

I was on commit “371f53e...0bde866”. The preview window works without problems, and the speed shown in the console is fine.
I tried installing commit “61aa844...67fdead”, which comes before the upgrade to 1.10. It has an ONNX error, but that is solved by applying “--skip-ort”. The preview window works without problems, and the speed shown in the console is fine.
If I update to commit “67fdead...235a1ff” (version 1.10) or higher, the preview window breaks immediately, and the problems with speed and console display appear =(
Radeon RX 5500 XT 8GB, Windows 10, Python 3.10.11, HIP SDK 5.7.1 + ROCmLibs for old cards, ZLUDA

Aug 23 '24 13:08 Kargim

Since only some users are affected, it seems to be related to ZLUDA. I'm using a Ryzen 9 6900HX Rembrandt APU, Windows 11, Python 3.10.11, HIP SDK 6.1.2 + ROCmLibs for gfx1035 with ZLUDA.

Aug 26 '24 18:08 Morpheus-79

I'm using an older version of the HIP SDK. I never updated HIP/ROCmLibs.

Sep 13 '24 00:09 VeteranXT

With ControlNet enabled, the preview is okay.

Sep 14 '24 00:09 VeteranXT

Interestingly, I noticed that the UniPC scheduler does not show the issue. Certain extensions also make the reported timing realistic.

Sep 23 '24 20:09 ride5k

From the CPU's point of view, the GPU is an I/O device. When the CPU asks the GPU to execute something, the CPU does not actually know whether the requested task has finished yet. Therefore, the CPU has to synchronize with the GPU's state. However, it does not synchronize on every call, for performance reasons. That means the programmer has to synchronize in order to get proper results from the GPU. Fortunately, torch does this synchronization work for us. It synchronizes, for example, when we print a tensor, detach a tensor from the GPU, etc.

In your case, for some reason, the synchronization wasn't done successfully during generation (in each sampler step). However, to convert the final latent into an image, synchronization has to occur at the latest when the last tensor is detached. At that point the GPU has lots of tasks queued, but very few of them are done, so "the last" synchronization takes a really long time.

For now, the reason why synchronization fails is unknown. I haven't tried to find it yet. Maybe it is a bug in AMD Comgr or in ZLUDA itself. It seems to appear suddenly and disappear at any time. So I can't give you a reason or a clear solution at this moment.
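To illustrate the effect described above, here is a minimal sketch that uses only the Python standard library. It is an analogy, not webui or torch code: `FakeDevice`, `run_sampler`, and the 0.01 s task time are hypothetical names and values made up for this example. A worker thread stands in for the GPU, `launch()` returns immediately like an asynchronous kernel launch, and `synchronize()` blocks until the queue is empty.

```python
import queue
import threading
import time


class FakeDevice:
    """Hypothetical stand-in for a GPU: work is queued and run later by a worker thread."""

    def __init__(self, seconds_per_task):
        self.seconds_per_task = seconds_per_task
        self._tasks = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            self._tasks.get()
            time.sleep(self.seconds_per_task)  # pretend to compute one step
            self._tasks.task_done()

    def launch(self):
        """Enqueue one task and return immediately, like an asynchronous launch."""
        self._tasks.put(None)

    def synchronize(self):
        """Block until every queued task has finished."""
        self._tasks.join()


def run_sampler(steps, sync_each_step, seconds_per_task=0.01):
    """Return (time spent in the step loop, time spent in the final wait)."""
    device = FakeDevice(seconds_per_task)

    start = time.perf_counter()
    for _ in range(steps):
        device.launch()
        if sync_each_step:
            device.synchronize()
    loop_seconds = time.perf_counter() - start

    # Reading the final result back forces a synchronization in either case.
    start = time.perf_counter()
    device.synchronize()
    final_wait_seconds = time.perf_counter() - start
    return loop_seconds, final_wait_seconds


if __name__ == "__main__":
    for sync in (True, False):
        loop_s, wait_s = run_sampler(steps=20, sync_each_step=sync)
        print(f"sync_each_step={sync}: loop {loop_s:.3f}s, final wait {wait_s:.3f}s")
```

Without per-step synchronization the loop finishes almost instantly, so a progress bar driven by it would report an inflated it/s, and the real work shows up as one long wait at the end. In PyTorch the explicit call for this is `torch.cuda.synchronize()`; whether adding it avoids the problem under ZLUDA is not established in this thread.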

Oct 25 '24 04:10 lshqqytiger

Thanks for the explanation.

Oct 25 '24 08:10 VeteranXT

I think it is a VAE problem with ZLUDA or AMD ROCm. I use ComfyUI. After I changed from DirectML to ZLUDA, the sampler works 2-3x faster, but the VAE is very slow. ComfyUI can show the time used by each node. I also added some debug info to the vae.decode function in sd.py and found that the slow part is the VAE stage, not the sampler.

RT: VAE decode memory_used=3853910016 free_memory=4493349376 batch_number=1 vae_dtype=torch.float16

For a 768x1152 VAE decode, DirectML takes 0.5-1s, but ZLUDA takes 6-8s.

Then I ran more tests and found that ZLUDA VAE speed is strongly related to image resolution. Here are some ZLUDA VAE decode results:
512x512: 0.5-1s
768x768: 3-4s
960x960: 6-8s
1024x1024: 13-15s

DirectML takes about 1s, no matter how the resolution changes.
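As a rough check on how strongly the ZLUDA numbers depend on resolution, the sketch below turns the reported ranges into seconds per megapixel. Using the midpoint of each range is an assumption made for this example, and the names `ZLUDA_VAE_DECODE_SECONDS` and `seconds_per_megapixel` are hypothetical.

```python
# (width, height) -> (low, high) decode time in seconds, as reported above for ZLUDA.
ZLUDA_VAE_DECODE_SECONDS = {
    (512, 512): (0.5, 1.0),
    (768, 768): (3.0, 4.0),
    (960, 960): (6.0, 8.0),
    (1024, 1024): (13.0, 15.0),
}


def seconds_per_megapixel(measurements):
    """Return rows of (width, height, megapixels, seconds per megapixel)."""
    rows = []
    for (width, height), (low, high) in measurements.items():
        megapixels = width * height / 1_000_000
        midpoint = (low + high) / 2  # assumption: the midpoint represents the range
        rows.append((width, height, megapixels, midpoint / megapixels))
    return rows


if __name__ == "__main__":
    for width, height, megapixels, rate in seconds_per_megapixel(ZLUDA_VAE_DECODE_SECONDS):
        print(f"{width}x{height}: {megapixels:.2f} MP, {rate:.1f} s/MP")
```

The cost per megapixel rises at every step, so decode time grows faster than the pixel count: 1024x1024 has 4 times the pixels of 512x512 but takes far more than 4 times as long by these figures.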

Oct 31 '24 05:10 roytan883