stable-diffusion-webui-amdgpu-forge RX 6600M (gfx1032) on ROCm 6.2 + ZLUDA: Hangs at "Compilation is in progress", computation_dtype stuck at float16

Hello, I'm having trouble getting image generation to work on my AMD Radeon RX 6600M (gfx1032) with the latest amdgpu-forge. I'm hoping someone can offer some insights or solutions.

My Environment:

PC Model: Minisforum HX99G
CPU: Ryzen 9 6900HX (8-core/16-thread)
RAM: 32GB DDR5-4800
GPU: AMD Radeon RX 6600M (gfx1032)
OS: Windows 11
ROCm/HIP SDK: 6.2 (with brknsoul/ROCmLibs for gfx1032 rocm6.2 and HIP-SDK-extension.zip applied)
ZLUDA: Latest nightly from lshqqytiger/ZLUDA for ROCm 6.x (placed in the Forge .zluda folder)
Python: 3.10.6
amdgpu-forge version: Commit hash e07be6a48fc0ae1840b78d5e55ee36ab78396b30
PyTorch versions tried: 2.3.0+cu118 (currently installed), also tried 2.6.0+cu118, 2.2.1+cu118, 2.1.2+cu118.
initialize.py modification: Added os.environ['CUDA_LAUNCH_BLOCKING'] = '1' and torch.backends.cudnn.enabled = False after import torch.
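
For reference, the initialize.py modification can be sketched roughly as below. This is a minimal illustration, not the actual contents of Forge's initialize.py: the function name apply_debug_settings is hypothetical, and the environment variable is set before the import here (the post above sets it after) on the assumption that the runtime reads it when the device is first initialized. Whether ZLUDA honours CUDA_LAUNCH_BLOCKING at all is not confirmed in this thread.

```python
import os


def apply_debug_settings():
    """Hypothetical helper mirroring the debug tweaks described in the post."""
    # Ask the runtime to run kernels synchronously so errors surface at the
    # call that caused them. Set before importing torch to be conservative.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    try:
        import torch
    except (ImportError, OSError):
        # torch is not installed (or failed to load); nothing more to do.
        return None

    # Disable cuDNN, as in the post.
    torch.backends.cudnn.enabled = False
    return torch.backends.cudnn.enabled


result = apply_debug_settings()
```

If torch is importable, result is False (cuDNN disabled); otherwise it is None and only the environment variable has been set.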

Problem Description:

The WebUI launches successfully, the GPU is recognized (e.g., Device: cuda:0 AMD Radeon RX 6600M [ZLUDA]), and models (like v1-5-pruned-emaonly.safetensors and others) load without error. However, when I try to generate an image (even with simple prompts and default settings):

The progress bar appears at 0%.
The console shows "Compilation is in progress. Please wait..." (this message sometimes appears after the 0% progress bar).
The process then hangs indefinitely. No image is generated.
Task Manager shows some CPU usage (~10%), minimal GPU compute usage, but GPU memory is significantly utilized (as expected from model load).
The "Interrupt" button in the UI becomes unresponsive, requiring a manual shutdown of webui-user.bat.
No specific CUDA error messages appear in the console, even with CUDA_LAUNCH_BLOCKING=1.

Key Observation - computation_dtype:

A consistent observation in the console log when loading a model is:
K-Model Created: {'storage_dtype': torch.float16, 'computation_dtype': torch.float16}
This happens even when I use the COMMANDLINE_ARGS=--use-zluda --no-half --precision full. It seems the --no-half --precision full flags are not forcing the K-Sampler's computation_dtype to torch.float32. I suspect this might be related to the hang during compilation/execution on my AMD GPU.

COMMANDLINE_ARGS tried:

I've tried various combinations, including:
--use-zluda --no-half --precision full --skip-version-check
--use-zluda --no-half --precision full --opt-sub-quad-attention --disable-nan-check --upcast-sampling --skip-version-check
And others, such as adding --lowvram (though Forge says memory management is automatic now).

Questions:

Is there a known issue with computation_dtype remaining float16 for K-diffusion samplers on AMD/ZLUDA in Forge, despite using --no-half --precision full?
Is there any other way (e.g., different command-line argument, config file setting, or code modification) to reliably force computation_dtype to torch.float32 for the K-Model in this setup?
Are there any other known workarounds or troubleshooting steps for this "Compilation is in progress..." hang on RX 6600M (gfx1032) with ROCm 6.2 and ZLUDA?

Any help or suggestions would be greatly appreciated. I'd really like to get Forge working on my setup.

Thank you!

May 26 '25 04:05 pontalos

If you are using Process Lasso or a similar program, it's possible that the compilation process is restricted to a single core of your processor, in which case it will take practically forever. Check in Task Manager whether all cores are being utilized and whether there is any activity on your processor.
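
Besides Task Manager, the affinity of a process can also be checked from Python. The sketch below is only an illustration: cpu_affinity_report is a hypothetical helper name, and it assumes the third-party psutil package is installed (it falls back to os.sched_getaffinity where that exists, and reports None otherwise). Run it from the same environment that launches the WebUI, or pass the PID of the process you want to inspect.

```python
import os


def cpu_affinity_report(pid=None):
    """Report how many logical CPUs exist and how many the process may use."""
    total = os.cpu_count() or 1
    allowed = None
    try:
        import psutil  # third-party; assumed installed (pip install psutil)

        # pid=None means the current process.
        allowed = len(psutil.Process(pid).cpu_affinity())
    except (ImportError, AttributeError):
        # psutil is missing, or this platform has no cpu_affinity support.
        if hasattr(os, "sched_getaffinity"):
            allowed = len(os.sched_getaffinity(pid or 0))
    return {
        "total": total,
        "allowed": allowed,
        # True only when we know the process is pinned to fewer cores.
        "restricted": allowed is not None and allowed < total,
    }


report = cpu_affinity_report()
print(report)
```

If "restricted" comes back True and "allowed" is 1, the process is pinned to a single core, which would match the scenario described above.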

May 28 '25 15:05 TheFerumn

I'm sorry for my late reply, and thank you for your response.
When I left it alone for a while, the image was generated, and after that, everything started working properly.
It seems that when using ZLUDA, there is a loading time of several tens of minutes during the initial startup and the first image generation.
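
For anyone who wants to tell a slow first-run compilation apart from a real hang, one possible approach (an untested assumption, not something verified in this thread) is to watch whether the compilation cache directory keeps growing while "Compilation is in progress" is shown. The sketch below uses hypothetical helper names, and the cache path is deliberately left as a parameter because its location is not given here. Growth is only suggestive of progress, not proof.

```python
import os
import time


def dir_size_bytes(path):
    """Sum the sizes of all files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file vanished or is locked; skip it
    return total


def is_growing(path, interval_s=30.0, sleep=time.sleep):
    """Return True if the directory got larger during one polling interval."""
    before = dir_size_bytes(path)
    sleep(interval_s)
    after = dir_size_bytes(path)
    return after > before
```

Calling is_growing(cache_dir) a few times during the first generation, with cache_dir pointing at wherever the cache lives on your system, gives a rough signal: repeated True results suggest the compiler is still working and waiting is reasonable.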
I will close this thread now.

Jun 04 '25 01:06 pontalos
