ComfyUI Possible way to mitigate hard crashes on AMD gpus when using ROCm backend

Open Th3Rom3 opened this issue 1 year ago • 6 comments

Hi everyone, I stumbled across a possible way to mitigate the random hard crashes of my RX 6800 AMD GPU when running more complex workflows in ComfyUI.

Hardware:

Ryzen 7 5800X3D
32GB DDR4 3200 MT/s RAM
RX 6800 (non-XT) 16GB VRAM (Navi21, gfx1030)

Software:

Nobara 39 @ Kernel 6.6.9
ROCm 5.6
Python 3.11.5 (separate venv managed via conda)
pytorch 2.1.2+rocm5.6
ComfyUI Revision: 1869 [66831eb6]

ComfyUI used to run fine with SD 1.5 and SDXL models. Even larger image dimensions and bigger batch sizes were no problem as long as they fit into the 16GB VRAM.

During more complex workflows, however, I would experience random hard crashes of the GPU that forced a system reboot. The crashes happened especially in subsequent VAE encoding/decoding steps or in additional processing nodes like face detailer or upscaling. I was seemingly able to reduce how often they occurred by using --disable-smart-memory, but ultimately it still kept crashing every other prompt run (some runs were fine, others crashed at random steps in the workflow). I was able to completely circumvent the crashes by splitting long workflows into separate segments using node bypassing and running the segments one after another.

Then I recently came across an old Reddit post where an RX 6800 XT user experienced similar crashes when running Automatic1111: https://www.reddit.com/r/StableDiffusion/comments/12faj1y/amd_gpu_forced_to_reboot_on_linuxauto1111/

That user attributed the issues to how the ROCm backend handles garbage collection in PyTorch. I then adopted the same environment variable setting mentioned in that thread: PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:6144
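For anyone who prefers a small launcher script over typing the variable each time, here is a minimal sketch of how the setting could be applied. The helper names (build_alloc_conf, comfyui_env, launch_comfyui) are made up for illustration and are not part of ComfyUI or PyTorch; the only things taken from this post are the variable name, the two values and the main.py entry point. The range checks are loose sanity checks on my part, not PyTorch's own validation.

```python
import os
import subprocess
import sys


def build_alloc_conf(gc_threshold: float, max_split_size_mb: int) -> str:
    """Format the two allocator options as a comma-separated key:value string."""
    # Loose sanity checks (assumption): a fraction strictly between 0 and 1,
    # and a positive split size in megabytes.
    if not 0.0 < gc_threshold < 1.0:
        raise ValueError("gc_threshold must be between 0 and 1")
    if max_split_size_mb <= 0:
        raise ValueError("max_split_size_mb must be positive")
    return (
        f"garbage_collection_threshold:{gc_threshold},"
        f"max_split_size_mb:{max_split_size_mb}"
    )


def comfyui_env(gc_threshold: float = 0.6, max_split_size_mb: int = 6144) -> dict:
    """Return a copy of the current environment with the allocator config added."""
    env = os.environ.copy()
    env["PYTORCH_HIP_ALLOC_CONF"] = build_alloc_conf(gc_threshold, max_split_size_mb)
    return env


def launch_comfyui(main_script: str = "main.py") -> int:
    """Start ComfyUI in a child process so the variable is set before torch loads."""
    completed = subprocess.run([sys.executable, main_script], env=comfyui_env())
    return completed.returncode
```

Calling launch_comfyui() from the ComfyUI directory should be equivalent to prefixing the command with the variable in the shell. Setting it in a child process environment avoids having to worry about whether torch was already imported.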

Since then I have not experienced any hard crashes when rerunning the same workflows that used to crash randomly. I have no deeper knowledge of the ROCm backend settings and simply adopted the values from the Reddit thread without tuning them for my 16GB VRAM card, but they alleviated most if not all of my hard crash issues in ComfyUI.

TL;DR: Running ComfyUI on ROCm with additional garbage collection parameters, i.e. launching it as PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:6144 python main.py, fixed the hard crashes on my 16GB RX 6800 in most if not all situations.

The steps above might also help with other AMD cards, but I assume the garbage collection parameters will have to be adjusted to the card's VRAM size.
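As a starting point for that kind of adjustment, one could scale max_split_size_mb in proportion to VRAM, using my 16GB card and the 6144 value as the reference. To be clear, this proportional rule is purely an assumption for illustration. Neither I nor the Reddit thread have tested it on other cards, and the function name is made up.

```python
# Reference point taken from this post: 6144 MB split size on a 16GB card.
REFERENCE_VRAM_MB = 16 * 1024
REFERENCE_MAX_SPLIT_MB = 6144


def scaled_max_split_size_mb(vram_mb: int) -> int:
    """Scale the split size linearly with VRAM (untested heuristic, an assumption)."""
    if vram_mb <= 0:
        raise ValueError("vram_mb must be positive")
    # Integer arithmetic keeps the result a whole number of megabytes.
    return max(1, vram_mb * REFERENCE_MAX_SPLIT_MB // REFERENCE_VRAM_MB)


def alloc_conf_for_vram(vram_mb: int, gc_threshold: float = 0.6) -> str:
    """Build the PYTORCH_HIP_ALLOC_CONF value for a card with the given VRAM."""
    return (
        f"garbage_collection_threshold:{gc_threshold},"
        f"max_split_size_mb:{scaled_max_split_size_mb(vram_mb)}"
    )


# Example: a hypothetical 8GB card.
print(alloc_conf_for_vram(8 * 1024))
```

Treat the result as a first guess to experiment from, not as a recommended value. The garbage collection threshold is left at 0.6 here because it is already a fraction of the available memory.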

If this fix ends up helping other AMD users, it might be worth mentioning as a possible troubleshooting step in the AMD section of the installation guide.

If anyone more knowledgeable about the ROCm backend can provide additional insight into these settings, it would be appreciated and would help make the fix more universal.

If there are any outstanding questions I will try to provide as much information as possible.

Jan 06 '24 11:01 Th3Rom3
