ComfyUI
Remove AMD workarounds as they do more harm than good on recent ROCm releases.
Tested on gfx1201, ROCm 6.4.1. It fixes the VAE Decode issues, and performance is generally better with PyTorch flash attention. bfloat16 is working fine.
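For anyone reproducing this, a quick sanity check (just a sketch, not part of the PR, assuming a recent PyTorch 2.x ROCm build) to confirm which scaled-dot-product attention backends the build reports as enabled:

import torch

print("HIP runtime:", torch.version.hip)  # version string on ROCm builds, None otherwise
print("flash SDP:", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient SDP:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math SDP:", torch.backends.cuda.math_sdp_enabled())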
pytorch rocm 6.4 is still nightly so I can't merge some of these changes as is or it's going to break people using the stable pytorch which is on 6.3
Ok, fair, it can wait. I was using pytorch wheels from AMD https://repo.radeon.com/rocm/manylinux/rocm-rel-6.4.1/ to validate those changes.
pytorch rocm 6.4 is still nightly so I can't merge some of these changes as is or it's going to break people using the stable pytorch which is on 6.3
I've rebased. I think we should merge this. We could add some version checks, but it's not clear to me on which versions it is broken. (Also, upstream ROCm 6.4 will probably land in the next pytorch release, which can take months.) The main issue currently is that those is_amd checks are unconditional and cannot be controlled by the user. After these changes, users can still force fp32 VAE if they want.
pytorch rocm 6.4 is still nightly so I can't merge some of these changes as is or it's going to break people using the stable pytorch which is on 6.3
I've rebased. I think we should merge this. We could add some version checks, but it's not clear to me on which versions it is broken. (Also, upstream ROCm 6.4 will probably land in the next pytorch release, which can take months.) The main issue currently is that those is_amd checks are unconditional and cannot be controlled by the user. After these changes, users can still force fp32 VAE if they want.
IMO, it is much safer if we have a flag to enable the new is_amd checking logic until we actually benchmark and see that there are no performance regressions on older AMD cards.
pytorch rocm 6.4 is still nightly so I can't merge some of these changes as is or it's going to break people using the stable pytorch which is on 6.3
I've rebased. I think we should merge this. We could add some version checks, but it's not clear to me on which versions it is broken. (Also, upstream ROCm 6.4 will probably land in the next pytorch release, which can take months.) The main issue currently is that those is_amd checks are unconditional and cannot be controlled by the user. After these changes, users can still force fp32 VAE if they want.
IMO, it is much safer if we have a flag to enable the new is_amd checking logic until we actually benchmark and see that there are no performance regressions on older AMD cards.
It's already possible to force the VAE type; there is no need for hardcoded is_amd cases. Also, it was reported to cause OOMs on both new and old GPUs.
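To make the trade-off concrete, here is a minimal sketch (illustrative names only, not ComfyUI's actual code) of the pattern under discussion: instead of an unconditional vendor check forcing fp32, the default path picks the faster dtype and the user keeps explicit overrides.

import torch

def is_amd():
    # True on ROCm (HIP) builds of PyTorch.
    return torch.version.hip is not None

def vae_dtype(force_fp32=False, force_bf16=False):
    # Explicit user overrides always win.
    if force_fp32:
        return torch.float32
    if force_bf16:
        return torch.bfloat16
    # Previously: "if is_amd(): return torch.float32" with no way to opt out.
    # Here the default simply picks the faster dtype regardless of vendor.
    return torch.bfloat16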
@kasper93 sorry to bother you. I tried your fork with this commit and the massive memory usage during VAE Decode persists. Also, adding the following flag: --use-pytorch-cross-attention increases memory usage a lot during the KSampler. I use pytorch nightly, ROCm 6.4.1, and an RX 9070 XT in the official AlmaLinux container.
Well, yes. It still uses a lot of memory, but it's less than before, and previously the fallback to tiled VAE would take ages because of conversions; this seems to work more snappily too.
--use-pytorch-cross-attention increases memory usage a lot during the KSampler.
Maybe, but it's a lot faster.
The main goal of this PR is to remove the hardcoded conditions and make it possible to evaluate other components in the pipeline.
Thanks for the fast reply and the explanations. I forgot to mention that I haven't noticed any problems using BF16 so far, if that helps.
Rebased. Pytorch 2.8.0 with support for ROCm 6.4 has been released too.
pytorch rocm 6.4 is still nightly so I can't merge some of these changes as is or it's going to break people using the stable pytorch which is on 6.3
@comfyanonymous PyTorch ROCm 6.4 has a stable release now. Can we merge? The VAE crash is really uncomfortable.
I have experienced the VAE issue on gfx1151 on every ROCm version so far, up to and including ROCm 7.0.
The best workaround that has worked reliably for me has been described here: https://github.com/ROCm/ROCm/issues/4729#issuecomment-2888209739
In short:
export MIOPEN_FIND_MODE=2
What this option does is documented here: https://rocm.docs.amd.com/projects/MIOpen/en/latest/how-to/find-and-immediate.html#find-modes
Basically, MIOpen by default trials multiple kernels for best performance, which involves compiling and benchmarking to find the kernel best suited for the pending workload. In fast mode it skips that (very expensive) search and falls back to the naive, non-optimized implementation, i.e. just do it, which is actually totally fine for VAE. Imagine doing many minutes of work to figure out how best to approach a fraction-of-a-second workload.
VAE Decode has been near-instant for SDXL, with no noticeable delay so far and no blackouts or OOMs.
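If you'd rather not touch the shell environment, the same workaround can be applied from a Python launcher; this is just a sketch, the only requirement being that the variable is set before MIOpen runs its first find (so before importing torch):

import os
os.environ.setdefault("MIOPEN_FIND_MODE", "2")  # 2 = FAST find mode, per the MIOpen docs

import torch  # imported after the env var on purpose, so MIOpen picks it up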
Rebased after recent changes; some changes were already upstreamed.