
Remove AMD workarounds as they do more harm than good on recent ROCm releases.

Open kasper93 opened this issue 6 months ago • 8 comments

Tested on gfx1201, ROCm 6.4.1. It fixes the VAE Decode issues, and performance is generally better with pytorch flash attention. bfloat16 is working fine.
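
A rough way to exercise the same code paths (a sketch, assuming the stock --use-pytorch-cross-attention and --bf16-vae ComfyUI flags):

# use PyTorch SDPA / flash attention and run the VAE in bf16
python main.py --use-pytorch-cross-attention --bf16-vae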

kasper93 avatar May 26 '25 12:05 kasper93

pytorch rocm 6.4 is still nightly, so I can't merge some of these changes as is, or it's going to break people using the stable pytorch, which is on 6.3

comfyanonymous avatar May 26 '25 18:05 comfyanonymous

Ok, fair, it can wait. I was using pytorch wheels from AMD https://repo.radeon.com/rocm/manylinux/rocm-rel-6.4.1/ to validate those changes.

kasper93 avatar May 26 '25 18:05 kasper93

> pytorch rocm 6.4 is still nightly, so I can't merge some of these changes as is, or it's going to break people using the stable pytorch, which is on 6.3

I've rebased. I think we should merge this. We could add some version checks, but it's not clear to me which versions are broken. (Also, upstream support for ROCm 6.4 will probably only come with the next pytorch release, which can take months.) The main issue currently is that those is_amd checks are unconditional and cannot be controlled by the user. After these changes, users can still force fp32 VAE if they want.

kasper93 avatar Jun 11 '25 14:06 kasper93

> pytorch rocm 6.4 is still nightly, so I can't merge some of these changes as is, or it's going to break people using the stable pytorch, which is on 6.3

> I've rebased. I think we should merge this. We could add some version checks, but it's not clear to me which versions are broken. (Also, upstream support for ROCm 6.4 will probably only come with the next pytorch release, which can take months.) The main issue currently is that those is_amd checks are unconditional and cannot be controlled by the user. After these changes, users can still force fp32 VAE if they want.

IMO, it is much safer if we have a flag to enable the new is_amd checking logic until we actually benchmark it and see that there are no performance regressions on older AMD cards.

chutchatut avatar Jun 11 '25 15:06 chutchatut

> pytorch rocm 6.4 is still nightly, so I can't merge some of these changes as is, or it's going to break people using the stable pytorch, which is on 6.3

> I've rebased. I think we should merge this. We could add some version checks, but it's not clear to me which versions are broken. (Also, upstream support for ROCm 6.4 will probably only come with the next pytorch release, which can take months.) The main issue currently is that those is_amd checks are unconditional and cannot be controlled by the user. After these changes, users can still force fp32 VAE if they want.

> IMO, it is much safer if we have a flag to enable the new is_amd checking logic until we actually benchmark it and see that there are no performance regressions on older AMD cards.

It's already possible to force the VAE type, so there is no need for hardcoded is_amd cases. Also, it was reported to cause OOMs on both new and old GPUs.
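
For reference, forcing the VAE precision is already a one-flag change on the command line (a sketch, assuming the stock --fp32-vae / --bf16-vae options):

# keep the old behaviour explicitly: run the VAE in fp32
python main.py --fp32-vae

# or opt into bf16 on cards where it works
python main.py --bf16-vae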

kasper93 avatar Jun 11 '25 19:06 kasper93

@kasper93 sorry to bother you. I tried your fork with this commit and the massive memory usage during VAE Decode persists. Also, adding the following flag: --use-pytorch-cross-attention increases the memory usage during the ksampler a lot. I use pytorch nightly, ROCm 6.4.1, and an RX 9070 XT in the official AlmaLinux container.

wasd-tech avatar Jun 20 '25 12:06 wasd-tech

Well, yes. It still uses a lot of memory, but it's less than before, and previously the fallback to tiled VAE would take ages because of the conversions; this seems to work more snappily too.

> --use-pytorch-cross-attention increases the memory usage during the ksampler a lot.

Maybe, but it's a lot faster.

The main goal of this PR is to remove the hardcoded conditions and make it possible to evaluate other components in the pipeline.

kasper93 avatar Jun 20 '25 12:06 kasper93

Thanks for the fast reply and the explanations. I forgot to mention that I haven't noticed any problems with using BF16 so far, if that helps.

wasd-tech avatar Jun 20 '25 12:06 wasd-tech

Rebased. Pytorch 2.8.0 with support for ROCm 6.4 has been released too.
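
For anyone who wants to test against the stable wheels, the install should look roughly like this (a sketch; the exact rocm6.4 index URL is an assumption, double-check the selector on pytorch.org):

# install the stable PyTorch build with ROCm 6.4 support (index URL assumed, verify on pytorch.org)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.4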

kasper93 avatar Aug 07 '25 19:08 kasper93

> pytorch rocm 6.4 is still nightly, so I can't merge some of these changes as is, or it's going to break people using the stable pytorch, which is on 6.3

@comfyanonymous A stable PyTorch release with ROCm 6.4 support is out now. Can we merge now? The VAE crash is really painful.

githust66 avatar Aug 23 '25 09:08 githust66

I have experienced the VAE issue on gfx1151 on every ROCm version so far, up to and including ROCm 7.0.

The best workaround that has worked reliably for me has been described here: https://github.com/ROCm/ROCm/issues/4729#issuecomment-2888209739

In short:

export MIOPEN_FIND_MODE=2

What this option does is documented here: https://rocm.docs.amd.com/projects/MIOpen/en/latest/how-to/find-and-immediate.html#find-modes

Basically, MIOpen by default trials multiple kernels for best performance, which involves compiling and benchmarking, trying to find the kernel best suited for the pending workload. In fast mode it skips that (very expensive) search and falls back to the naive, non-optimized implementation, aka: just do it. Which is actually totally fine for VAE. Imagine doing many minutes of work just to figure out how best to approach a fraction-of-a-second workload.
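
For a one-off run, the variable can also be set inline instead of exporting it (assuming ComfyUI is launched via main.py):

# skip MIOpen's expensive kernel search just for this run
MIOPEN_FIND_MODE=2 python main.py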

VAE Decode has been near-instant for SDXL so far, with no noticeable delay and no blackouts or OOMs.

AnhNhan avatar Sep 29 '25 00:09 AnhNhan

Rebased after the recent changes; some of the changes here were already upstreamed.

kasper93 avatar Oct 13 '25 03:10 kasper93