ComfyUI
Remove AMD workarounds as they do more harm than good on recent ROCm releases.
Tested on gfx1201, ROCm 6.4.1. It fixes the VAE Decode issues, and performance is generally better with PyTorch flash attention. bfloat16 is working fine.
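For anyone reproducing this, a quick sanity check (just a sketch, not part of the PR, assuming a recent PyTorch 2.x ROCm build) to confirm which scaled-dot-product attention backends the build reports as enabled:

import torch

print("HIP runtime:", torch.version.hip)  # version string on ROCm builds, None otherwise
print("flash SDP:", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient SDP:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math SDP:", torch.backends.cuda.math_sdp_enabled())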
pytorch rocm 6.4 is still nightly so I can't merge some of these changes as is or it's going to break people using the stable pytorch which is on 6.3
Ok, fair, it can wait. I was using pytorch wheels from AMD https://repo.radeon.com/rocm/manylinux/rocm-rel-6.4.1/ to validate those changes.
pytorch rocm 6.4 is still nightly so I can't merge some of these changes as is or it's going to break people using the stable pytorch which is on 6.3
I've rebased. I think we should merge this. We could add some version checks, but it's not clear to me on which versions it is broken. (Also, upstream ROCm 6.4 will probably land in the next pytorch release, which can take months.) The main issue currently is that those is_amd checks are unconditional and cannot be controlled by the user. After these changes, users can still force fp32 VAE if they want.
pytorch rocm 6.4 is still nightly so I can't merge some of these changes as is or it's going to break people using the stable pytorch which is on 6.3
I've rebased. I think we should merge this. We could add some version checks, but it's not clear to me on which versions it is broken. (Also, upstream ROCm 6.4 will probably land in the next pytorch release, which can take months.) The main issue currently is that those is_amd checks are unconditional and cannot be controlled by the user. After these changes, users can still force fp32 VAE if they want.
IMO, it is much safer if we have a flag to enable the new is_amd checking logic until we actually benchmark and see that there are no performance regressions on older AMD cards.
pytorch rocm 6.4 is still nightly so I can't merge some of these changes as is or it's going to break people using the stable pytorch which is on 6.3
I've rebased. I think we should merge this. We could add some version checks, but it's not clear to me on which versions it is broken. (Also, upstream ROCm 6.4 will probably land in the next pytorch release, which can take months.) The main issue currently is that those is_amd checks are unconditional and cannot be controlled by the user. After these changes, users can still force fp32 VAE if they want.
IMO, it is much safer if we have a flag to enable the new is_amd checking logic until we actually benchmark and see that there are no performance regressions on older AMD cards.
It's already possible to force the VAE type; there is no need for hardcoded is_amd cases. Also, it was reported to cause OOMs on both new and old GPUs.
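To make the trade-off concrete, here is a minimal sketch (illustrative names only, not ComfyUI's actual code) of the pattern under discussion: instead of an unconditional vendor check forcing fp32, the default path picks the faster dtype and the user keeps explicit overrides.

import torch

def is_amd():
    # True on ROCm (HIP) builds of PyTorch.
    return torch.version.hip is not None

def vae_dtype(force_fp32=False, force_bf16=False):
    # Explicit user overrides always win.
    if force_fp32:
        return torch.float32
    if force_bf16:
        return torch.bfloat16
    # Previously: "if is_amd(): return torch.float32" with no way to opt out.
    # Here the default simply picks the faster dtype regardless of vendor.
    return torch.bfloat16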
@kasper93 sorry to bother you. I tried your fork with this commit and the massive memory usage during VAE Decode persists. Also, adding the following flag: --use-pytorch-cross-attention increases memory usage a lot during the KSampler. I use pytorch nightly, ROCm 6.4.1, and an RX 9070 XT in the official AlmaLinux container.
Well, yes. It still uses a lot of memory, but it's less than before, and previously the fallback to tiled VAE would take ages because of conversions; this seems to work more snappily too.
--use-pytorch-cross-attention increases memory usage a lot during the KSampler.
Maybe, but it's a lot faster.
The main goal of this PR is to remove the hardcoded conditions and make it possible to evaluate other components in the pipeline.
Thanks for the fast reply and the explanations. I forgot to mention that I haven't noticed any problems using BF16 so far, if that helps.
Rebased. Pytorch 2.8.0 with support for ROCm 6.4 has been released too.
pytorch rocm 6.4 is still nightly so I can't merge some of these changes as is or it's going to break people using the stable pytorch which is on 6.3
@comfyanonymous PyTorch ROCm 6.4 has a stable release now. Can we merge? The VAE crash is really uncomfortable.
I have experienced the VAE issue on gfx1151 on every ROCm version so far, up to and including ROCm 7.0.
The best workaround that has worked reliably for me has been described here: https://github.com/ROCm/ROCm/issues/4729#issuecomment-2888209739
In short:
export MIOPEN_FIND_MODE=2
What this option does is documented here: https://rocm.docs.amd.com/projects/MIOpen/en/latest/how-to/find-and-immediate.html#find-modes
Basically, MIOpen by default trials multiple kernels for best performance, which involves compiling and benchmarking to find the kernel best suited for the pending workload. In fast mode it skips that (very expensive) search and falls back to the naive, non-optimized implementation, i.e. just do it, which is actually totally fine for VAE. Imagine doing many minutes of work to figure out how best to approach a fraction-of-a-second workload.
VAE Decode has been near-instant for SDXL, with no noticeable delay so far and no blackouts or OOMs.
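If you'd rather not touch the shell environment, the same workaround can be applied from a Python launcher; this is just a sketch, the only requirement being that the variable is set before MIOpen runs its first find (so before importing torch):

import os
os.environ.setdefault("MIOPEN_FIND_MODE", "2")  # 2 = FAST find mode, per the MIOpen docs

import torch  # imported after the env var on purpose, so MIOpen picks it up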
Rebased after recent changes; some changes were already upstreamed.