Support Composable Kernel on RDNA3 (7900 XTX)
Dear AMD developers! On behalf of thousands of ROCm users, I ask you to add support for RDNA3, and in particular the 7900 XTX. We have been waiting for xformers support, Flash Attention 2, and other memory-management optimizations for several years. Many tickets are closed without being completed, and each time we are told to "open a new ticket". When will you listen to us? After all, the 7900 XTX is still a very good card with a large amount of memory.
Here are some related tickets: https://github.com/ROCm/composable_kernel/issues/1434 https://github.com/ROCm/composable_kernel/issues/1171 https://github.com/ROCm/composable_kernel/issues/634 https://github.com/ROCm/xformers/issues/17 https://github.com/ROCm/xformers/issues/16
I installed ComfyUI on my friend's PC (with a 2080) and ran some wan2.1 benchmarks against my own 7800 XT on Linux. The 2080 kicks the pants off my 7800 XT, and it's very embarrassing. I don't think I can optimize it any further on my own.
Navi3 is not the top priority of CK right now. cc @illsilin for further comment
I think it's very clear from this post, and from general opinion on the internet, that we know that. Not prioritizing these cards for years is what AMD's customers take issue with.
With the introduction of AITER and other tools that rely on CK, Navi3 systems have become almost unusable. While you can use almost any low-end NVIDIA hardware, AMD hardware is really not an option at all. Very sad.
Yup, sold my 7900 and got a 5090. Everything is easy to install; no weird workarounds to get anything working.
@illsilin Is there any roadmap for Navi support?
I think RDNA3 (and also RDNA4) is already supported in CK? https://github.com/RenderKit/oidn/pull/251
I think this is the best solution, although it is very sad. I still believe in AMD's engineers (but much less than before).
@carlushuang @illsilin Doesn't Navi3/4 support already exist in CK via WMMA, though?
@hackey and others reading this - if you're on Windows, can you try this PyTorch wheel on your 7900 XTX and let me know if you face any issues? https://github.com/scottt/rocm-TheRock/releases/tag/v6.5.0rc-pytorch-gfx110x. It was built using TheRock. The wheel comes bundled with all required ROCm libraries, so you don't have to install ROCm yourself. Just pip install it and you should be good to go.
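After installing, a quick sanity check like this (a minimal sketch, assuming the wheel is a standard ROCm build of PyTorch) confirms the GPU is actually picked up:

```python
# Minimal post-install smoke test (sketch). On ROCm builds of PyTorch,
# torch.version.hip is a version string; on CUDA builds it is None.
import torch

print(torch.__version__, torch.version.hip)
print(torch.cuda.is_available())        # ROCm reuses the torch.cuda namespace
print(torch.cuda.get_device_name(0))    # should report the 7900 XTX (gfx1100)
```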
This uses aotriton, which provides flash-attention kernels (both forward and backward). In my experience, aotriton SDPA has performed just as fast as xformers on MI300X, with negligible VRAM-usage difference vs. NVIDIA implementations. It'd be good to know if it doesn't, however. I'd be curious to know the cases where xformers performs faster than SDPA.
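If you want to confirm the flash path is actually taken rather than a silent fallback, you can pin the SDPA backend. A minimal sketch, assuming PyTorch 2.3+ (where torch.nn.attention.sdpa_kernel is available) and illustrative shapes:

```python
# Sketch: pin SDPA to the flash-attention backend (backed by aotriton on
# ROCm). If no flash kernel is available for these shapes/dtypes, this
# raises instead of silently falling back to the slow math path.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```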
Any code you use that requires xformers should be easily replaceable with PyTorch's scaled_dot_product_attention, or the code may already have an option to use SDPA. E.g., ComfyUI provides the --use-pytorch-cross-attention flag for this.
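For illustration, a minimal sketch of the swap (shapes here are made up; the layout gotcha is the main thing to check in your model):

```python
# Sketch of the drop-in replacement. Key gotcha: xformers'
# memory_efficient_attention takes (batch, seq, heads, dim), while SDPA
# takes (batch, heads, seq, dim), so transpose around the call.
import torch
import torch.nn.functional as F

q = torch.randn(2, 128, 8, 64, device="cuda", dtype=torch.float16)  # (B, seq, heads, dim)
k = torch.randn(2, 128, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 128, 8, 64, device="cuda", dtype=torch.float16)

# Before: out = xformers.ops.memory_efficient_attention(q, k, v)
out = F.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
).transpose(1, 2)  # back to (B, seq, heads, dim)
```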
AFAIK official PyTorch already has support for aotriton; you don't need a third-party fork for this.
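A quick way to check what a stock build exposes, as a sketch using PyTorch's public backend flags:

```python
# Sketch: on a stock ROCm wheel, these report whether the flash and
# memory-efficient SDPA backends are enabled in this build.
import torch

print(torch.backends.cuda.flash_sdp_enabled())          # flash-attention path
print(torch.backends.cuda.mem_efficient_sdp_enabled())  # memory-efficient path
```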
Navi3 support does not exist in CK as of this writing. This is easy to see by trying to build FlashAttention with the CK backend:
```
/home/feep/flash-attention-upstream/csrc/composable_kernel/include/ck_tile/core/numeric/bfloat16.hpp:178:21: error: invalid operand for instruction
  178 |     asm volatile("\n \
      |                     ^
<inline asm>:2:26: note: instantiated into assembly here
    2 |  v_cmp_u_f32 s[16:17], v9, v9
      |                          ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
```
This is even with the newest develop branch of CK.