[Issue]: cooperativeLaunch unavailable on RX 9070 XT with ROCm 6.4.1
Problem Description
I have a program that is relying on cooperative groups, ie. it is using hipLaunchCooperativeKernel(). The program worked fine with my RX 6800. However, since I upgraded to RX 9070 XT I'm just seing the following error:
HIP error: unspecified launch failure at hip.cpp:356
Upon investigation I discovered that hipDeviceProp_t.cooperativeLaunch is now false. This is also confirmed by hipInfo:
$ ./hipInfo | grep cooperativeLaunch
cooperativeLaunch: 0
Did RDNA4 drop support for cooperative groups or is this a ROCm 6.4.1 issue? (I was forced to upgrade because the ROCm version included in Ubuntu does not support RDNA4).
Operating System
Ubuntu 25.04
CPU
AMD Ryzen 7 5800X 8-Core Processor
GPU
AMD Radeon RX 9070 XT
ROCm Version
ROCm 6.4.1
ROCm Component
HIP
Steps to Reproduce
$ ./hipInfo | grep cooperativeLaunch
cooperativeLaunch: 0
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
$ /opt/rocm/bin/rocminfo --support
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.15
Runtime Ext Version: 1.7
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
XNACK enabled: NO
DMAbuf Support: YES
VMM Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 7 5800X 8-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 7 5800X 8-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 4853
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 31755704(0x1e48db8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 31755704(0x1e48db8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 31755704(0x1e48db8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 31755704(0x1e48db8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx1201
Uuid: GPU-d77d3229494d292f
Marketing Name: AMD Radeon Graphics
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 8192(0x2000) KB
L3: 65536(0x10000) KB
Chip ID: 30032(0x7550)
ASIC Revision: 1(0x1)
Cacheline Size: 256(0x100)
Max Clock Freq. (MHz): 2460
BDFID: 2560
Internal Node ID: 1
Compute Unit: 64
SIMDs per CU: 2
Shader Engines: 4
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 962
SDMA engine uCode:: 838
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16695296(0xfec000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1201
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
ISA 2
Name: amdgcn-amd-amdhsa--gfx12-generic
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
Additional Information
No response
Hi @mwuertinger. Internal ticket has been created to investigate this issue. Thanks!
Can you share the ROCm version you were using on 6800. Also a sample code of how you are using it.
@cjatin I've been using the ROCm that comes with Ubuntu 25.04. That seems to be 5.7.1: https://launchpad.net/ubuntu/+source/rocminfo
Now I installed ROCm from https://repo.radeon.com/rocm/apt/6.4.1 as explained here: https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.1.0/how-to/native-install/ubuntu.html
Unfortunately the latest supported Ubuntu version seems to be 22.04. I'm using Ubuntu 25.04 and compiling and running any program, including the example below fails with:
free(): invalid pointer
Aborted
So what I did instead is use an ubuntu:22.04 Docker container, install ROCm as explained above in that container and do the compilation there. At least the programs are no longer crashing right away that way.
Here's a small example program demonstrating that cooperative groups are no longer working:
#define __HIP_PLATFORM_AMD__
#include <hip/hip_runtime.h>
#include <hip/hip_runtime_api.h>
#include <hip/hip_cooperative_groups.h>
#include <iostream>
#include <vector>
#define HIP_CHECK(expression) \
{ \
hipError_t status = expression; \
if (status != hipSuccess) { \
std::cerr << "HIP error: " << hipGetErrorString(status) << " at " << __FILE__ << ":" << __LINE__ << std::endl; \
exit(1); \
} \
}
__global__ void simulate() {
}
int main(const int argc, const char **argv) {
hipStream_t stream;
HIP_CHECK(hipStreamCreate(&stream));
void *args[] = {};
HIP_CHECK(hipLaunchCooperativeKernel((void *)simulate, dim3(64), dim3(1024), args, 0, stream));
HIP_CHECK(hipStreamSynchronize(stream));
}
Output:
HIP error: unspecified launch failure at hip-issue-min.cpp:26
Another way to reproduce is https://github.com/ROCm/rocm-examples/blob/amd-staging/HIP-Basic/cooperative_groups/main.hip which just prints the following on my system:
$ ./hip_cooperative_groups
Skipping, device 0 does not support cooperative groups
I have modified the sample https://gist.github.com/cjatin/6420cd9cda4437ea3957fba792cc3c3e can you try running this.
Regarding what's happening, hipLaunchCooperativeKernel allocates a cooperative queue which relies on Global Wave Sync(GWS). Navi4 does not have GWS.
What I would recommend is use hipLaunchKernel or <<< >>> launch instead of hipLaunchCooperativeKernel.
Meanwhile I will dig a bit deeper into GWS+RDNA4 and how it does sync.
If I just replace hipLaunchCooperativeKernel with hipLaunchKernel I get the following output:
Memory access fault by GPU node-1 (Agent handle: 0x34fd2350) on address (nil). Reason: Page not present or supervisor privilege.
Aborted
My kernel is using cooperative_groups::this_grid() to sync all the threads across groups. If I interpret this tutorial correctly it is required to use hipLaunchCooperativeKernel to start such a kernel: https://rocm.docs.amd.com/projects/HIP/en/docs-develop/tutorial/cooperative_groups_tutorial.html
Running https://gist.github.com/cjatin/6420cd9cda4437ea3957fba792cc3c3e results in Validation passed.. It seems that some but not all cooperative group features are working.
Do you know why Navi 4 does not have Global Wave Sync (GWS)? Is that something that is just missing in ROCm / in the driver or is that a hardware problem? It seems like a big step backwards to drop support for this.
This issue has been migrated to: https://github.com/ROCm/rocm-systems/issues/401
I think this should be sorted in rocm 7.0 release, driver was reporting coop queues as 0.