HIP [Issue]: cooperativeLaunch unavailable on RX 9070 XT with ROCm 6.4.1

Problem Description

I have a program that is relying on cooperative groups, ie. it is using hipLaunchCooperativeKernel(). The program worked fine with my RX 6800. However, since I upgraded to RX 9070 XT I'm just seing the following error:

HIP error: unspecified launch failure at hip.cpp:356

Upon investigation I discovered that hipDeviceProp_t.cooperativeLaunch is now false. This is also confirmed by hipInfo:

$ ./hipInfo | grep cooperativeLaunch
cooperativeLaunch:                0

Did RDNA4 drop support for cooperative groups or is this a ROCm 6.4.1 issue? (I was forced to upgrade because the ROCm version included in Ubuntu does not support RDNA4).

Operating System

Ubuntu 25.04

CPU

AMD Ryzen 7 5800X 8-Core Processor

GPU

AMD Radeon RX 9070 XT

ROCm Version

ROCm 6.4.1

ROCm Component

HIP

Steps to Reproduce

$ ./hipInfo | grep cooperativeLaunch
cooperativeLaunch:                0

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

$ /opt/rocm/bin/rocminfo --support
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.15
Runtime Ext Version:     1.7
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
XNACK enabled:           NO
DMAbuf Support:          YES
VMM Support:             YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 5800X 8-Core Processor 
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 5800X 8-Core Processor 
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   4853                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    31755704(0x1e48db8) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    31755704(0x1e48db8) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    31755704(0x1e48db8) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 4                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    31755704(0x1e48db8) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1201                            
  Uuid:                    GPU-d77d3229494d292f               
  Marketing Name:          AMD Radeon Graphics                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      8192(0x2000) KB                    
    L3:                      65536(0x10000) KB                  
  Chip ID:                 30032(0x7550)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          256(0x100)                         
  Max Clock Freq. (MHz):   2460                               
  BDFID:                   2560                               
  Internal Node ID:        1                                  
  Compute Unit:            64                                 
  SIMDs per CU:            2                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 962                                
  SDMA engine uCode::      838                                
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16695296(0xfec000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1201         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
    ISA 2                    
      Name:                    amdgcn-amd-amdhsa--gfx12-generic   
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***

Additional Information

No response

Jun 08 '25 10:06 mwuertinger

Hi @mwuertinger. Internal ticket has been created to investigate this issue. Thanks!

Jun 09 '25 14:06 ppanchad-amd

Can you share the ROCm version you were using on 6800. Also a sample code of how you are using it.

Jun 09 '25 14:06 cjatin

@cjatin I've been using the ROCm that comes with Ubuntu 25.04. That seems to be 5.7.1: https://launchpad.net/ubuntu/+source/rocminfo

Now I installed ROCm from https://repo.radeon.com/rocm/apt/6.4.1 as explained here: https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.1.0/how-to/native-install/ubuntu.html

Unfortunately the latest supported Ubuntu version seems to be 22.04. I'm using Ubuntu 25.04 and compiling and running any program, including the example below fails with:

free(): invalid pointer
Aborted

So what I did instead is use an ubuntu:22.04 Docker container, install ROCm as explained above in that container and do the compilation there. At least the programs are no longer crashing right away that way.

Here's a small example program demonstrating that cooperative groups are no longer working:

#define __HIP_PLATFORM_AMD__
#include <hip/hip_runtime.h>
#include <hip/hip_runtime_api.h>
#include <hip/hip_cooperative_groups.h>
#include <iostream>
#include <vector>

#define HIP_CHECK(expression)                                                                                                              \
  {                                                                                                                                        \
    hipError_t status = expression;                                                                                                        \
    if (status != hipSuccess) {                                                                                                            \
      std::cerr << "HIP error: " << hipGetErrorString(status) << " at " << __FILE__ << ":" << __LINE__ << std::endl;                       \
      exit(1);                                                                                                                             \
    }                                                                                                                                      \
  }

__global__ void simulate() {
}

int main(const int argc, const char **argv) {
  hipStream_t stream;
  HIP_CHECK(hipStreamCreate(&stream));

  void *args[] = {};
  HIP_CHECK(hipLaunchCooperativeKernel((void *)simulate, dim3(64), dim3(1024), args, 0, stream));
  HIP_CHECK(hipStreamSynchronize(stream));
}

Output:

HIP error: unspecified launch failure at hip-issue-min.cpp:26

Jun 09 '25 20:06 mwuertinger

Another way to reproduce is https://github.com/ROCm/rocm-examples/blob/amd-staging/HIP-Basic/cooperative_groups/main.hip which just prints the following on my system:

$ ./hip_cooperative_groups 
Skipping, device 0 does not support cooperative groups

Jun 09 '25 20:06 mwuertinger

I have modified the sample https://gist.github.com/cjatin/6420cd9cda4437ea3957fba792cc3c3e can you try running this.

Regarding what's happening, hipLaunchCooperativeKernel allocates a cooperative queue which relies on Global Wave Sync(GWS). Navi4 does not have GWS. What I would recommend is use hipLaunchKernel or <<< >>> launch instead of hipLaunchCooperativeKernel. Meanwhile I will dig a bit deeper into GWS+RDNA4 and how it does sync.

Jun 10 '25 08:06 cjatin

If I just replace hipLaunchCooperativeKernel with hipLaunchKernel I get the following output:

Memory access fault by GPU node-1 (Agent handle: 0x34fd2350) on address (nil). Reason: Page not present or supervisor privilege.
Aborted

My kernel is using cooperative_groups::this_grid() to sync all the threads across groups. If I interpret this tutorial correctly it is required to use hipLaunchCooperativeKernel to start such a kernel: https://rocm.docs.amd.com/projects/HIP/en/docs-develop/tutorial/cooperative_groups_tutorial.html

Running https://gist.github.com/cjatin/6420cd9cda4437ea3957fba792cc3c3e results in Validation passed.. It seems that some but not all cooperative group features are working.

Do you know why Navi 4 does not have Global Wave Sync (GWS)? Is that something that is just missing in ROCm / in the driver or is that a hardware problem? It seems like a big step backwards to drop support for this.

Jun 10 '25 20:06 mwuertinger

This issue has been migrated to: https://github.com/ROCm/rocm-systems/issues/401

Aug 18 '25 18:08 systems-assistant[bot]

I think this should be sorted in rocm 7.0 release, driver was reporting coop queues as 0.

Aug 19 '25 11:08 cjatin