
Regression with AMDGPU.jl v1.3.4 on JACC parallel_reduce CI

Open williamfgc opened this issue 5 months ago • 7 comments

Questionnaire

  1. Does ROCm work for you outside of Julia, e.g. C/C++/Python? Yes

  2. Post output of rocminfo.

$ rocminfo
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          NO

*******                  
Agent 3                  
*******                  
  Name:                    gfx908                             
  Uuid:                    GPU-7f99cc8d20f3c038               
  Marketing Name:          AMD Instinct MI100                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      8192(0x2000) KB                    
  Chip ID:                 29580(0x738c)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1502                               
  BDFID:                   10496                              
  Internal Node ID:        2                                  
  Compute Unit:            120                                
  SIMDs per CU:            4                                  
  Shader Engines:          8                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 67                                 
  SDMA engine uCode::      18                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                      

  3. Post output of AMDGPU.versioninfo() if possible.

julia> AMDGPU.versioninfo()
[ Info: AMDGPU versioninfo
┌───────────┬──────────────────┬───────────┬────────────────────────────────────
│ Available │ Name             │ Version   │ Path                              ⋯
├───────────┼──────────────────┼───────────┼────────────────────────────────────
│     +     │ LLD              │ -         │ /opt/rocm/llvm/bin/ld.lld         ⋯
│     +     │ Device Libraries │ -         │ /home/wfg/.julia-cousteau/artifac ⋯
│     +     │ HIP              │ 6.3.42134 │ /opt/rocm/lib/libamdhip64.so      ⋯
│     +     │ rocBLAS          │ 4.3.0     │ /opt/rocm/lib/librocblas.so       ⋯
│     +     │ rocSOLVER        │ 3.27.0    │ /opt/rocm/lib/librocsolver.so     ⋯
│     +     │ rocSPARSE        │ 3.3.0     │ /opt/rocm/lib/librocsparse.so     ⋯
│     +     │ rocRAND          │ 2.10.5    │ /opt/rocm/lib/librocrand.so       ⋯
│     +     │ rocFFT           │ 1.0.31    │ /opt/rocm/lib/librocfft.so        ⋯
│     +     │ MIOpen           │ 3.3.0     │ /opt/rocm/lib/libMIOpen.so        ⋯
└───────────┴──────────────────┴───────────┴────────────────────────────────────
                                                                1 column omitted

[ Info: AMDGPU devices
┌────┬────────────────────┬────────────────────────┬───────────┬────────────┬───
│ Id │               Name │               GCN arch │ Wavefront │     Memory │  ⋯
├────┼────────────────────┼────────────────────────┼───────────┼────────────┼───
│  1 │ AMD Instinct MI100 │ gfx908:sramecc+:xnack- │        64 │ 31.984 GiB │  ⋯
│  2 │ AMD Instinct MI100 │ gfx908:sramecc+:xnack- │        64 │ 31.984 GiB │  ⋯
└────┴────────────────────┴────────────────────────┴───────────┴────────────┴───
                                                                1 column omitted

Reproducing the bug

  1. Describe what's not working. JACC's parallel_reduce regresses with AMDGPU.jl v1.3.4 on MI100; it works with AMDGPU.jl v1.3.3. The CI logs have full information. We still need to investigate further inside parallel_reduce, but the failure is only reproducible with the new version of AMDGPU.jl.
  • Pass with AMDGPU.jl v1.3.3
  • Fail with AMDGPU.jl v1.3.4

The errors are of this kind:

reduce: Test Failed at /home/wfg/github-runners/cousteau-JACC/ci/_work/JACC.jl/JACC.jl/test/unittests.jl:166
  Expression: mxd == maximum(ah2)
   Evaluated: 3.3822674352068316 == 5.468821633677877

See JACC.jl issue

  2. Provide MWE to reproduce it (if possible). Please see above for JACC.jl CI.

williamfgc, Jul 01 '25 00:07

Please isolate this to a fully JACC-free MWE. The changes in 1.3.4 are very limited (https://github.com/JuliaGPU/AMDGPU.jl/releases/tag/v1.3.4), and the one most likely to be relevant is #783, which adds memory fence semantics to sync_workgroup, something many users assumed it had but which it in fact lacked.

> Provide MWE to reproduce it (if possible). Please see above for JACC.jl CI.

That is the opposite of an MWE.

vchuravy, Jul 01 '25 06:07

@vchuravy thanks, will do and hope to contribute to test coverage.

williamfgc, Jul 01 '25 10:07

What's the status of this given we are now at AMDGPU 2?

luraess, Sep 05 '25 12:09

Finally getting around to this. And yes, I am still seeing a problem with v2.1.0. The code below is a simplified version of part of a reduce algorithm, with four ways of creating the shared-memory array. The first one fails, but the other three work.

I use the wk array to see more of what's going on. I was surprised to see that changing how shared_mem is created also changes the result in the wk array.

The other thing is the conditionals. These were intended to handle the first-run case when M or N is smaller than the size of the shared array. It turns out we don't need them when the shared array is initialized uniformly, but the code should work the same with or without them. Anyway, the interesting part here is that if I switch to the unconditioned code (commented out), the code works even with the first shared_mem version.

using AMDGPU

function _2d_red_kernel((M, N), in, out, wk)
    shared_mem = @ROCStaticLocalArray(eltype(out), (16,16))
    # shared_mem = @ROCStaticLocalArray(eltype(out), (16,16), false)
    # shared_mem = @ROCDynamicLocalArray(eltype(out), (16,16))
    # shared_mem = @ROCDynamicLocalArray(eltype(out), (16,16), false)
    i = workitemIdx().x
    j = workitemIdx().y

    tmp = out[1]
    for ci in CartesianIndices((i:16:M, j:16:N))
        tmp += in[Tuple(ci)...]
    end
    shared_mem[i,j] = tmp
    wk[i,j] = tmp
    for n in (8, 4, 2, 1)
        AMDGPU.sync_workgroup()
        if (i <= n && j <= n)
            # shared_mem[i,j] += shared_mem[i+n, j+n]
            # shared_mem[i,j] += shared_mem[i, j+n]
            # shared_mem[i,j] += shared_mem[i+n, j]
            # wk[i,j] += wk[i+n, j+n]
            # wk[i,j] += wk[i, j+n]
            # wk[i,j] += wk[i+n, j]
            if (i + n <= M && j + n <= N)
                shared_mem[i,j] += shared_mem[i+n, j+n]
                wk[i,j] += wk[i+n, j+n]
            end
            if (i <= M && j + n <= N)
                shared_mem[i,j] += shared_mem[i, j+n]
                wk[i,j] += wk[i, j+n]
            end
            if (i + n <= M && j <= N)
                shared_mem[i,j] += shared_mem[i+n, j]
                wk[i,j] += wk[i+n, j]
            end
        end
    end
    if (i == 1 && j == 1)
        out[1] = shared_mem[i,j]
    end
    return nothing
end

function run_test(groups)
    shmem_size = 16 * 16 * sizeof(Float64)
    wk = AMDGPU.zeros(Int, (16,16))
    in = AMDGPU.ones(Int, groups)
    out = ROCArray([0])
    @sync @roc groupsize=(16,16) gridsize=(1, 1) shmem=shmem_size _2d_red_kernel(
        groups, in, out, wk)
    @show Base.Array(out)[], prod(groups)
    display(wk)
end

run_test((8,8))
run_test((32,32))
run_test((63,63))
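
For reference, the expected result can be checked without a GPU. The following is a minimal CPU-only sketch, not part of the original report: it replays the same two phases sequentially in plain Julia, under the assumption that each pass of the `n` loop corresponds to one barrier-separated step of the kernel. The helper `emulate_reduce` and its keywords `conditioned` and `fill_value` are hypothetical names introduced only for this illustration. It suggests that the conditioned and unconditioned variants should both return `prod(groups)`, whatever the scratch array held initially, because every entry is overwritten before the reduction starts.

```julia
# CPU-only emulation of the 16x16 workgroup reduction; needs no GPU or AMDGPU.jl.
# `emulate_reduce`, `conditioned`, and `fill_value` are hypothetical, for illustration.
function emulate_reduce((M, N), input; conditioned::Bool, fill_value = 0)
    T = eltype(input)
    # Stand-in for the workgroup-local array; `fill_value` mimics its initial contents.
    shared = fill(T(fill_value), 16, 16)

    # Phase 1: each emulated work-item (i, j) sums its strided slice of the input.
    for j in 1:16, i in 1:16
        tmp = zero(T)
        for ci in CartesianIndices((i:16:M, j:16:N))
            tmp += input[ci]
        end
        shared[i, j] = tmp
    end

    # Phase 2: tree reduction. Within one step, the entries that are read
    # (at least one index > n) are never written, so a sequential replay
    # matches a correctly synchronized parallel execution.
    for n in (8, 4, 2, 1)
        for j in 1:n, i in 1:n
            if conditioned
                if i + n <= M && j + n <= N
                    shared[i, j] += shared[i+n, j+n]
                end
                if i <= M && j + n <= N
                    shared[i, j] += shared[i, j+n]
                end
                if i + n <= M && j <= N
                    shared[i, j] += shared[i+n, j]
                end
            else
                shared[i, j] += shared[i+n, j+n]
                shared[i, j] += shared[i, j+n]
                shared[i, j] += shared[i+n, j]
            end
        end
    end
    return shared[1, 1]
end

for dims in ((8, 8), (32, 32), (63, 63))
    input = ones(Int, dims)
    a = emulate_reduce(dims, input; conditioned = true)
    b = emulate_reduce(dims, input; conditioned = false, fill_value = -999)
    @show dims, a, b, prod(dims)
end
```

If the emulation is a fair model of the kernel, any GPU result other than `prod(groups)` points at synchronization or code generation rather than at the reduction logic itself.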

I'll paste the rocminfo and AMDGPU.versioninfo() output below.

$ rocminfo
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          NO

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD EPYC 7272 12-Core Processor    
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD EPYC 7272 12-Core Processor    
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2900                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            24                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    131774032(0x7dab650) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    131774032(0x7dab650) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    131774032(0x7dab650) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 4                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    131774032(0x7dab650) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    AMD EPYC 7272 12-Core Processor    
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD EPYC 7272 12-Core Processor    
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2900                               
  BDFID:                   0                                  
  Internal Node ID:        1                                  
  Compute Unit:            24                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    132061140(0x7df17d4) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    132061140(0x7df17d4) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    132061140(0x7df17d4) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 4                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    132061140(0x7df17d4) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 3                  
*******                  
  Name:                    gfx908                             
  Uuid:                    GPU-7f99cc8d20f3c038               
  Marketing Name:          AMD Instinct MI100                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      8192(0x2000) KB                    
  Chip ID:                 29580(0x738c)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1502                               
  BDFID:                   10496                              
  Internal Node ID:        2                                  
  Compute Unit:            120                                
  SIMDs per CU:            4                                  
  Shader Engines:          8                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 67                                 
  SDMA engine uCode::      18                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 4                  
*******                  
  Name:                    gfx908                             
  Uuid:                    GPU-04b27105474ee68f               
  Marketing Name:          AMD Instinct MI100                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    3                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      8192(0x2000) KB                    
  Chip ID:                 29580(0x738c)                      
  ASIC Revision:           2(0x2)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1502                               
  BDFID:                   34048                              
  Internal Node ID:        3                                  
  Compute Unit:            120                                
  SIMDs per CU:            4                                  
  Shader Engines:          8                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 67                                 
  SDMA engine uCode::      18                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***
julia> AMDGPU.versioninfo()
[ Info: AMDGPU versioninfo
┌───────────┬──────────────────┬───────────┬──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Available │ Name             │ Version   │ Path                                                                                             │
├───────────┼──────────────────┼───────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
│     +     │ LLD              │ -         │ /opt/rocm-6.3.4/lib/llvm/bin/ld.lld                                                              │
│     +     │ Device Libraries │ -         │ /home/4pf/julia_depot-cousteau/artifacts/5ad5ecb46e3c334821f54c1feecc6c152b7b6a45/amdgcn/bitcode │
│     +     │ HIP              │ 6.3.42134 │ /opt/rocm-6.3.4/lib/libamdhip64.so                                                               │
│     +     │ rocBLAS          │ 4.3.0     │ /opt/rocm-6.3.4/lib/librocblas.so                                                                │
│     +     │ rocSOLVER        │ 3.27.0    │ /opt/rocm-6.3.4/lib/librocsolver.so                                                              │
│     +     │ rocSPARSE        │ 3.3.0     │ /opt/rocm-6.3.4/lib/librocsparse.so                                                              │
│     +     │ rocRAND          │ 2.10.5    │ /opt/rocm-6.3.4/lib/librocrand.so                                                                │
│     +     │ rocFFT           │ 1.0.31    │ /opt/rocm-6.3.4/lib/librocfft.so                                                                 │
│     +     │ MIOpen           │ 3.3.0     │ /opt/rocm-6.3.4/lib/libMIOpen.so                                                                 │
└───────────┴──────────────────┴───────────┴──────────────────────────────────────────────────────────────────────────────────────────────────┘

[ Info: AMDGPU devices
┌────┬────────────────────┬────────────────────────┬───────────┬────────────┬───────────────┐
│ Id │               Name │               GCN arch │ Wavefront │     Memory │ Shared Memory │
├────┼────────────────────┼────────────────────────┼───────────┼────────────┼───────────────┤
│  1 │ AMD Instinct MI100 │ gfx908:sramecc+:xnack- │        64 │ 31.984 GiB │    64.000 KiB │
│  2 │ AMD Instinct MI100 │ gfx908:sramecc+:xnack- │        64 │ 31.984 GiB │    64.000 KiB │
└────┴────────────────────┴────────────────────────┴───────────┴────────────┴───────────────┘


PhilipFackler, Sep 18 '25 15:09

@vchuravy any updates, anything you need from our side, or should we consider this a no-go? @PhilipFackler provided an MWE and a lot of information. Happy to help, as we are already providing tests at the consumer level (JACC) to broaden Julia adoption on our systems. We would like to know how to proceed, as this is a regression and we had to lock AMDGPU to a very old version. Thanks in advance.

williamfgc, Nov 23 '25 02:11

One could try to further reduce the kernel to a single statement, with and without the if, then compare perf and report the generated LLVM code in both cases (the expected one and the slow one), using something along the lines of https://github.com/JuliaGPU/AMDGPU.jl/issues/569#issuecomment-1857771859. I could also test things again on the CI machine GPU and/or on LUMI.

luraess, Nov 26 '25 22:11

So part of the issue seems to be around zeroinit, which is not a part of the code I am super familiar with.

Some of it does indeed seem quite strange: https://github.com/JuliaGPU/AMDGPU.jl/blob/4e5ee8edbbea3e471ac17ba2dd5c99c6ad6b77fb/src/device/gcn/memory_static.jl#L84

For ROCStaticLocalArray we seem to rely on a compiler feature to provide zero-initialized memory.

@PhilipFackler could you use @device_code dir="dump" to get all the intermediate code for the four different cases?

vchuravy, Nov 27 '25 10:11