Regression with AMDGPU.jl v1.3.4 on JACC parallel_reduce CI
Questionnaire

- Does ROCm work for you outside of Julia, e.g. C/C++/Python? Yes.
- Post output of rocminfo:
$ rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: NO
*******
Agent 3
*******
Name: gfx908
Uuid: GPU-7f99cc8d20f3c038
Marketing Name: AMD Instinct MI100
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 29580(0x738c)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1502
BDFID: 10496
Internal Node ID: 2
Compute Unit: 120
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 67
SDMA engine uCode:: 18
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
- Post output of AMDGPU.versioninfo() if possible:
julia> AMDGPU.versioninfo()
[ Info: AMDGPU versioninfo
┌───────────┬──────────────────┬───────────┬────────────────────────────────────
│ Available │ Name │ Version │ Path ⋯
├───────────┼──────────────────┼───────────┼────────────────────────────────────
│ + │ LLD │ - │ /opt/rocm/llvm/bin/ld.lld ⋯
│ + │ Device Libraries │ - │ /home/wfg/.julia-cousteau/artifac ⋯
│ + │ HIP │ 6.3.42134 │ /opt/rocm/lib/libamdhip64.so ⋯
│ + │ rocBLAS │ 4.3.0 │ /opt/rocm/lib/librocblas.so ⋯
│ + │ rocSOLVER │ 3.27.0 │ /opt/rocm/lib/librocsolver.so ⋯
│ + │ rocSPARSE │ 3.3.0 │ /opt/rocm/lib/librocsparse.so ⋯
│ + │ rocRAND │ 2.10.5 │ /opt/rocm/lib/librocrand.so ⋯
│ + │ rocFFT │ 1.0.31 │ /opt/rocm/lib/librocfft.so ⋯
│ + │ MIOpen │ 3.3.0 │ /opt/rocm/lib/libMIOpen.so ⋯
└───────────┴──────────────────┴───────────┴────────────────────────────────────
1 column omitted
[ Info: AMDGPU devices
┌────┬────────────────────┬────────────────────────┬───────────┬────────────┬───
│ Id │ Name │ GCN arch │ Wavefront │ Memory │ ⋯
├────┼────────────────────┼────────────────────────┼───────────┼────────────┼───
│ 1 │ AMD Instinct MI100 │ gfx908:sramecc+:xnack- │ 64 │ 31.984 GiB │ ⋯
│ 2 │ AMD Instinct MI100 │ gfx908:sramecc+:xnack- │ 64 │ 31.984 GiB │ ⋯
└────┴────────────────────┴────────────────────────┴───────────┴────────────┴───
1 column omitted
Reproducing the bug
- Describe what's not working. This is a regression with AMDGPU v1.3.4 on JACC's parallel_reduce; it works with AMDGPU v1.3.3 on an MI100. The CI logs have the full information. We still need to investigate further inside parallel_reduce, but this is only reproducible with the new version of AMDGPU.
The errors are of the kind:
reduce: Test Failed at /home/wfg/github-runners/cousteau-JACC/ci/_work/JACC.jl/JACC.jl/test/unittests.jl:166
Expression: mxd == maximum(ah2)
Evaluated: 3.3822674352068316 == 5.468821633677877
See the JACC.jl issue.
- Provide an MWE to reproduce it (if possible). Please see above for the JACC.jl CI.
Please isolate this to a fully JACC-free MWE. The changes in 1.3.4 are very limited (https://github.com/JuliaGPU/AMDGPU.jl/releases/tag/v1.3.4), and the most likely to be relevant one is #783, which adds memory-fence semantics to sync_workgroup, something many users assumed it already had but which it in fact lacked.
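For context, the pattern affected by #783 looks roughly like this (a hedged sketch of mine, not code from the issue; kernel and variable names are invented): a tree reduction over LDS whose correctness relies on sync_workgroup() also publishing the preceding shared-memory writes, not merely synchronizing execution.

```julia
using AMDGPU

# Sketch (not from the issue): classic 1D tree reduction over LDS.
# Each barrier must also make the shared-memory writes of other lanes
# visible, which is exactly the fence semantics #783 added.
function tree_reduce_kernel(out, in)
    s = @ROCStaticLocalArray(Float64, 256)
    i = workitemIdx().x
    s[i] = in[i]                    # write partial value to LDS
    n = 128
    while n >= 1
        AMDGPU.sync_workgroup()     # barrier + (since #783) memory fence
        if i <= n
            s[i] += s[i + n]        # reads writes made by other lanes
        end
        n >>= 1
    end
    if i == 1
        out[1] = s[1]
    end
    return nothing
end
```

Before #783, the barrier synchronized execution but did not formally order the LDS writes, so patterns like this were relying on behavior the compiler was free to break.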
Provide MWE to reproduce it (if possible). Please see above for JACC.jl CI.
That is the opposite of an MWE.
@vchuravy thanks, will do, and I hope to contribute to test coverage.
What's the status of this, given that we are now at AMDGPU 2?
Finally getting around to this. And yes, I'm still seeing a problem with v2.1.0. The following code is a simplified version of part of a reduce algorithm, with four variants of the shared-memory creation. The first one fails, but the other three work.
I use the wk array to see more of what's going on. I was surprised to see that changing the shared_mem creation also changes the result in the wk array.
The other thing is the conditionals. These were intended to handle the first-run case, when M or N is smaller than the size of the shared array. It turns out we don't need them when the shared array is initialized uniformly, but the code should work the same with or without them. Anyway, the interesting part is that if I switch to the unconditioned code (commented out), the code works even with the first shared_mem variant.
using AMDGPU

function _2d_red_kernel((M, N), in, out, wk)
    shared_mem = @ROCStaticLocalArray(eltype(out), (16,16))
    # shared_mem = @ROCStaticLocalArray(eltype(out), (16,16), false)
    # shared_mem = @ROCDynamicLocalArray(eltype(out), (16,16))
    # shared_mem = @ROCDynamicLocalArray(eltype(out), (16,16), false)
    i = workitemIdx().x
    j = workitemIdx().y
    tmp = out[1]
    for ci in CartesianIndices((i:16:M, j:16:N))
        tmp += in[Tuple(ci)...]
    end
    shared_mem[i,j] = tmp
    wk[i,j] = tmp
    for n in (8, 4, 2, 1)
        AMDGPU.sync_workgroup()
        if (i <= n && j <= n)
            # shared_mem[i,j] += shared_mem[i+n, j+n]
            # shared_mem[i,j] += shared_mem[i, j+n]
            # shared_mem[i,j] += shared_mem[i+n, j]
            # wk[i,j] += wk[i+n, j+n]
            # wk[i,j] += wk[i, j+n]
            # wk[i,j] += wk[i+n, j]
            if (i + n <= M && j + n <= N)
                shared_mem[i,j] += shared_mem[i+n, j+n]
                wk[i,j] += wk[i+n, j+n]
            end
            if (i <= M && j + n <= N)
                shared_mem[i,j] += shared_mem[i, j+n]
                wk[i,j] += wk[i, j+n]
            end
            if (i + n <= M && j <= N)
                shared_mem[i,j] += shared_mem[i+n, j]
                wk[i,j] += wk[i+n, j]
            end
        end
    end
    if (i == 1 && j == 1)
        out[1] = shared_mem[i,j]
    end
    return nothing
end

function run_test(groups)
    shmem_size = 16 * 16 * sizeof(Float64)
    wk = AMDGPU.zeros(Int, (16,16))
    in = AMDGPU.ones(Int, groups)
    out = ROCArray([0])
    @sync @roc groupsize=(16,16) gridsize=(1, 1) shmem=shmem_size _2d_red_kernel(
        groups, in, out, wk)
    @show Base.Array(out)[], prod(groups)
    display(wk)
end

run_test((8,8))
run_test((32,32))
run_test((63,63))
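Each test sums an array of ones, so the expected result is simply prod(groups). As a sanity check on the algorithm itself, here is a CPU emulation of the kernel (a sketch of mine, not from the issue): the 16×16 workgroup runs serially and each sync_workgroup() becomes a phase boundary, which is sound here because within a phase no active thread reads a cell that another active thread writes.

```julia
# CPU emulation of _2d_red_kernel's conditional tree reduction.
# Threads run serially; barriers become the boundaries between phases.
function cpu_red((M, N), input)
    shared = zeros(Int, 16, 16)
    for i in 1:16, j in 1:16            # phase 1: per-thread partial sums
        tmp = 0
        for ci in CartesianIndices((i:16:M, j:16:N))
            tmp += input[Tuple(ci)...]
        end
        shared[i, j] = tmp
    end
    for n in (8, 4, 2, 1)               # phase 2: conditional tree reduction
        for i in 1:n, j in 1:n
            if i + n <= M && j + n <= N
                shared[i, j] += shared[i + n, j + n]
            end
            if i <= M && j + n <= N
                shared[i, j] += shared[i, j + n]
            end
            if i + n <= M && j <= N
                shared[i, j] += shared[i + n, j]
            end
        end
    end
    return shared[1, 1]
end

for groups in ((8, 8), (32, 32), (63, 63))
    @show cpu_red(groups, ones(Int, groups)) == prod(groups)   # true
end
```

All three comparisons hold, so the conditionals themselves are not the bug; the divergence on the GPU points at shared-memory initialization or visibility rather than the reduction logic.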
I'll paste the rocminfo and AMDGPU.versioninfo() output below.
$ rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: NO
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD EPYC 7272 12-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7272 12-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2900
BDFID: 0
Internal Node ID: 0
Compute Unit: 24
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 131774032(0x7dab650) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 131774032(0x7dab650) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 131774032(0x7dab650) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 131774032(0x7dab650) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: AMD EPYC 7272 12-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7272 12-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 1
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2900
BDFID: 0
Internal Node ID: 1
Compute Unit: 24
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 132061140(0x7df17d4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 132061140(0x7df17d4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 132061140(0x7df17d4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 132061140(0x7df17d4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 3
*******
Name: gfx908
Uuid: GPU-7f99cc8d20f3c038
Marketing Name: AMD Instinct MI100
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 29580(0x738c)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1502
BDFID: 10496
Internal Node ID: 2
Compute Unit: 120
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 67
SDMA engine uCode:: 18
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*******
Agent 4
*******
Name: gfx908
Uuid: GPU-04b27105474ee68f
Marketing Name: AMD Instinct MI100
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 3
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 29580(0x738c)
ASIC Revision: 2(0x2)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1502
BDFID: 34048
Internal Node ID: 3
Compute Unit: 120
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 67
SDMA engine uCode:: 18
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
julia> AMDGPU.versioninfo()
[ Info: AMDGPU versioninfo
┌───────────┬──────────────────┬───────────┬──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Available │ Name │ Version │ Path │
├───────────┼──────────────────┼───────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ + │ LLD │ - │ /opt/rocm-6.3.4/lib/llvm/bin/ld.lld │
│ + │ Device Libraries │ - │ /home/4pf/julia_depot-cousteau/artifacts/5ad5ecb46e3c334821f54c1feecc6c152b7b6a45/amdgcn/bitcode │
│ + │ HIP │ 6.3.42134 │ /opt/rocm-6.3.4/lib/libamdhip64.so │
│ + │ rocBLAS │ 4.3.0 │ /opt/rocm-6.3.4/lib/librocblas.so │
│ + │ rocSOLVER │ 3.27.0 │ /opt/rocm-6.3.4/lib/librocsolver.so │
│ + │ rocSPARSE │ 3.3.0 │ /opt/rocm-6.3.4/lib/librocsparse.so │
│ + │ rocRAND │ 2.10.5 │ /opt/rocm-6.3.4/lib/librocrand.so │
│ + │ rocFFT │ 1.0.31 │ /opt/rocm-6.3.4/lib/librocfft.so │
│ + │ MIOpen │ 3.3.0 │ /opt/rocm-6.3.4/lib/libMIOpen.so │
└───────────┴──────────────────┴───────────┴──────────────────────────────────────────────────────────────────────────────────────────────────┘
[ Info: AMDGPU devices
┌────┬────────────────────┬────────────────────────┬───────────┬────────────┬───────────────┐
│ Id │ Name │ GCN arch │ Wavefront │ Memory │ Shared Memory │
├────┼────────────────────┼────────────────────────┼───────────┼────────────┼───────────────┤
│ 1 │ AMD Instinct MI100 │ gfx908:sramecc+:xnack- │ 64 │ 31.984 GiB │ 64.000 KiB │
│ 2 │ AMD Instinct MI100 │ gfx908:sramecc+:xnack- │ 64 │ 31.984 GiB │ 64.000 KiB │
└────┴────────────────────┴────────────────────────┴───────────┴────────────┴───────────────┘
@vchuravy any updates, anything you need from our side, or should we consider this a no-go? @PhilipFackler provided an MWE and a lot of information. Happy to help, as we are already providing tests at the consumer level (JACC) to broaden Julia adoption on our systems. We would like to know how to proceed, as this is a regression and we had to pin AMDGPU to a very old version. Thanks in advance.
One could try to further reduce the kernel to a single statement, with and without the if, compare performance, and report the generated LLVM code in both cases (the expected one and the slow one), using something along the lines of https://github.com/JuliaGPU/AMDGPU.jl/issues/569#issuecomment-1857771859. I could also test things again on the CI machine's GPU, and/or on LUMI as well.
So part of the issue seems to be around zeroinit, which is not a part of the code I am super familiar with.
Some of it does seem quite strange: https://github.com/JuliaGPU/AMDGPU.jl/blob/4e5ee8edbbea3e471ac17ba2dd5c99c6ad6b77fb/src/device/gcn/memory_static.jl#L84
For ROCStaticLocalArray we seem to rely on a compiler feature to provide zero-initialized memory.
@PhilipFackler could you use @device_code dir="dump" to get all the intermediate code for the four different cases?
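For reference, the dump could be produced like this (a sketch assuming the run_test driver above; the directory name is mine). @device_code with dir= writes the intermediate forms (Julia IR, LLVM IR, GCN ISA) for every kernel compiled in the covered expression.

```julia
# Sketch: dump intermediate code for one of the four shared_mem variants.
# Edit the commented shared_mem lines in _2d_red_kernel and rerun with a
# different dir for each case, then diff the four dumps.
AMDGPU.@device_code dir="dump_static_zeroinit" run_test((63, 63))
```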