[Issue]: GPU verification of example problems fails
Problem Description
Running example binaries generated within composable_kernel/build/bin with GPU verification fails. CPU verification succeeds. See below:
PS: ROCm version 6.3.0 not 6.0.0
Operating System
Ubuntu 22.04.2 LTS (Jammy Jellyfish)
CPU
Intel(R) Xeon(R) Platinum 8480C
GPU
AMD Instinct MI300X
Other
No response
ROCm Version
ROCm 6.0.0
ROCm Component
Composable Kernel
Steps to Reproduce
Attached screenshot
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
******* Agent 9 ******* Name: gfx942 Uuid: GPU-0502bfff113277c6 Marketing Name: AMD Instinct MI300X Vendor Name: AMD Feature: KERNEL_DISPATCH Profile: BASE_PROFILE Float Round Mode: NEAR Max Queue Number: 128(0x80) Queue Min Size: 64(0x40) Queue Max Size: 131072(0x20000) Queue Type: MULTI Node: 8 Device Type: GPU Cache Info: L1: 32(0x20) KB L2: 4096(0x1000) KB L3: 262144(0x40000) KB Chip ID: 29857(0x74a1) ASIC Revision: 1(0x1) Cacheline Size: 64(0x40) Max Clock Freq. (MHz): 2100 BDFID: 48896 Internal Node ID: 8 Compute Unit: 304 SIMDs per CU: 4 Shader Engines: 32 Shader Arrs. per Eng.: 1 WatchPts on Addr. Ranges:4 Coherent Host Access: FALSE Memory Properties: Features: KERNEL_DISPATCH Fast F16 Operation: TRUE Wavefront Size: 64(0x40) Workgroup Max Size: 1024(0x400) Workgroup Max Size per Dimension: x 1024(0x400) y 1024(0x400) z 1024(0x400) Max Waves Per CU: 32(0x20) Max Work-item Per CU: 2048(0x800) Grid Max Size: 4294967295(0xffffffff) Grid Max Size per Dimension: x 4294967295(0xffffffff) y 4294967295(0xffffffff) z 4294967295(0xffffffff) Max fbarriers/Workgrp: 32 Packet Processor uCode:: 166 SDMA engine uCode:: 22 IOMMU Support:: None Pool Info: Pool 1 Segment: GLOBAL; FLAGS: COARSE GRAINED Size: 201310208(0xbffc000) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Recommended Granule:2048KB Alloc Alignment: 4KB Accessible by all: FALSE Pool 2 Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED Size: 201310208(0xbffc000) KB Allocatable: TRUE Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 4
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
Additional Information
No response
Hi @mysoreanoop. Internal ticket has been created to investigate this issue. Thanks!
Unlike some ck_tile examples, the 01_gemm examples did not handle 0 0 0 as stride arguments well. This resulted in crashing of reference gpu kernel. A PR is in progress to fix that.
PR is merged.