[Issue]: GPU verification of example problems fails

Open mysoreanoop opened this issue 8 months ago • 1 comments

Problem Description

Example binaries built into composable_kernel/build/bin fail when run with GPU verification. The same examples pass with CPU verification. See below:

Image

PS: The ROCm version is 6.3.0, not 6.0.0.

Operating System

Ubuntu 22.04.2 LTS (Jammy Jellyfish)

CPU

Intel(R) Xeon(R) Platinum 8480C

GPU

AMD Instinct MI300X

Other

No response

ROCm Version

ROCm 6.0.0

ROCm Component

Composable Kernel

Steps to Reproduce

Attached screenshot

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

*******
Agent 9
*******
  Name:                    gfx942
  Uuid:                    GPU-0502bfff113277c6
  Marketing Name:          AMD Instinct MI300X
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    8
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      4096(0x1000) KB
    L3:                      262144(0x40000) KB
  Chip ID:                 29857(0x74a1)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2100
  BDFID:                   48896
  Internal Node ID:        8
  Compute Unit:            304
  SIMDs per CU:            4
  Shader Engines:          32
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 166
  SDMA engine uCode::      22
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32

Additional Information

No response

mysoreanoop, Mar 31 '25 06:03

Hi @mysoreanoop. An internal ticket has been created to investigate this issue. Thanks!

ppanchad-amd, Mar 31 '25 14:03

Unlike some ck_tile examples, the 01_gemm examples did not correctly handle 0 0 0 passed as the stride arguments. This caused the reference GPU kernel to crash. A PR is in progress to fix that.

AviralGoelAMD, May 30 '25 23:05
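
For readers unfamiliar with the zero-stride convention mentioned above, here is a minimal sketch of one common way to treat a stride of 0 as "not specified" and fall back to the packed (contiguous) leading dimension. The helper name `resolve_stride` and its parameters are hypothetical and are not taken from composable_kernel; the merged PR may handle this differently.

```cpp
#include <cstddef>

// Hypothetical helper (not part of composable_kernel): pick the leading
// dimension for a rows x cols matrix. A stride of 0 is interpreted as
// "use the default", i.e. the packed stride for the given layout.
std::size_t resolve_stride(std::size_t rows,
                           std::size_t cols,
                           std::size_t stride,
                           bool row_major)
{
    if (stride != 0)
    {
        return stride; // caller supplied an explicit stride, keep it
    }
    // Packed default: row-major steps by the number of columns,
    // column-major steps by the number of rows.
    return row_major ? cols : rows;
}
```

Under this assumption, a row-major M x K matrix passed with stride 0 would be indexed with stride K in both the device kernel and the reference kernel, so neither ends up reading every row from the same address.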

The PR is merged.

AviralGoelAMD, Jun 03 '25 19:06