Tensile
Tensile copied to clipboard
[Feature]: about 7900xtx benchmark.
Suggestion Description
Benchmark show bellow, the performance of the fp32 is very bad, the theoretical performance is 61tflops,actual test is 28tflops less than half of the theoretical. RTX4090 fp32 has 74tflops (theoretical 81t). Is there any room for further improvement, or are there any suggestions for optimization?
#####################
test platform:
Ubuntu 22.04.3 LTS
kernel: 6.3.7+ from https://github.com/ROCm/ROCK-Kernel-Driver.git branch roc-6.0.x
ROCm: 5.7.1
rocBLAS: https://github.com/ROCm/rocBLAS.git branch develop
rocm-libs:
Version: 5.7.0.50700-63~22.04
Priority: optional
Section: devel
Maintainer: ROCm Libs Support <[email protected]>
Installed-Size: 13.3 kB
Depends: hipblas (= 1.1.0.50700-63~22.04), hipblaslt (= 0.3.0.50700-63~22.04), hipfft (= 1.0.12.50700-63~22.04), hipsolver (= 1.8.1.50700-63~22.04), hipsparse (= 2.3.8.50700-63~22.04), miopen-hip (= 2.20.0.50700-63~22.04), rccl (= 2.17.1.50700-63~22.04), rocalution (= 2.1.11.50700-63~22.04), rocblas (= 3.1.0.50700-63~22.04), rocfft (= 1.0.23.50700-63~22.04), rocrand (= 2.10.17.50700-63~22.04), rocsolver (= 3.23.0.50700-63~22.04), rocsparse (= 2.5.4.50700-63~22.04), rocm-core (= 5.7.0.50700-63~22.04), hipblas-dev (= 1.1.0.50700-63~22.04), hipblaslt-dev (= 0.3.0.50700-63~22.04), hipcub-dev (= 2.13.1.50700-63~22.04), hipfft-dev (= 1.0.12.50700-63~22.04), hipsolver-dev (= 1.8.1.50700-63~22.04), hipsparse-dev (= 2.3.8.50700-63~22.04), miopen-hip-dev (= 2.20.0.50700-63~22.04), rccl-dev (= 2.17.1.50700-63~22.04), rocalution-dev (= 2.1.11.50700-63~22.04), rocblas-dev (= 3.1.0.50700-63~22.04), rocfft-dev (= 1.0.23.50700-63~22.04), rocprim-dev (= 2.13.1.50700-63~22.04), rocrand-dev (= 2.10.17.50700-63~22.04), rocsolver-dev (= 3.23.0.50700-63~22.04), rocsparse-dev (= 2.5.4.50700-63~22.04), rocthrust-dev (= 2.18.0.50700-63~22.04), rocwmma-dev (= 1.2.0.50700-63~22.04)
Homepage: https://github.com/RadeonOpenCompute/ROCm
Download-Size: 1,012 B
APT-Sources: https://repo.radeon.com/rocm/apt/5.7 jammy/main amd64 Packages
Description: Radeon Open Compute (ROCm) Runtime software stack
#########################
FP16 benchmark:
./rocblas-bench -f gemm_strided_batched -r f16_r --transposeA N --transposeB N -m 4096 -n 4096 -k 4096 --alpha 1 --lda 4096 --stride_a 16384 --ldb 4096 --stride_b 16384 --beta 0 --ldc 4096 --stride_c 16777216 --batch_count 16
rocBLAS-commit-hash: 0e8898671415956972b82ba31febe192802a2bec
Tensile-commit-hash: 4532f2430eeb6ab54270647871625810932f63f7
Query device success: there are 1 devices
-------------------------------------------------------------------------------
Device ID 0 : Radeon RX 7900 XTX gfx1100
with 25.8 GB memory, max. SCLK 2526 MHz, max. MCLK 1249 MHz, memoryBusWidth 48 Bytes, compute capability 11.0
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 32
-------------------------------------------------------------------------------
rocBLAS info: Time taken to complete rocBLAS library initialization is 1907 milliseconds.
rocBLAS info: maximum library size per device is 0.530579 GB.
rocblas-bench INFO: stride_d < min_stride_d, set stride_d = 524288
transA,transB,M,N,K,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,batch_count,rocblas-Gflops,us
N,N,4096,4096,4096,1,4096,16384,0,4096,16384,4096,16777216,16, 113596, 19358.2
#########################
FP32 benchmark:
./rocblas-bench -f gemm_strided_batched -r f32_r --transposeA N --transposeB N -m 4096 -n 4096 -k 4096 --alpha 1 --lda 4096 --stri
de_a 16384 --ldb 4096 --stride_b 16384 --beta 0 --ldc 4096 --stride_c 16777216 --batch_count 16
rocBLAS-commit-hash: 0e8898671415956972b82ba31febe192802a2bec
Tensile-commit-hash: 4532f2430eeb6ab54270647871625810932f63f7
Query device success: there are 1 devices
-------------------------------------------------------------------------------
Device ID 0 : Radeon RX 7900 XTX gfx1100
with 25.8 GB memory, max. SCLK 2526 MHz, max. MCLK 1249 MHz, memoryBusWidth 48 Bytes, compute capability 11.0
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 32
-------------------------------------------------------------------------------
rocBLAS info: Time taken to complete rocBLAS library initialization is 1887 milliseconds.
rocBLAS info: maximum library size per device is 0.530579 GB.
rocblas-bench INFO: stride_d < min_stride_d, set stride_d = 524288
transA,transB,M,N,K,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,batch_count,rocblas-Gflops,us
N,N,4096,4096,4096,1,4096,16384,0,4096,16384,4096,16777216,16, 27960.4, 78647.8
rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD EPYC 7402 24-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7402 24-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2800
BDFID: 0
Internal Node ID: 0
Compute Unit: 6
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: AMD EPYC 7402 24-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7402 24-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 1
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2800
BDFID: 0
Internal Node ID: 1
Compute Unit: 6
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 3
*******
Name: AMD EPYC 7402 24-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7402 24-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 2
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2800
BDFID: 0
Internal Node ID: 2
Compute Unit: 6
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 65871016(0x3ed1ca8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 65871016(0x3ed1ca8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 65871016(0x3ed1ca8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 4
*******
Name: AMD EPYC 7402 24-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7402 24-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 3
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2800
BDFID: 0
Internal Node ID: 3
Compute Unit: 6
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 65985156(0x3eeda84) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 65985156(0x3eeda84) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 65985156(0x3eeda84) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 5
*******
Name: AMD EPYC 7402 24-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7402 24-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 4
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2800
BDFID: 0
Internal Node ID: 4
Compute Unit: 6
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 6
*******
Name: AMD EPYC 7402 24-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7402 24-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 5
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2800
BDFID: 0
Internal Node ID: 5
Compute Unit: 6
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 7
*******
Name: AMD EPYC 7402 24-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7402 24-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 6
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2800
BDFID: 0
Internal Node ID: 6
Compute Unit: 6
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 8
*******
Name: AMD EPYC 7402 24-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7402 24-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 7
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2800
BDFID: 0
Internal Node ID: 7
Compute Unit: 6
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 0(0x0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 9
*******
Name: gfx1100
Uuid: GPU-b85e402c0fb3757f
Marketing Name: Radeon RX 7900 XTX
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 8
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 6144(0x1800) KB
L3: 98304(0x18000) KB
Chip ID: 29772(0x744c)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2526
BDFID: 768
Internal Node ID: 8
Compute Unit: 96
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 528
SDMA engine uCode:: 19
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 25149440(0x17fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS:
Size: 25149440(0x17fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1100
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
Operating System
Ubuntu 22.04.3 LTS
GPU
7900xtx
ROCm Component
rocBLAS
Hello @Axl-zhang,
Thank you for bringing the performance concern regarding our recently supported architecture to our attention. Your feedback is instrumental in ensuring optimal functionality across all platforms.
To address this matter, I'll be promptly directing this issue to our specialized team for thorough investigation and analysis. Sometimes, certain transpose configurations, such as NT, undergo specific tuning efforts to enhance performance on varying architectures.
In the interim, I kindly suggest exploring different transpose configurations and comparing their impact on performance. Trying out alternatives like NT could potentially reveal improvements. Please feel free to report back on any findings or improvements observed with these adjustments. Your assistance in this matter is highly appreciated.
Thank you for your patience and cooperation as we work to resolve this issue.
Best regards, Wasiq
Assigning to Carson for gfx1100 fp32 Tensile tuning.
the headline 61TFlops spec of rdna3 is kindof a lie, to achieve this rdna3 adds limited fp32 dual issue capability over rdna2. This to use dual issue capability some stars have to align though:
- you need to be able to dual issue and do it by hand in asm
- or you need the compiler to opmize it in, llvm however is pretty terrible at optimizing dual issue in so this rarely just happens.
- unlike CDNA2 or fp16 dual issue in VEGA, RDNA3 can only dual issue some operations (like adds) but iiirc not mults so again most of the time no dual issue, def not in gemm
TLDR: except for very specific circumstances 30 ish TFlops is the max you can expect out of RDNA3
the headline 61TFlops spec of rdna3 is kindof a lie, to achieve this rdna3 adds limited fp32 dual issue capability over rdna2. This to use dual issue capability some stars have to align though:
you need to be able to dual issue and do it by hand in asm
- or you need the compiler to opmize it in, llvm however is pretty terrible at optimizing dual issue in so this rarely just happens.
unlike CDNA2 or fp16 dual issue in VEGA, RDNA3 can only dual issue some operations (like adds) but iiirc not mults so again most of the time no dual issue, def not in gemm
TLDR: except for very specific circumstances 30 ish TFlops is the max you can expect out of RDNA3
several blogs claim the clpeak test can achive 80% theoretical performance if use the wave64 ,but now navi3x Tensile set wavefonts is 32 only achive half theoretical performance
In wave64 mode the hardware can dual issue halfs of the wave, this dosent help you with operations that can not be dual issued at all though, as is the case if i understand the isa documentation correctly in gemm
In wave64 mode the hardware can dual issue halfs of the wave, this dosent help you with operations that can not be dual issued at all though, as is the case if i understand the isa documentation correctly in gemm wave64 mode was disable in clr due to crossline ,so rdna3 only use wave32 in hip but I find the handwrite dual issue instruction in early tensile pr https://github.com/ROCm/Tensile/pull/1625 and revert in https://github.com/ROCm/Tensile/pull/1683
Thanks again bringing this issue to our attention. We noticed that there hasn't been any activity on this issue for a while. To keep our issue tracker clean and focused on active matters, we will be closing this issue if there is no further activity within the next week.
If you still require assistance or believe this issue needs to remain open, please continue the discussion in this new location.
Thank you for your understanding and cooperation.