
Other block sizes for DL and WMMA kernels

Open · atafra opened this issue Mar 10 '23

XDL kernels are instantiated in the library with many different BlockSize/MPerBlock/NPerBlock/etc. template parameters, so it's easy to pick a good set of parameters for a particular convolution. Unfortunately, the DL and WMMA kernels (e.g. DeviceGroupedConvFwdDlMultipleD_NHWC_KYXC_NHWK) are instantiated with only a single set of template parameters (apart from the spatial dimensions), which often performs poorly for certain convolution sizes. This is the one: https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/76fcdc60e9ef848d23527079ad59ca6f4040ca59/library/src/tensor_operation_instance/gpu/grouped_conv2d_fwd/device_grouped_conv2d_fwd_dl_gnhwc_gkyxc_gnhwk_f16_instance.cpp#L51

The problem is that without very deep knowledge of how these kernels work, it's extremely difficult to come up with different combinations of template parameters that are also valid.

For example, the DeviceGroupedConvFwdDlMultipleD_NHWC_KYXC_NHWK kernel is instantiated with only BlockSize=256, MPerBlock=128, NPerBlock=128. I need a valid set of parameters for an FP16 2D convolution where NPerBlock=64. What other template parameters should be changed as well, and to which exact values, to have a valid configuration?
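
To make the difficulty concrete, here is a small self-contained sketch of the kind of coupling I mean. These are not the actual constraints of DeviceGroupedConvFwdDlMultipleD_NHWC_KYXC_NHWK (those live inside the kernel and are more involved); the field names and numbers below are made up purely to show why changing NPerBlock in isolation tends to produce an invalid configuration:

```cpp
// NOT the real constraints of DeviceGroupedConvFwdDlMultipleD_NHWC_KYXC_NHWK;
// a made-up model that only illustrates how the tile parameters are coupled.
#include <cstdio>

struct TileConfig
{
    int BlockSize;  // threads per workgroup
    int MPerBlock;  // rows of the output tile computed by one workgroup
    int NPerBlock;  // columns of the output tile computed by one workgroup
    int MPerThread; // rows computed by a single thread (hypothetical name)
    int NPerThread; // columns computed by a single thread (hypothetical name)
};

// Assume every thread owns an MPerThread x NPerThread micro-tile and the
// threads of one block must exactly cover the MPerBlock x NPerBlock tile.
constexpr bool IsConsistent(TileConfig c)
{
    return c.MPerBlock % c.MPerThread == 0 && c.NPerBlock % c.NPerThread == 0 &&
           (c.MPerBlock / c.MPerThread) * (c.NPerBlock / c.NPerThread) == c.BlockSize;
}

int main()
{
    constexpr TileConfig original{256, 128, 128, 8, 8}; // consistent: 16 * 16 = 256 threads
    constexpr TileConfig naive{256, 128, 64, 8, 8};     // NPerBlock halved, nothing else changed
    constexpr TileConfig adjusted{256, 128, 64, 8, 4};  // per-thread tile shrunk to match

    std::printf("original : %s\n", IsConsistent(original) ? "ok" : "invalid");
    std::printf("naive    : %s\n", IsConsistent(naive) ? "ok" : "invalid");
    std::printf("adjusted : %s\n", IsConsistent(adjusted) ? "ok" : "invalid");
}
```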

atafra avatar Mar 10 '23 00:03 atafra

Yes. CK currently exposes all tuning parameters; many of them are hardware/algorithm related and are difficult for external developers and users to follow. We are adding more instances for the DL kernels and will keep you informed.

zjing14 avatar Mar 13 '23 16:03 zjing14

Hi @zjing14, I'm trying to understand the parameters of the basic GEMM functions, such as BlockSize, M/N/KPerBlock, M/NPerXDL and M/NXdlPerWave, in order to find the best kernel for different problem sizes. Is there an official document somewhere that explains what these mean? Thank you :)

I also tried to map these concepts to the CUDA case. My current understanding is:

(1) M/N/KPerBlock = thread block tile; this work is assigned to an AMD GPU compute unit.
(2) M/NPerXDL * M/NXdlPerWave = warp tile; this work is assigned to a wavefront (64 threads, right?).
(3) M/NXdlPerWave = op tile; this work is handled by the matrix cores.

Am I understanding the parameters correctly?

A code example here: https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/5356c4a943a35e74d7cdc69486afcb8703b9a59a/example/01_gemm/gemm_xdl_fp16.cpp#L37C1-L37C1
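
To make the question concrete, here is the arithmetic I assume these parameters follow (a minimal sketch; the values are copied by eye from the linked gemm_xdl_fp16.cpp example and the relations are my own reading, not official documentation):

```cpp
// A sketch of the assumed tile hierarchy:
//   block tile (MPerBlock x NPerBlock) -> one workgroup / compute unit
//   wave tile                          -> one wavefront (64 threads)
//   MFMA tile  (MPerXDL x NPerXDL)     -> one matrix-core instruction
#include <cstdio>

int main()
{
    // Values in the spirit of example/01_gemm/gemm_xdl_fp16.cpp; treat them
    // as illustrative and check the file for the exact instance you use.
    constexpr int BlockSize   = 256;
    constexpr int MPerBlock   = 256;
    constexpr int NPerBlock   = 128;
    constexpr int MPerXDL     = 32; // rows of one MFMA (matrix core) tile
    constexpr int NPerXDL     = 32; // columns of one MFMA tile
    constexpr int MXdlPerWave = 4;  // MFMA tiles per wave along M
    constexpr int NXdlPerWave = 2;  // MFMA tiles per wave along N
    constexpr int WaveSize    = 64; // threads per wavefront on CDNA GPUs

    // How the waves of one workgroup are laid out over the block tile.
    constexpr int MWaves = MPerBlock / (MXdlPerWave * MPerXDL); // = 2
    constexpr int NWaves = NPerBlock / (NXdlPerWave * NPerXDL); // = 2

    static_assert(MWaves * NWaves == BlockSize / WaveSize,
                  "waves must exactly cover the block tile");

    std::printf("block tile %dx%d = %dx%d waves of %dx%d wave tiles\n",
                MPerBlock, NPerBlock, MWaves, NWaves,
                MXdlPerWave * MPerXDL, NXdlPerWave * NPerXDL);
}
```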

TerrenceZhangX avatar Nov 14 '23 04:11 TerrenceZhangX

@TerrenceZhangX Hello, have you figured out the parameter questions above?

xiaobo1025 avatar Feb 23 '24 10:02 xiaobo1025

@atafra @TerrenceZhangX Do you still need assistance with this ticket? Thanks!

ppanchad-amd avatar Aug 20 '24 20:08 ppanchad-amd

Hi @ppanchad-amd , yes, appreciate it if you can help with this. Thanks!

TerrenceZhangX avatar Aug 20 '24 21:08 TerrenceZhangX

@TerrenceZhangX I have created an internal ticket to get assistance with this issue. Thanks!

ppanchad-amd avatar Aug 21 '24 14:08 ppanchad-amd

Hi @TerrenceZhangX and @atafra,

Per @zjing14's comments, external developers are not expected to customize the kernel template parameters themselves. You might find some more information in a related information request (https://github.com/ROCm/MIOpen/issues/1092), but the recommendation is to use the existing DL kernel instances.
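
For reference, the usual way the shipped instances are consumed is to enumerate them, skip the ones that don't support your problem, time the rest and keep the fastest; this is roughly what ckProfiler and the client examples do. The sketch below is only a toy illustration of that selection loop; the types and names are placeholders, not the actual CK API:

```cpp
// Toy illustration of "enumerate instances, filter by support, keep the
// fastest". Placeholder types; the real workflow goes through CK's shipped
// instances and their support checks.
#include <cstdio>
#include <functional>
#include <limits>
#include <string>
#include <vector>

struct Problem { int N, K, C, Hi, Wi; }; // simplified conv description

struct Instance // stand-in for a pre-built device-op instance
{
    std::string name;
    std::function<bool(const Problem&)> is_supported;
    std::function<double(const Problem&)> run_and_time_ms; // fake timings here
};

int main()
{
    const Problem prob{1, 256, 64, 224, 224};

    // In the real workflow this list comes from the library, not from hand-written entries.
    const std::vector<Instance> instances = {
        {"tile_128x128", [](const Problem&) { return true; },
                         [](const Problem&) { return 1.9; }},
        {"tile_128x64",  [](const Problem& p) { return p.K % 64 == 0; },
                         [](const Problem&) { return 1.4; }},
    };

    std::string best_name;
    double best_ms = std::numeric_limits<double>::max();
    for (const auto& inst : instances)
    {
        if (!inst.is_supported(prob)) continue; // skip unsupported instances
        const double ms = inst.run_and_time_ms(prob);
        if (ms < best_ms) { best_ms = ms; best_name = inst.name; }
    }
    std::printf("best instance: %s (%.2f ms)\n", best_name.c_str(), best_ms);
}
```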

jamesxu2 avatar Aug 29 '24 20:08 jamesxu2