Other block sizes for DL and WMMA kernels
The XDL kernels are instantiated in the library with many different combinations of BlockSize/MPerBlock/NPerBlock/etc. template parameters, so it's easy to pick a good set of parameters for a particular convolution. Unfortunately, the DL and WMMA kernels (e.g. DeviceGroupedConvFwdDlMultipleD_NHWC_KYXC_NHWK) are instantiated with only a single set of template parameters (apart from the spatial dimensions), which often performs poorly for certain convolution sizes. This is the instantiation in question: https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/76fcdc60e9ef848d23527079ad59ca6f4040ca59/library/src/tensor_operation_instance/gpu/grouped_conv2d_fwd/device_grouped_conv2d_fwd_dl_gnhwc_gkyxc_gnhwk_f16_instance.cpp#L51
The problem is that without deep knowledge of how these kernels work, it's extremely difficult to come up with alternative combinations of template parameters that are still valid.
For example, the DeviceGroupedConvFwdDlMultipleD_NHWC_KYXC_NHWK kernel is instantiated with only BlockSize=256, MPerBlock=128, NPerBlock=128. I need a valid set of parameters for an FP16 2D convolution where NPerBlock=64. What other template parameters should be changed as well, and to which exact values, to have a valid configuration?
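For reference, this is the rough arithmetic I've been using to guess which parameters have to move together. The relations below are only my own reading of the kernel source, not anything documented, and they don't cover the A/B block-transfer thread-cluster parameters, so a combination that passes this check might still be invalid:

```cpp
// Back-of-envelope consistency check for a hypothetical DL instance with
// NPerBlock = 64. These relations are my own assumption about how the DL
// tile parameters fit together, NOT an official CK rule, so a configuration
// that passes them may still be rejected by IsSupportedArgument() or fail
// to compile.
#include <cstdio>

int main()
{
    // My reading of the default instance linked above: BlockSize=256,
    // 128x128 block tile, 4x4 per-thread tile, and (assumed) 16x16 thread
    // cluster over the M1/N1 dimensions.
    constexpr int BlockSize     = 256;
    constexpr int MPerBlock     = 128;
    constexpr int NPerBlock     = 64;   // the value I want to try
    constexpr int M1PerThread   = 4;
    constexpr int N1PerThread   = 4;
    constexpr int M1ClusterSize = 16;   // assumed product of the M1 thread-cluster dims
    constexpr int N1ClusterSize = 16;   // assumed product of the N1 thread-cluster dims

    // Assumption 1: the thread cluster accounts for every thread in the block.
    const bool threads_ok = (M1ClusterSize * N1ClusterSize == BlockSize);

    // Assumption 2: the block tile is an integer number of cluster-sized sub-tiles.
    const bool m_ok = (MPerBlock % (M1PerThread * M1ClusterSize) == 0);
    const bool n_ok = (NPerBlock % (N1PerThread * N1ClusterSize) == 0);

    std::printf("threads_ok=%d m_ok=%d n_ok=%d\n", threads_ok, m_ok, n_ok);
    // With NPerBlock=64: 64 % (4*16) == 0, so this combination at least passes
    // my guessed tile-coverage check; the block-transfer cluster lengths would
    // still need to divide the smaller tile as well, which I can't verify.
}
```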
Yes. CK currently exposes all tuning parameters, many of which are hardware/algorithm related and difficult for external developers/users to follow. We are adding more instances for the DL kernels and will keep you informed.
Hi @zjing14, I'm trying to understand the parameters in the basic GEMM functions, such as BlockSize, M/N/K per Block, M/N per XDL, and M/N Xdl per wave, in order to find the best kernel for different problem sizes. Is there an official document somewhere explaining what these mean? Thank you :)
I also tried to map these concepts to the CUDA case, where I think it might be that:
(1) M/N/K per Block = thread block tile; the work is assigned to an AMD GPU compute unit.
(2) M/N per XDL * M/N Xdl per wave = warp tile; the work is assigned to a wavefront (64 threads, right?).
(3) M/N Xdl per wave = op tile; the work is handled by the matrix cores.
Am I understanding the parameters correctly?
A code example is here: https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/5356c4a943a35e74d7cdc69486afcb8703b9a59a/example/01_gemm/gemm_xdl_fp16.cpp#L37C1-L37C1
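To make my guess concrete, this is how I think the numbers in that example fit together. The arithmetic below is only my own mental model of how the tile levels nest, not something from an official document, so please correct me if the relations are wrong:

```cpp
// My current understanding of the XDL tile hierarchy, plugging in what I
// believe are the parameters from the linked gemm_xdl_fp16.cpp example.
// The nesting relations are my own assumption, not an official CK formula.
#include <cstdio>

int main()
{
    constexpr int WaveSize    = 64;  // AMD wavefront size
    constexpr int BlockSize   = 256;
    constexpr int MPerBlock   = 256; // block tile ("thread block tile" in CUDA terms)
    constexpr int NPerBlock   = 128;
    constexpr int MPerXDL     = 32;  // one matrix-core (MFMA) instruction tile
    constexpr int NPerXDL     = 32;
    constexpr int MXdlPerWave = 4;   // MFMA tiles repeated by one wave along M
    constexpr int NXdlPerWave = 2;   // ... and along N

    // Wave tile = MFMA tile * repeats per wave ("warp tile" in CUDA terms).
    constexpr int MPerWave = MPerXDL * MXdlPerWave; // 128
    constexpr int NPerWave = NPerXDL * NXdlPerWave; // 64

    // Waves needed to cover the block tile.
    constexpr int MWaves = MPerBlock / MPerWave; // 2
    constexpr int NWaves = NPerBlock / NPerWave; // 2

    // If my mental model is right, the wave grid uses exactly BlockSize threads.
    static_assert(MWaves * NWaves * WaveSize == BlockSize,
                  "wave tiling does not match the block size");

    std::printf("wave tile %dx%d, %dx%d waves per block\n",
                MPerWave, NPerWave, MWaves, NWaves);
}
```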
Hello @TerrenceZhangX, have the earlier questions about the parameters been resolved for you?
@atafra @TerrenceZhangX Do you still need assistance with this ticket? Thanks!
Hi @ppanchad-amd , yes, appreciate it if you can help with this. Thanks!
@TerrenceZhangX I have created an internal ticket to assist with this issue. Thanks!
Hi @TerrenceZhangX and @atafra,
Per @zjing14's comments, external developers are not expected to customize the kernel template parameters themselves. You might find some more information in a related information request (https://github.com/ROCm/MIOpen/issues/1092), but the recommendation is to use the existing DL kernel instances.
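If it helps, the usual way to get at the pre-built instances (rather than instantiating the kernel templates yourself) is through the instance factory, roughly along the lines below. The header paths, interface name, and template-parameter list are from memory of the client_example code and can differ between CK versions, so please treat this as a sketch to check against your checkout rather than copy-paste code:

```cpp
// Sketch: enumerating the pre-built 2D FP16 grouped conv fwd instances via
// the instance factory instead of hand-writing kernel template parameters.
// Interface/header names are assumptions based on the client_example code
// and may need adjusting for your CK version.
#include <iostream>

#include "ck/ck.hpp"
#include "ck/tensor_operation/gpu/device/tensor_layout.hpp"
#include "ck/tensor_operation/gpu/element/element_wise_operation.hpp"
#include "ck/library/tensor_operation_instance/gpu/grouped_convolution_forward.hpp"

int main()
{
    using PassThrough = ck::tensor_operation::element_wise::PassThrough;
    namespace conv    = ck::tensor_layout::convolution;

    // 2D FP16 grouped conv fwd, GNHWC/GKYXC/GNHWK layouts, no extra D tensors.
    using DeviceOp = ck::tensor_operation::device::DeviceGroupedConvFwdMultipleD<
        2, conv::GNHWC, conv::GKYXC, ck::Tuple<>, conv::GNHWK,
        ck::half_t, ck::half_t, ck::Tuple<>, ck::half_t,
        PassThrough, PassThrough, PassThrough>;

    const auto op_ptrs = ck::tensor_operation::device::instance::
        DeviceOperationInstanceFactory<DeviceOp>::GetInstances();

    // List what is available; a real client would build an Argument for its
    // problem size, call IsSupportedArgument(), and time the supported ones.
    for(const auto& op : op_ptrs)
        std::cout << op->GetTypeString() << '\n';
}
```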