
Jeonghyun/ckb add remaining fwd conv device ops

Open · JH-Leon-KIM-AMD opened this pull request 1 month ago · 0 comments

Proposed changes

This PR adds factory support for the remaining forward convolution device operations in CK Builder.

Jira Ticket: https://amd-hub.atlassian.net/jira/software/c/projects/ALMIOPEN/boards/319/backlog?selectedIssue=ALMIOPEN-350

Task 350 - Add remaining forward convolution device operations:

  1. DeviceGroupedConvFwdDlMultipleD_NHWC_KYXC_NHWK

    • Added DL (Direct Load) factory specialization for NHWC layout-specific convolution
    • Created new DL-specific algorithm descriptor with 30 template parameters
    • Added test helper function and 3 test cases covering DEFAULT and FILTER_1X1_PAD0 specializations
    • All tests passing (15/15 builder tests)
  2. DeviceGroupedConvFwdMultipleD_Xdl_CShuffle_Large_Tensor

    • Added Large_Tensor factory specialization for N-dimension splitting (larger-than-memory tensors)
    • Implemented macro collision workaround using pragma push/pop for GridwiseGemmTemplateParameters
    • Reuses existing XDL algorithm descriptor (42 identical template parameters)
    • Added test helper function and 2 test cases covering DEFAULT and FILTER_1X1_PAD0 specializations
    • All tests passing (15/15 builder tests)

This completes Task 350 - all 4 forward convolution device operations are now supported in CK Builder:

  • ✅ DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle (by Ville)
  • ✅ DeviceGroupedConvFwdMultipleD_Wmma_CShuffle (by Ville)
  • ✅ DeviceGroupedConvFwdDlMultipleD_NHWC_KYXC_NHWK (this PR)
  • ✅ DeviceGroupedConvFwdMultipleD_Xdl_CShuffle_Large_Tensor (this PR)

Checklist

  • [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run
  • [x] I have added inline documentation which enables the maintainers to understand the motivation
  • [ ] I have removed the stale documentation which is no longer relevant after this pull request
  • [ ] (If this is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • [x] I have run clang-format on all changed files
  • [x] Any dependent changes have been merged

Discussion

Design Decisions

1. DL Algorithm Descriptor (30 parameters)

  • DL uses VALU instructions instead of XDL matrix cores, requiring a different parameter structure
  • Supports only the fixed NHWC_KYXC_NHWK layout (unlike the layout-flexible XDL ops)
  • Created separate descriptor type: ConvAlgorithm_DeviceGroupedConvFwdDlMultipleD_NHWC_KYXC_NHWK
  • New DL-specific concepts: DlThreadConfigDescriptor, DlThreadClusterDescriptor, DlBlockTransferK0M0M1K1Descriptor, etc.

2. Large_Tensor Descriptor Reuse (42 parameters)

  • Large_Tensor has identical template parameters to regular XDL CShuffle
  • Only difference: internal SplitN=true flag in device operation (not exposed in factory interface)
  • Reuses existing descriptor: ConvAlgorithm_DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle
  • No new descriptor or concepts needed

3. Macro Collision Workaround

  • Both device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp (line 41) and device_grouped_conv_fwd_multiple_d_xdl_large_tensor_cshuffle.hpp (line 51) define GridwiseGemmTemplateParameters macro without #undef
  • Used #pragma push_macro/#pragma pop_macro to isolate Large_Tensor header's macro scope
  • This may need a proper fix in the CK headers themselves (e.g. an #undef after the macro's last use)

JH-Leon-KIM-AMD · Nov 04 '25