composable_kernel
Jeonghyun/ckb add remaining fwd conv device ops
Proposed changes
This PR adds factory support for the remaining forward convolution device operations in CK Builder.
Jira Ticket: https://amd-hub.atlassian.net/jira/software/c/projects/ALMIOPEN/boards/319/backlog?selectedIssue=ALMIOPEN-350
Task 350 - Add remaining forward convolution device operations:
- **DeviceGroupedConvFwdDlMultipleD_NHWC_KYXC_NHWK**
  - Added a DL (Direct Load) factory specialization for the NHWC layout-specific convolution
  - Created a new DL-specific algorithm descriptor with 30 template parameters
  - Added a test helper function and 3 test cases covering the DEFAULT and FILTER_1X1_PAD0 specializations
  - All tests passing (15/15 builder tests)
- **DeviceGroupedConvFwdMultipleD_Xdl_CShuffle_Large_Tensor**
  - Added a Large_Tensor factory specialization for N-dimension splitting (larger-than-memory tensors)
  - Implemented a macro-collision workaround using `#pragma push_macro`/`#pragma pop_macro` for `GridwiseGemmTemplateParameters`
  - Reuses the existing XDL algorithm descriptor (42 identical template parameters)
  - Added a test helper function and 2 test cases covering the DEFAULT and FILTER_1X1_PAD0 specializations
  - All tests passing (15/15 builder tests)
This completes Task 350 - all 4 forward convolution device operations are now supported in CK Builder:
- ✅ DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle (by Ville)
- ✅ DeviceGroupedConvFwdMultipleD_Wmma_CShuffle (by Ville)
- ✅ DeviceGroupedConvFwdDlMultipleD_NHWC_KYXC_NHWK (this PR)
- ✅ DeviceGroupedConvFwdMultipleD_Xdl_CShuffle_Large_Tensor (this PR)
Checklist
- [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally
- [ ] I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, IF the test takes more than 30 seconds to run
- [x] I have added inline documentation which enables the maintainers to understand the motivation
- [ ] I have removed the stale documentation which is no longer relevant after this pull request
- [ ] (If this is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
- [x] I have run `clang-format` on all changed files
- [x] Any dependent changes have been merged
Discussion
Design Decisions
1. DL Algorithm Descriptor (30 parameters)
- DL uses VALU instructions instead of XDL matrix cores, requiring a different parameter structure
- Supports only the fixed NHWC_KYXC_NHWK layout (not layout-flexible like XDL)
- Created a separate descriptor type: `ConvAlgorithm_DeviceGroupedConvFwdDlMultipleD_NHWC_KYXC_NHWK`
- New DL-specific concepts: `DlThreadConfigDescriptor`, `DlThreadClusterDescriptor`, `DlBlockTransferK0M0M1K1Descriptor`, etc. (a rough sketch follows this list)
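
As a rough illustration only, the sketch below shows one way such a descriptor concept could be expressed in C++20. The required member names (`MPerThread`, `NPerThread`, `KPerThread`) and the example struct are assumptions for illustration, not the actual CK Builder definitions.

```cpp
#include <concepts>
#include <cstddef>

// Hypothetical sketch of a DL thread-configuration descriptor concept.
// The constrained members are assumed; the real DlThreadConfigDescriptor
// in CK Builder may require different names.
template <typename T>
concept DlThreadConfigDescriptor = requires {
    { T::MPerThread } -> std::convertible_to<std::size_t>;
    { T::NPerThread } -> std::convertible_to<std::size_t>;
    { T::KPerThread } -> std::convertible_to<std::size_t>;
};

// An example descriptor type that would satisfy the sketched concept.
struct ExampleDlThreadConfig
{
    static constexpr std::size_t MPerThread = 4;
    static constexpr std::size_t NPerThread = 4;
    static constexpr std::size_t KPerThread = 1;
};

static_assert(DlThreadConfigDescriptor<ExampleDlThreadConfig>);
```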
2. Large_Tensor Descriptor Reuse (42 parameters)
- Large_Tensor has identical template parameters to regular XDL CShuffle
- The only difference is an internal `SplitN = true` flag in the device operation (not exposed in the factory interface)
- Reuses the existing descriptor: `ConvAlgorithm_DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle`
- No new descriptor or concepts needed (see the abstract sketch below)
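
Purely as an abstract illustration of why no new descriptor is needed (not actual CK code): when two device-op templates take an identical parameter list, a single descriptor-to-parameters mapping can instantiate either of them. All names below are hypothetical.

```cpp
#include <type_traits>

// Two device-op templates with identical parameter lists; the Large_Tensor
// behaviour (SplitN) lives inside the op, not in its template parameters.
template <int BlockSize, int MPerBlock, int NPerBlock>
struct RegularXdlOp
{
};

template <int BlockSize, int MPerBlock, int NPerBlock>
struct LargeTensorXdlOp
{
};

// One "descriptor" (here just a fixed parameter bundle) serves both templates.
template <template <int, int, int> class Op>
using MakeOp = Op<256, 128, 128>;

static_assert(std::is_same_v<MakeOp<RegularXdlOp>, RegularXdlOp<256, 128, 128>>);
static_assert(std::is_same_v<MakeOp<LargeTensorXdlOp>, LargeTensorXdlOp<256, 128, 128>>);
```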
3. Macro Collision Workaround
- Both `device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp` (line 41) and `device_grouped_conv_fwd_multiple_d_xdl_large_tensor_cshuffle.hpp` (line 51) define the `GridwiseGemmTemplateParameters` macro without an `#undef`
- Used `#pragma push_macro` / `#pragma pop_macro` to isolate the Large_Tensor header's macro scope (sketched below)
- This may eventually need a proper fix in the CK headers themselves
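
For reference, the push/pop pattern looks roughly like this; the header names come from this PR, while the include paths (abbreviated here) and the exact placement in the factory source are assumptions.

```cpp
// Earlier include defines GridwiseGemmTemplateParameters (line 41 of that header).
#include "device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp" // include path abbreviated

// Save the current definition and clear it so the Large_Tensor header can
// install its own copy without a redefinition warning.
#pragma push_macro("GridwiseGemmTemplateParameters")
#undef GridwiseGemmTemplateParameters

#include "device_grouped_conv_fwd_multiple_d_xdl_large_tensor_cshuffle.hpp" // include path abbreviated

// Restore the definition from the first header for the rest of this file.
#pragma pop_macro("GridwiseGemmTemplateParameters")
```

`#pragma push_macro` / `#pragma pop_macro` are supported by GCC, Clang, and MSVC, so the workaround itself is portable; the cleaner long-term fix noted above would be for the CK headers to `#undef` the macro themselves.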