composable_kernel
Jeonghyun/ckb add remaining fwd conv device ops
Proposed changes
This PR adds factory support for the remaining forward convolution device operations in CK Builder.
Jira Ticket: https://amd-hub.atlassian.net/jira/software/c/projects/ALMIOPEN/boards/319/backlog?selectedIssue=ALMIOPEN-350
Task 350 - Add remaining forward convolution device operations:
- **DeviceGroupedConvFwdDlMultipleD_NHWC_KYXC_NHWK**
  - Added a DL (Direct Load) factory specialization for the NHWC layout-specific convolution
  - Created a new DL-specific algorithm descriptor with 30 template parameters
  - Added a test helper function and 3 test cases covering the DEFAULT and FILTER_1X1_PAD0 specializations
  - All tests passing (15/15 builder tests)
- **DeviceGroupedConvFwdMultipleD_Xdl_CShuffle_Large_Tensor**
  - Added a Large_Tensor factory specialization for N-dimension splitting (larger-than-memory tensors)
  - Implemented a macro-collision workaround using `#pragma push_macro`/`#pragma pop_macro` for `GridwiseGemmTemplateParameters`
  - Reuses the existing XDL algorithm descriptor (42 identical template parameters)
  - Added a test helper function and 2 test cases covering the DEFAULT and FILTER_1X1_PAD0 specializations
  - All tests passing (15/15 builder tests)
This completes Task 350 - all 4 forward convolution device operations are now supported in CK Builder:
- ✅ DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle (by Ville)
- ✅ DeviceGroupedConvFwdMultipleD_Wmma_CShuffle (by Ville)
- ✅ DeviceGroupedConvFwdDlMultipleD_NHWC_KYXC_NHWK (this PR)
- ✅ DeviceGroupedConvFwdMultipleD_Xdl_CShuffle_Large_Tensor (this PR)
Checklist
- [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally
- [ ] I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, IF the test takes more than 30 seconds to run
- [x] I have added inline documentation which enables the maintainers to understand the motivation
- [ ] I have removed the stale documentation which is no longer relevant after this pull request
- [ ] (If this is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
- [x] I have run `clang-format` on all changed files
- [x] Any dependent changes have been merged
Discussion
Design Decisions
1. DL Algorithm Descriptor (30 parameters)
- DL uses VALU instructions instead of XDL matrix cores, requiring a different parameter structure
- Supports only the fixed NHWC_KYXC_NHWK layout (not layout-flexible like XDL)
- Created a separate descriptor type: `ConvAlgorithm_DeviceGroupedConvFwdDlMultipleD_NHWC_KYXC_NHWK`
- New DL-specific concepts: `DlThreadConfigDescriptor`, `DlThreadClusterDescriptor`, `DlBlockTransferK0M0M1K1Descriptor`, etc. (a rough sketch follows this list)
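
As a rough illustration only, the sketch below shows one way such a descriptor concept could be expressed in C++20. The required member names (`MPerThread`, `NPerThread`, `KPerThread`) and the example struct are assumptions for illustration, not the actual CK Builder definitions.

```cpp
#include <concepts>
#include <cstddef>

// Hypothetical sketch of a DL thread-configuration descriptor concept.
// The constrained members are assumed; the real DlThreadConfigDescriptor
// in CK Builder may require different names.
template <typename T>
concept DlThreadConfigDescriptor = requires {
    { T::MPerThread } -> std::convertible_to<std::size_t>;
    { T::NPerThread } -> std::convertible_to<std::size_t>;
    { T::KPerThread } -> std::convertible_to<std::size_t>;
};

// An example descriptor type that would satisfy the sketched concept.
struct ExampleDlThreadConfig
{
    static constexpr std::size_t MPerThread = 4;
    static constexpr std::size_t NPerThread = 4;
    static constexpr std::size_t KPerThread = 1;
};

static_assert(DlThreadConfigDescriptor<ExampleDlThreadConfig>);
```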
2. Large_Tensor Descriptor Reuse (42 parameters)
- Large_Tensor has identical template parameters to regular XDL CShuffle
- The only difference is an internal `SplitN = true` flag in the device operation (not exposed in the factory interface)
- Reuses the existing descriptor: `ConvAlgorithm_DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle`
- No new descriptor or concepts needed (see the abstract sketch below)
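
Purely as an abstract illustration of why no new descriptor is needed (not actual CK code): when two device-op templates take an identical parameter list, a single descriptor-to-parameters mapping can instantiate either of them. All names below are hypothetical.

```cpp
#include <type_traits>

// Two device-op templates with identical parameter lists; the Large_Tensor
// behaviour (SplitN) lives inside the op, not in its template parameters.
template <int BlockSize, int MPerBlock, int NPerBlock>
struct RegularXdlOp
{
};

template <int BlockSize, int MPerBlock, int NPerBlock>
struct LargeTensorXdlOp
{
};

// One "descriptor" (here just a fixed parameter bundle) serves both templates.
template <template <int, int, int> class Op>
using MakeOp = Op<256, 128, 128>;

static_assert(std::is_same_v<MakeOp<RegularXdlOp>, RegularXdlOp<256, 128, 128>>);
static_assert(std::is_same_v<MakeOp<LargeTensorXdlOp>, LargeTensorXdlOp<256, 128, 128>>);
```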
3. Macro Collision Workaround
- Both `device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp` (line 41) and `device_grouped_conv_fwd_multiple_d_xdl_large_tensor_cshuffle.hpp` (line 51) define the `GridwiseGemmTemplateParameters` macro without an `#undef`
- Used `#pragma push_macro` / `#pragma pop_macro` to isolate the Large_Tensor header's macro scope (sketched below)
- This may eventually need a proper fix in the CK headers themselves
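
For reference, the push/pop pattern looks roughly like this; the header names come from this PR, while the include paths (abbreviated here) and the exact placement in the factory source are assumptions.

```cpp
// Earlier include defines GridwiseGemmTemplateParameters (line 41 of that header).
#include "device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp" // include path abbreviated

// Save the current definition and clear it so the Large_Tensor header can
// install its own copy without a redefinition warning.
#pragma push_macro("GridwiseGemmTemplateParameters")
#undef GridwiseGemmTemplateParameters

#include "device_grouped_conv_fwd_multiple_d_xdl_large_tensor_cshuffle.hpp" // include path abbreviated

// Restore the definition from the first header for the rest of this file.
#pragma pop_macro("GridwiseGemmTemplateParameters")
```

`#pragma push_macro` / `#pragma pop_macro` are supported by GCC, Clang, and MSVC, so the workaround itself is portable; the cleaner long-term fix noted above would be for the CK headers to `#undef` the macro themselves.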