composable_kernel
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
There are a few things that need to be reconsidered. _Originally posted by @aosewski in https://github.com/ROCm/composable_kernel/pull/1845#pullrequestreview-2601124209_
Tried CK Tile GEMM with the V3 pipeline (https://github.com/ROCm/composable_kernel/blob/develop/example/ck_tile/03_gemm/universal_gemm.cpp) for compute-bound cases (e.g., M = 4096, N = 4096, and K = 4096), but got much worse performance than (https://github.com/ROCm/composable_kernel/blob/develop/example/01_gemm/gemm_xdl_bf16_v3.cpp)...
## Proposed changes Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please...
### Problem Description All of the libraries are hardcoded to be built statically, e.g., https://github.com/ROCm/composable_kernel/blob/develop/library/src/tensor_operation_instance/gpu/CMakeLists.txt#L312 Fedora requires libraries to be shared and versioned. ### Operating System Fedora Rawhide ### CPU...
### Problem Description The logic to enable examples and testing is confusing: https://github.com/ROCm/composable_kernel/blob/develop/CMakeLists.txt#L586 There should be a top-level CMake parameter, something like BUILD_EXAMPLES. ### Operating System Fedora Rawhide ###...
**[Observations]**: When building CK for a specific platform such as `gfx1100`: `sudo CXX=/opt/rocm/bin/amdclang++ cmake -DCMAKE_PREFIX_PATH=/opt/rocm -DCMAKE_BUILD_TYPE=Release -DGPU_ARCHS="gfx1100" ..` naturally `/opt/rocm/lib/cmake/composable_kernel/composable_kerneldevice_mha_operationsTargets.cmake` is **NOT** generated because the platform does not yet fully...
This pull request improves the documentation for building specific profilers in `profiler/README.md`. The main change is the addition of instructions on how to filter which operations to compile using...
## Proposed changes - Added gtests for GEMM operations for the compiler's CI, denoted with a compiler suffix. - The goal is to let the compiler run the tests related to...
## Proposed changes Update CK MXFP4 MoE on gfx950 to: 1. support block_m=32 for the DeepSeek TP8 model. 2. implement the MoE GEMM2 v1 pipeline. ## Checklist Please put an `x` into...
## Proposed changes This PR enables preshuffling the quant weight matrix for preshuffled weight block scale gemm. ## Checklist Please put an `x` into the boxes that apply. You can...