composable_kernel
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
## Proposed changes Support fused MoE with up gemm. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR....
## Proposed changes We have done some optimizations on branch `ck_tile/support-vllm-kcache-layout`. It's time to sync those changes back to `develop` (excluding the V column-major vector load). - Add `kPadHeadDimQ`=`kPadHeadDimV`=**false** fmha...
## Proposed changes 1. Simpler kernel example for layernorm 2. Use `store_tile_raw` for `Default2DEpilogueProblem` to improve performance ## Checklist Use the following command to check performance: `make -j tile_layernorm2d_fwd && ./bin/tile_layernorm2d_fwd`...
Update a8w8 kernel library. Update flush cache timing API.
- Input: FP32, ComputeType: TF32, OutputType: FP32
Add int4+scale support based on Zhang, Jing's pk_i4 work. Compilation passes; functionality passes. Based on Zhang, Jing's PR https://github.com/ROCm/composable_kernel/pull/1572
Implement new data movement and mma layout inside universal gemm.