Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators

Results 276 composable_kernel issues

## Proposed changes
Support fused MoE with up gemm.

## Checklist
Please put an `x` into the boxes that apply. You can also fill these out after creating the PR....

## Proposed changes
We have made some optimizations on the branch `ck_tile/support-vllm-kcache-layout`. It's time to sync those changes back to `develop` (excluding the V column-major vector load).
- Add `kPadHeadDimQ`=`kPadHeadDimV`=**false** fmha...

## Proposed changes
1. Simpler kernel example for layernorm
2. Use `store_tile_raw` for `Default2DEpilogueProblem` to improve performance

## Checklist
Use the following command to check performance: `make -j tile_layernorm2d_fwd && ./bin/tile_layernorm2d_fwd`...

Update a8w8 kernel library. Update flush cache timing API.

- InputType: FP32, ComputeType: TF32, OutputType: FP32

Add int4+scale, based on Zhang, Jing's pk_i4 PR: https://github.com/ROCm/composable_kernel/pull/1572. It compiles and passes functional checks.

noCI

Implement new data movement and mma layout inside universal gemm.

enhancement