
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators

Results 276 composable_kernel issues

## Proposed changes The existing implementation produces incorrect results ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR....

## Proposed changes This PR integrates Universal GEMM into Device Grouped Gemm. Specifically, we replace: The _GridwiseGemm_bk0mk1_bk0nk1_mn_xdlops_v2r4r2_ in _device_grouped_gemm_xdl_splitk_cshuffle.hpp_ with _GridwiseGemm_xdl_cshuffle_v3_ We make corresponding changes to the struct Argument and...

## Proposed changes ### Why The current logic is very brittle and can break down when we parse examples that have a new order or define some keys that were previously...

Currently, CK Tile GroupGEMM prepares metadata on the host, which requires transferring metadata back and forth between the device and the host. https://github.com/ROCm/composable_kernel/blob/6b6fcd370bb2e5572422a1ca71d261df02b6263e/include/ck_tile/ops/gemm/kernel/grouped_gemm_kernel.hpp#L98 To avoid this overhead, we need groupGEMM...

Under Investigation

## Proposed changes Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please...

- Recent Triton ping-pong scheduling shows a good performance speedup https://github.com/triton-lang/triton/pull/5018
- CK Tile GEMM should add ping-pong pipeline as a generic optimization

Under Investigation
feature request

## Proposed changes FlexAttention is a customization of Fused Multi-Head Attention where the attention scores are customizable with a function `score_mod (score: float, batch_idx: int, head_idx: int, q_idx: int, v_idx:...

## Proposed changes Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please...

## Proposed changes This is a new Composable Kernel Tile operator that enables a cross-GPU reduce at the block-thread level. There is one (multiple in future) master GPU...