composable_kernel

Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators

Results 276 composable_kernel issues

Sequence length 1 is extremely important for decoding (ASR, text generation, etc.). In onnxruntime, we found that the rocblas gemm + softmax kernel + rocblas gemm is much faster for this case,...

enhancement
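To make the gemm + softmax + gemm pattern from the issue above concrete, here is a minimal pure-Python sketch of attention for a single query row (sequence length 1). The function names, the shapes, and the 1/sqrt(d) scaling are assumptions for illustration; the issue text does not specify them, and this is not the onnxruntime or CK implementation.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of lists (row-major)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Numerically stable softmax applied independently to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def decode_step_attention(q, k, v):
    """Attention for a query of sequence length 1: gemm -> softmax -> gemm.

    Hypothetical shapes: q is 1 x d, k is n x d, v is n x d_v.
    """
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]               # transpose keys: d x n
    scores = matmul(q, k_t)                             # first gemm: 1 x n
    scaled = [[s / math.sqrt(d) for s in scores[0]]]    # assumed scaling
    probs = softmax_rows(scaled)                        # softmax over the n keys
    return matmul(probs, v)                             # second gemm: 1 x d_v
```

With a single query row, both products reduce to matrix-vector work, which may be one reason a sequence of separate library calls can compete with a fused kernel for this shape.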

I am trying to build 0345963eef4f92e9c5eab608bb8557b5463a1dcb on CSC Lumi. I installed the latest release of CMake (3.25.1) to no avail. How do I fix this? I do not see this...

- Use the DPP8 Gemm utilities implemented by @geyyer in (https://github.com/ROCmSoftwarePlatform/composable_kernel/pull/657/) to finish an fp16 Gemm example - @bwroblew

https://github.com/ROCmSoftwarePlatform/composable_kernel/pull/261#discussion_r883726267

code quality

* Full-blown fix for Workaround #687
* We need to fix them as a quality improvement, but for now we are suppressing this warning in immediate releases: http://compiler-ci.amd.com/blue/rest/organizations/jenkins/pipelines/compiler-psdb-amd-stg-open/runs/2540/nodes/282/steps/3202/log/?start=0 e.g. majority of the...

good first issue
code quality
urgency_medium

The ones we need for Transformer Engine are the following: 1) CUBLASLT_EPILOGUE_GELU_AUX step 1: matrix multiplication step 2: apply gelu step 3: store the result to separate...

enhancement
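The steps listed for CUBLASLT_EPILOGUE_GELU_AUX can be sketched in plain Python as a reference. The function name `gemm_gelu_aux` is hypothetical, the exact erf-based GELU is an assumption (a real kernel may use an approximation), and treating the auxiliary output as the pre-GELU matrix follows the cuBLASLt description of this epilogue rather than the truncated issue text.

```python
import math

def gelu(x):
    """Exact GELU via the error function; a real kernel may approximate it."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gemm_gelu_aux(a, b):
    """Return (gelu(a @ b), a @ b) for list-of-lists matrices.

    The second value stands in for the auxiliary buffer. Which tensor the
    issue's step 3 stores is an assumption here, since the text is truncated.
    """
    # Step 1: matrix multiplication.
    pre = [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]
    # Step 2: apply GELU element-wise.
    out = [[gelu(v) for v in row] for row in pre]
    # Step 3: hand back both the activated result and the auxiliary matrix.
    return out, pre
```

Keeping the pre-activation values alongside the output is what lets a backward pass compute the GELU gradient without redoing the matrix product.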

This is a prototype of how MIGraphX would like to run CK device ops using runtime compilation. The Descriptor class will take in naive_tensor_descriptors constructed by MIGraphX, and then be...

Updated judgement of dropout. Performance is improved when p_drop = 0. G0 G1 M K 54 16 512 64 : before : 4.49336 ms, 32.2599 TFlops, 101.206 GB/s -> now:...