composable_kernel
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
Hi there, I am wondering which implementation or example I can refer to if I want to run: RELU(GEMM(A[M,K]_INT8, B[K,N]_INT8) + bias_INT8) = output_INT8. I tried example/03_gemm_bias_relu, but it...
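For reference, a minimal host-side sketch of the requested semantics (int32 accumulation, bias add, ReLU, saturation back to int8). The `requant_scale` parameter is a hypothetical requantization factor, not CK API; this is just the math a fused kernel would have to reproduce:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Reference for RELU(GEMM(A, B) + bias) with int8 inputs and int8 output.
// Accumulate in int32 to avoid overflow; requant_scale (hypothetical) maps
// int32 accumulators back into int8 range before saturating.
void gemm_bias_relu_int8_ref(const std::vector<int8_t>& A,    // M x K, row-major
                             const std::vector<int8_t>& B,    // K x N, row-major
                             const std::vector<int8_t>& bias, // N
                             std::vector<int8_t>& C,          // M x N, row-major
                             int M, int N, int K, float requant_scale)
{
    for(int m = 0; m < M; ++m)
        for(int n = 0; n < N; ++n)
        {
            int32_t acc = 0;
            for(int k = 0; k < K; ++k)
                acc += int32_t(A[m * K + k]) * int32_t(B[k * N + n]);
            acc += int32_t(bias[n]);                              // bias add
            float v = std::max(0.0f, float(acc) * requant_scale); // ReLU
            C[m * N + n] = int8_t(std::min(v, 127.0f));           // saturate
        }
}
```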
Hello, I have some trouble compiling composable_kernel for my AMD GPU architecture (gfx1010):

```
cmake \
  -D CMAKE_PREFIX_PATH=/opt/rocm \
  -D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -D CMAKE_CXX_FLAGS="-O3" \
  -D CMAKE_BUILD_TYPE=Release \
  -D...
```
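The command above is truncated; assuming the architecture is selected with the `GPU_TARGETS` option that appears in the report below, a plausible complete configure line might look like this (a sketch, not a verified build recipe for gfx1010):

```
cmake \
  -D CMAKE_PREFIX_PATH=/opt/rocm \
  -D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -D CMAKE_CXX_FLAGS="-O3" \
  -D CMAKE_BUILD_TYPE=Release \
  -D GPU_TARGETS="gfx1010" \
  ..
```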
[Reproduce]

```
CXX=/opt/rocm/llvm/bin/clang++ cmake \
  -DCMAKE_CXX_COMPILER_LAUNCHER="${COMPILER_LAUNCHER}" \
  -DCMAKE_PREFIX_PATH=/opt/rocm \
  -DDTYPES="fp16;fp32;bf16" \
  -DCMAKE_BUILD_TYPE=Release \
  -DINSTANCES_ONLY=ON \
  -DGPU_TARGETS="gfx1100" ..
```

hence `-DDTYPES="fp16;fp32;bf16"`, then

```
make -j$(nproc)
```

[Observation]

```
/home/junliu/composable_kernel/library/src/tensor_operation_instance/gpu/grouped_gemm_fixed_nk/device_grouped_gemm_xdl_fixed_nk_f16_f8_f16_mk_kn_mn_instance.cpp:18:17: error: no type named 'f8_t' in namespace 'ck'...
```
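The error suggests the f16_f8_f16 grouped-GEMM instance is compiled even though `fp8` is not in the `DTYPES` list, so `ck::f8_t` is never defined. Assuming `fp8` is an accepted `DTYPES` entry (an assumption, not verified against the CMake options), a workaround sketch is to enable it; the underlying bug would be the missing `DTYPES` guard on that instance file:

```
# Workaround sketch: also enable fp8 so ck::f8_t gets defined.
# (Assumes "fp8" is a valid DTYPES entry.)
CXX=/opt/rocm/llvm/bin/clang++ cmake \
  -DCMAKE_PREFIX_PATH=/opt/rocm \
  -DDTYPES="fp16;fp32;bf16;fp8" \
  -DCMAKE_BUILD_TYPE=Release \
  -DINSTANCES_ONLY=ON \
  -DGPU_TARGETS="gfx1100" ..
```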
This is an example of a complex tensor contraction built from 4 real GEMMs. All imaginary and real tensors are explicitly defined.
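For context, this is the standard arithmetic identity behind the 4-GEMM decomposition, with real and imaginary parts stored as separate tensors. A minimal self-contained sketch of the math (not the CK device op):

```cpp
#include <vector>

// Naive real GEMM: C = beta * C + alpha * A * B (row-major, M x K times K x N).
void real_gemm(std::vector<float>& C, const std::vector<float>& A,
               const std::vector<float>& B, int M, int N, int K,
               float alpha, float beta)
{
    for(int m = 0; m < M; ++m)
        for(int n = 0; n < N; ++n)
        {
            float acc = 0.f;
            for(int k = 0; k < K; ++k)
                acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = beta * C[m * N + n] + alpha * acc;
        }
}

// Complex GEMM from 4 real GEMMs, real/imaginary parts explicitly separate:
//   C_re = A_re * B_re - A_im * B_im
//   C_im = A_re * B_im + A_im * B_re
void complex_gemm(std::vector<float>& C_re, std::vector<float>& C_im,
                  const std::vector<float>& A_re, const std::vector<float>& A_im,
                  const std::vector<float>& B_re, const std::vector<float>& B_im,
                  int M, int N, int K)
{
    real_gemm(C_re, A_re, B_re, M, N, K, +1.f, 0.f);
    real_gemm(C_re, A_im, B_im, M, N, K, -1.f, 1.f);
    real_gemm(C_im, A_re, B_im, M, N, K, +1.f, 0.f);
    real_gemm(C_im, A_im, B_re, M, N, K, +1.f, 1.f);
}
```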
This depends on PR#1028. Only a few files are modified:

```
modified: example/53_gemv_splitk/CMakeLists.txt
modified: example/54_tall_and_skinny_gemm_splitk/CMakeLists.txt
modified: example/54_tall_and_skinny_gemm_splitk/run_tall_and_skinny_gemm_splitk_example.inc
modified: include/ck/host_utility/kernel_launch.hpp
modified: include/ck/tensor_operation/gpu/device/impl/device_tall_and_skinny_gemm_splitk.hpp
modified: include/ck/tensor_operation/gpu/grid/gridwise_tall_and_skinny_gemm_splitk.hpp
conflict resolved: library/src/tensor_operation_instance/gpu/CMakeLists.txt
```
Tall-and-skinny GEMM and GEMV files are added so that the examples and ckprofiler work.
FlashAttentionV1:

- forward kloop: [gridwise_batched_mha_fwd_xdl_cshuffle_v1.hpp](https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/mha-train-develop/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_fwd_xdl_cshuffle_v1.hpp)
- backward kloop prototype1: [gridwise_batched_mha_bwd_xdl_cshuffle_kloop_v1.hpp](https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/mha-train-develop/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_kloop_v1.hpp)
- backward kloop prototype2: [gridwise_batched_mha_bwd_xdl_cshuffle_kloop_v2.hpp](https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/mha-train-develop/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_kloop_v2.hpp)

FlashAttentionV2:

- forward kloop: [gridwise_batched_mha_fwd_xdl_cshuffle_v2.hpp](https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/mha-train-develop/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_fwd_xdl_cshuffle_v2.hpp)
- backward qloop from bottom to top prototype1: [gridwise_batched_mha_bwd_xdl_cshuffle_qloop_b2t_v1.hpp](https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/mha-train-develop/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_qloop_b2t_v1.hpp)
- backward qloop from bottom to top...
A gtest is implemented for Complex Contraction Bilinear. As far as I can see, it passes two test suites, test_complex_contraction_bilinear and test_complex_contraction_bilinear_interface:

```
./test_complex_contraction_bilinear
./test_complex_contraction_bilinear_interface
```
@zjing14 This is a fully correct pipeline that supports packed fp4 (two `int4`s in a byte). It is used to demonstrate what might need to be changed to support subtype...
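As background on the "two `int4`s in a byte" layout, a minimal pack/unpack sketch with sign extension; the nibble ordering here (low nibble = element 0) is an assumption and may differ from what this pipeline actually uses:

```cpp
#include <cstdint>

// Pack two signed 4-bit values into one byte:
// low nibble = element 0, high nibble = element 1 (assumed ordering).
inline uint8_t pack_int4x2(int8_t lo, int8_t hi)
{
    return uint8_t(((hi & 0xF) << 4) | (lo & 0xF));
}

// Unpack with sign extension: shift each nibble into the top of an int8_t,
// then arithmetic-shift back down to recover values in [-8, 7].
inline void unpack_int4x2(uint8_t packed, int8_t& lo, int8_t& hi)
{
    lo = int8_t(packed << 4) >> 4;
    hi = int8_t(packed) >> 4;
}
```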