chenwei
Support Layout::kFactor = 8 when loading data from shared memory:

|0 | 16 | 32 | 48 | 64 | 80 | 96 | 112|
|-- | -- | -- |...
Support weight-only GEMM with 2-bit weights. Note: this PR depends on two pull requests in the CUTLASS repo: https://github.com/NVIDIA/cutlass/pull/1512 and https://github.com/NVIDIA/cutlass/pull/1517
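For readers unfamiliar with 2-bit weight-only quantization, here is a minimal host-side sketch of what unpacking such weights can look like. It is not the kernel code from this PR. The helper names `unpack_2bit` and `dequant_2bit`, the packing order (least-significant pair first), and the scale/zero-point scheme are all assumptions made for illustration.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: unpack four unsigned 2-bit codes from one byte.
// Assumed packing order: element 0 lives in bits 1:0, element 3 in bits 7:6.
inline std::array<std::uint8_t, 4> unpack_2bit(std::uint8_t packed) {
    std::array<std::uint8_t, 4> out{};
    for (int i = 0; i < 4; ++i) {
        out[i] = static_cast<std::uint8_t>((packed >> (2 * i)) & 0x3u);
    }
    return out;
}

// Hypothetical dequantization: map a 2-bit code to a float using an
// assumed per-group scale and zero point (neither is taken from the PR).
inline float dequant_2bit(std::uint8_t code, float scale, float zero_point) {
    return (static_cast<float>(code) - zero_point) * scale;
}
```

In a weight-only GEMM the activations stay in higher precision, so a conversion of this kind would happen on the weight operand before the multiply-accumulate.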
"cute/atom/copy_traits_sm90_tma.hpp" include "cute/algorithm/prefetch.hpp" "cute/algorithm/prefetch.hpp" include "cute/atom/copy_atom.hpp" "cute/atom/copy_atom.hpp" inlcude "cute/atom/copy_traits_sm90_tma.hpp" if nvcc version is greater than 12