[QUESTION] For shapes M, N, K where N is not aligned, how should I deal with it?
cmd: $ CUDA_LAUNCH_BLOCKING=1 /vllm-workspace/flux/scripts/launch.sh /vllm-workspace/flux/test/test_ag_kernel_pyshmem.py 4096 72 512 --dtype=float16 --iters=10
@zfy3000163 The error message tells you that the alignment you provided for operand 1 is not supported, so the GEMM operation cannot be performed. You should check the place in your code where you assign the alignment.
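For context, each CUTLASS kernel instantiation carries a static alignment (in elements) for every operand, and the host-side `can_implement()` check rejects a problem whose contiguous dimension is not divisible by it. A minimal sketch of picking the largest alignment a given leading dimension supports (the helper below is hypothetical, not flux or CUTLASS API):

```cpp
#include <cstdint>

// Hypothetical helper: find the largest alignment (in elements) dividing the
// contiguous dimension `ld` of an operand. For fp16, alignment 8 corresponds
// to 128-bit vectorized loads; a kernel built with kAlignment = 8 is rejected
// whenever ld % 8 != 0. Which dimension is contiguous depends on the layout:
// N for a row-major B (K x N), K for a column-major B.
constexpr int32_t largest_valid_alignment(int32_t ld) {
  for (int32_t align = 8; align > 1; align /= 2) {
    if (ld % align == 0) return align;
  }
  return 1;  // scalar access always works, at a bandwidth cost
}

static_assert(largest_valid_alignment(512) == 8);  // K = 512: fully aligned
static_assert(largest_valid_alignment(100) == 4);  // needs a 4-aligned kernel
static_assert(largest_valid_alignment(57) == 1);   // odd: scalar access only
```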
@wenlei-bao Thank you for your reply. May I ask whether CUTLASS supports automatic padding of unaligned matrices? I see that PyTorch runs the above case normally. Thanks!
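(As far as I know, CUTLASS itself does not pad: the alignment is baked into each kernel instantiation, so frameworks either dispatch to a lower-alignment kernel or pad the operand before the GEMM. A hedged sketch of manual zero-padding of B, assuming row-major fp16 and hypothetical device buffers; none of this is flux API:)

```cpp
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Hedged sketch: zero-pad a row-major fp16 B (K x N) out to N_pad, a multiple
// of 8, so a 128-bit-aligned kernel can be selected. The padding columns are
// zero, so the first N columns of the GEMM result are unchanged. d_B and
// d_B_pad are hypothetical device pointers.
cudaError_t pad_operand_b(const __half* d_B, __half* d_B_pad,
                          int K, int N, int N_pad) {
  // zero the destination so the extra columns contribute nothing
  cudaError_t err = cudaMemset(d_B_pad, 0, sizeof(__half) * size_t(K) * N_pad);
  if (err != cudaSuccess) return err;
  // strided 2-D copy: K rows, each copying N elements into a row of pitch N_pad
  return cudaMemcpy2D(d_B_pad, sizeof(__half) * N_pad,
                      d_B, sizeof(__half) * N,
                      sizeof(__half) * N, K, cudaMemcpyDeviceToDevice);
}
```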
code path: include/flux/cuda/gemm_impls/gemm_grouped_impl.hpp
```cpp
// Parse template parameters
static constexpr auto dt_conf =
    to_gemm_dtype_config(make_gemm_dtype_config(meta.dtype()));
using ElementA = decltype(to_cutlass_element(dt_conf.a()));
using ElementB = decltype(to_cutlass_element(dt_conf.b()));
using ElementC = decltype(to_cutlass_element(dt_conf.c()));
using ElementD = decltype(to_cutlass_element(dt_conf.d()));

using ArchTag = decltype(to_cutlass_archtag(meta.arch()));
using ProblemShape = cute::Shape<int, int, int>;

using LayoutA = decltype(to_cutlass_layout_a(meta.gemm_layout()));
static constexpr int32_t kAlignmentA = 8;
using LayoutB = decltype(to_cutlass_layout_b(meta.gemm_layout()));
static constexpr int32_t kAlignmentB = 8;

using ElementOutput = ElementD;
using LayoutOutput = decltype(to_cutlass_layout_c(meta.gemm_layout()));
using ElementAccumulator = decltype(to_cutlass_element(dt_conf.acc()));

/// compose cutlass grouped gemm
// using MmaOp = typename CustomCollectiveMmaBuilder<Element, ArchTag>::Mma;
using MmaOp = cutlass::arch::OpClassTensorOp;
using SmArch = cutlass::arch::Sm80;  // TODO(houqi.1993)

auto
default_gemm_kernel() const {
  using GemmKernel = typename cutlass::gemm::kernel::DefaultGemmGrouped<
      ElementA, LayoutA, cutlass::ComplexTransform::kNone, kAlignmentA,
      ElementB, LayoutB, cutlass::ComplexTransform::kNone, kAlignmentB,
      ElementOutput, LayoutOutput, ElementAccumulator,
      MmaOp, SmArch,
      ShapeMmaThreadBlock, ShapeMmaWarp, ShapeMmaOp,
      EpilogueOp, SwizzleThreadBlock, NumStages,
      GroupScheduleMode::kDeviceOnly>::GemmKernel;
  return make_declval<GemmKernel>();
}
```

Would changing `kAlignmentB` in the code above solve this alignment problem? Thank you!
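(Note that `kAlignmentA`/`kAlignmentB` are compile-time template parameters of `DefaultGemmGrouped`, so changing them means instantiating and recompiling a different kernel rather than flipping a runtime switch. A hedged sketch of what lowering the operand-B alignment might look like; whether flux's dispatch logic accepts such an instantiation is an assumption this thread does not confirm:)

```cpp
// Hypothetical experiment: relax operand-B alignment so shapes whose
// contiguous dimension is not a multiple of 8 still pass can_implement().
// This trades 128-bit vectorized loads for scalar access and usually costs
// bandwidth; it is a sketch, not the fix flux ships.
static constexpr int32_t kAlignmentB = 1;  // was 8 (128-bit / fp16)
```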
> cmd: $ CUDA_LAUNCH_BLOCKING=1 /vllm-workspace/flux/scripts/launch.sh /vllm-workspace/flux/test/test_ag_kernel_pyshmem.py 4096 72 512 --dtype=float16 --iters=10
Sorry, I'm not aware of a file in FLUX called test/test_ag_kernel_pyshmem.py.
Besides, 72 for N is too small; even if it ran, I suspect you would see no performance gain.
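(A back-of-the-envelope illustration of why, assuming a 128-wide threadblock tile in N; the tile shape flux actually picks is an assumption here:)

```cpp
// With N = 72 and a hypothetical 128-wide tile, the single N-tile is barely
// half full, so the tensor cores waste ~44% of their work in that dimension.
constexpr int N = 72, TileN = 128;
constexpr int n_tiles = (N + TileN - 1) / TileN;               // 1 tile
constexpr double utilization = double(N) / (n_tiles * TileN);  // ~0.56
static_assert(n_tiles == 1);
static_assert(utilization > 0.5 && utilization < 0.6);
```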