[QUESTION] For shapes M, N, K where N is not aligned, how should I deal with it?
cmd: $ CUDA_LAUNCH_BLOCKING=1 /vllm-workspace/flux/scripts/launch.sh /vllm-workspace/flux/test/test_ag_kernel_pyshmem.py 4096 72 512 --dtype=float16 --iters=10
@zfy3000163 The error message tells you that the alignment you provided for operand 1 is not supported, so the GEMM operation cannot be performed. You should check the place in your code where you assign the alignment.
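For context, each CUTLASS kernel instantiation carries a static alignment (in elements) for every operand, and the host-side `can_implement()` check rejects a problem whose contiguous dimension is not divisible by it. A minimal sketch of picking the largest alignment a given leading dimension supports (the helper below is hypothetical, not flux or CUTLASS API):

```cpp
#include <cstdint>

// Hypothetical helper: find the largest alignment (in elements) dividing the
// contiguous dimension `ld` of an operand. For fp16, alignment 8 corresponds
// to 128-bit vectorized loads; a kernel built with kAlignment = 8 is rejected
// whenever ld % 8 != 0. Which dimension is contiguous depends on the layout:
// N for a row-major B (K x N), K for a column-major B.
constexpr int32_t largest_valid_alignment(int32_t ld) {
  for (int32_t align = 8; align > 1; align /= 2) {
    if (ld % align == 0) return align;
  }
  return 1;  // scalar access always works, at a bandwidth cost
}

static_assert(largest_valid_alignment(512) == 8);  // K = 512: fully aligned
static_assert(largest_valid_alignment(100) == 4);  // needs a 4-aligned kernel
static_assert(largest_valid_alignment(57) == 1);   // odd: scalar access only
```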
@wenlei-bao Thank you for your reply. May I ask whether CUTLASS supports automatic padding of unaligned matrices? I see that PyTorch runs the above case normally. Thanks!
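(As far as I know, CUTLASS itself does not pad: the alignment is baked into each kernel instantiation, so frameworks either dispatch to a lower-alignment kernel or pad the operand before the GEMM. A hedged sketch of manual zero-padding of B, assuming row-major fp16 and hypothetical device buffers; none of this is flux API:)

```cpp
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Hedged sketch: zero-pad a row-major fp16 B (K x N) out to N_pad, a multiple
// of 8, so a 128-bit-aligned kernel can be selected. The padding columns are
// zero, so the first N columns of the GEMM result are unchanged. d_B and
// d_B_pad are hypothetical device pointers.
cudaError_t pad_operand_b(const __half* d_B, __half* d_B_pad,
                          int K, int N, int N_pad) {
  // zero the destination so the extra columns contribute nothing
  cudaError_t err = cudaMemset(d_B_pad, 0, sizeof(__half) * size_t(K) * N_pad);
  if (err != cudaSuccess) return err;
  // strided 2-D copy: K rows, each copying N elements into a row of pitch N_pad
  return cudaMemcpy2D(d_B_pad, sizeof(__half) * N_pad,
                      d_B, sizeof(__half) * N,
                      sizeof(__half) * N, K, cudaMemcpyDeviceToDevice);
}
```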
code path: include/flux/cuda/gemm_impls/gemm_grouped_impl.hpp
```cpp
// Parse template parameters
static constexpr auto dt_conf =
    to_gemm_dtype_config(make_gemm_dtype_config(meta.dtype()));
using ElementA = decltype(to_cutlass_element(dt_conf.a()));
using ElementB = decltype(to_cutlass_element(dt_conf.b()));
using ElementC = decltype(to_cutlass_element(dt_conf.c()));
using ElementD = decltype(to_cutlass_element(dt_conf.d()));

using ArchTag = decltype(to_cutlass_archtag(meta.arch()));
using ProblemShape = cute::Shape<int, int, int>;

using LayoutA = decltype(to_cutlass_layout_a(meta.gemm_layout()));
static constexpr int32_t kAlignmentA = 8;
using LayoutB = decltype(to_cutlass_layout_b(meta.gemm_layout()));
static constexpr int32_t kAlignmentB = 8;

using ElementOutput = ElementD;
using LayoutOutput = decltype(to_cutlass_layout_c(meta.gemm_layout()));
using ElementAccumulator = decltype(to_cutlass_element(dt_conf.acc()));

/// compose cutlass grouped gemm
// using MmaOp = typename CustomCollectiveMmaBuilder<Element, ArchTag>::Mma;
using MmaOp = cutlass::arch::OpClassTensorOp;
using SmArch = cutlass::arch::Sm80;  // TODO(houqi.1993)

auto
default_gemm_kernel() const {
  using GemmKernel = typename cutlass::gemm::kernel::DefaultGemmGrouped<
      ElementA, LayoutA, cutlass::ComplexTransform::kNone, kAlignmentA,
      ElementB, LayoutB, cutlass::ComplexTransform::kNone, kAlignmentB,
      ElementOutput, LayoutOutput, ElementAccumulator,
      MmaOp, SmArch,
      ShapeMmaThreadBlock, ShapeMmaWarp, ShapeMmaOp,
      EpilogueOp, SwizzleThreadBlock, NumStages,
      GroupScheduleMode::kDeviceOnly>::GemmKernel;
  return make_declval<GemmKernel>();
}
```

Would changing `kAlignmentB` in the code above solve this alignment problem? Thank you!
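(Note that `kAlignmentA`/`kAlignmentB` are compile-time template parameters of `DefaultGemmGrouped`, so changing them means instantiating and recompiling a different kernel rather than flipping a runtime switch. A hedged sketch of what lowering the operand-B alignment might look like; whether flux's dispatch logic accepts such an instantiation is an assumption this thread does not confirm:)

```cpp
// Hypothetical experiment: relax operand-B alignment so shapes whose
// contiguous dimension is not a multiple of 8 still pass can_implement().
// This trades 128-bit vectorized loads for scalar access and usually costs
// bandwidth; it is a sketch, not the fix flux ships.
static constexpr int32_t kAlignmentB = 1;  // was 8 (128-bit / fp16)
```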
> cmd: $ CUDA_LAUNCH_BLOCKING=1 /vllm-workspace/flux/scripts/launch.sh /vllm-workspace/flux/test/test_ag_kernel_pyshmem.py 4096 72 512 --dtype=float16 --iters=10
Sorry, I'm not aware of a file in FLUX called test/test_ag_kernel_pyshmem.py.
Besides, 72 for N is too small; even if it ran, I suspect you would see no performance gain.
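(A back-of-the-envelope illustration of why, assuming a 128-wide threadblock tile in N; the tile shape flux actually picks is an assumption here:)

```cpp
// With N = 72 and a hypothetical 128-wide tile, the single N-tile is barely
// half full, so the tensor cores waste ~44% of their work in that dimension.
constexpr int N = 72, TileN = 128;
constexpr int n_tiles = (N + TileN - 1) / TileN;               // 1 tile
constexpr double utilization = double(N) / (n_tiles * TileN);  // ~0.56
static_assert(n_tiles == 1);
static_assert(utilization > 0.5 && utilization < 0.6);
```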