potential bug in mixed gemm kernel and scale iterator
In the code below, the scale row count is divided by 64: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/fpA_intB_gemm.h#L415
Then, when the threadblock (tb) row offset is calculated, it is multiplied back by 64: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/transform/threadblock/fine_grained_scale_zero_iterator.h#L161
These two 64 constants do not seem to be needed; could you explain the reasoning behind them? Also, if a group size of 32 is ever supported someday, a division by zero will happen.
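
To make the arithmetic easier to discuss, here is a minimal standalone sketch of how I read these two constants interacting. The function names, formulas, and example values are my own assumptions based on the two links above, not code copied from TensorRT-LLM, and it does not try to reproduce the exact division-by-zero path, only the round trip through the divide-by-64 and multiply-by-64 steps compared with dividing the K offset directly by the group size:

```cpp
// Hypothetical sketch of the offset arithmetic described above.
// Names and formulas are assumptions, NOT the actual TensorRT-LLM code.
#include <cstdio>

// Kernel side (fpA_intB_gemm.h, as I read it): the scale row coordinate that
// is handed to the iterator is the K offset divided by 64.
long kernel_scale_row(long k_offset)
{
    return k_offset / 64; // the first "64" constant
}

// Iterator side (fine_grained_scale_zero_iterator.h, as I read it): the
// threadblock row offset is multiplied back by 64 before being scaled by the
// quantization group size.
long iterator_row_offset(long scale_row, int group_size)
{
    return scale_row * 64 / group_size; // the second "64" constant
}

int main()
{
    // One scale row exists per group_size rows of B, so dividing the K offset
    // directly by group_size is what I would naively expect.
    for (int group_size : {128, 64, 32})
    {
        for (long k_offset : {4096L, 96L})
        {
            long via_pair = iterator_row_offset(kernel_scale_row(k_offset), group_size);
            long direct = k_offset / group_size;
            std::printf("group_size=%3d  k_offset=%5ld  via 64s=%4ld  direct=%4ld%s\n",
                group_size, k_offset, via_pair, direct,
                via_pair == direct ? "" : "  <-- differs");
        }
    }
    return 0;
}
```

In this simplified model the two constants cancel for group sizes that are multiples of 64, which is why I suspect they are redundant; for a group size of 32 the round trip is no longer exact for every K offset.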