xiaonans
Currently some quantized Hugging Face models store zero-points directly in the int4 datatype, like [Qwen/Qwen2-7B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-7B-Instruct-GPTQ-Int4) and [Qwen/Qwen2-1.5B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct-AWQ). But the weight_only_groupwise_quant_matmul in TensorRT-LLM only supports fp16 zero-points as input, thus...
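For context, a minimal host-side sketch of what converting such zero-points might look like, assuming two unsigned int4 values are packed per byte, low nibble first (a common layout in GPTQ/AWQ checkpoints, but verify against the specific model); the function name and packing order are assumptions for illustration, not TensorRT-LLM API:

```cpp
// Illustrative only, not TensorRT-LLM code. Expands int4 zero-points
// (two per byte, low nibble first) into the fp16 array a kernel that
// expects fp16 zero-points could consume.
#include <cuda_fp16.h>
#include <cstdint>
#include <vector>

std::vector<__half> unpack_int4_zeros_to_fp16(const std::vector<uint8_t>& packed,
                                               size_t num_zeros) {
    std::vector<__half> zeros(num_zeros);
    for (size_t i = 0; i < num_zeros; ++i) {
        uint8_t byte = packed[i / 2];
        // Low nibble holds the even-indexed zero-point, high nibble the odd one.
        uint8_t z = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
        zeros[i] = __float2half(static_cast<float>(z));
    }
    return zeros;
}
```

Note that some GPTQ exports apply an extra offset (e.g. +1) or a symmetric convention to the stored zero-points, so the exact conversion should be checked against the checkpoint format in question.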
**What is your question?** I want to write my own fused fp16xfp16 GEMM kernel with CUTE, but I cannot find a tutorial or sample code with performance comparable to cuBLAS....
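As a baseline before hand-writing a CuTe kernel, one option is to instantiate CUTLASS's higher-level device GEMM for fp16 and benchmark it against cuBLAS. A minimal sketch using the cutlass::gemm::device::Gemm API (not raw CuTe); the Sm80 architecture tag and the problem size are assumptions, and tile shapes are left at library defaults:

```cpp
// fp16 GEMM via the CUTLASS device API -- a reference point to compare
// against cuBLAS, not a CuTe kernel.
#include <cutlass/gemm/device/gemm.h>
#include <cutlass/util/host_tensor.h>

int main() {
    using Gemm = cutlass::gemm::device::Gemm<
        cutlass::half_t, cutlass::layout::RowMajor,     // A
        cutlass::half_t, cutlass::layout::ColumnMajor,  // B
        cutlass::half_t, cutlass::layout::RowMajor,     // C / D
        float,                                          // accumulator
        cutlass::arch::OpClassTensorOp,                 // use Tensor Cores
        cutlass::arch::Sm80>;                           // assumed target arch

    int M = 4096, N = 4096, K = 4096;  // placeholder problem size
    cutlass::HostTensor<cutlass::half_t, cutlass::layout::RowMajor>    A({M, K});
    cutlass::HostTensor<cutlass::half_t, cutlass::layout::ColumnMajor> B({K, N});
    cutlass::HostTensor<cutlass::half_t, cutlass::layout::RowMajor>    C({M, N});
    A.sync_device();
    B.sync_device();
    C.sync_device();

    Gemm gemm_op;
    Gemm::Arguments args({M, N, K},
                         {A.device_data(), K},   // lda for row-major A
                         {B.device_data(), K},   // ldb for column-major B
                         {C.device_data(), N},   // source C
                         {C.device_data(), N},   // destination D
                         {1.0f, 0.0f});          // alpha, beta
    cutlass::Status status = gemm_op(args);
    return status == cutlass::Status::kSuccess ? 0 : 1;
}
```

Timing this against cuBLAS for the target shapes gives a realistic performance bar before dropping down to a custom CuTe kernel.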
Now I'm using CUTLASS in my project. I found that some cases have constraints on the layout, such as requiring input matrix A and output matrix C to be row-major....
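When a kernel configuration only accepts one layout combination, a common workaround is the transpose identity C = A·B implies C^T = B^T·A^T: swap the A/B operands and the M/N dimensions, and a kernel that writes a row-major C effectively produces a column-major result without moving any data. A small self-contained illustration of the identity (plain C++, independent of CUTLASS):

```cpp
// The column-major buffers of A, B, C are exactly the row-major buffers of
// A^T, B^T, C^T, and C^T = B^T * A^T. So a "row-major only" kernel can still
// compute a fully column-major GEMM by swapping operands and M/N.
#include <cassert>
#include <cstdio>
#include <vector>

// Stand-in for a row-major-only kernel: C[MxN] = A[MxK] * B[KxN], all row-major.
void gemm_rowmajor(const float* A, const float* B, float* C, int M, int N, int K) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.f;
            for (int k = 0; k < K; ++k) acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = acc;
        }
}

int main() {
    const int M = 2, N = 3, K = 4;
    // A (MxK), B (KxN), C (MxN), all stored column-major.
    std::vector<float> A(M * K), B(K * N), C(M * N), C_ref(M * N);
    for (int i = 0; i < M * K; ++i) A[i] = float(i + 1);
    for (int i = 0; i < K * N; ++i) B[i] = float(i - 5);

    // Reference: naive column-major GEMM.
    for (int n = 0; n < N; ++n)
        for (int m = 0; m < M; ++m) {
            float acc = 0.f;
            for (int k = 0; k < K; ++k) acc += A[k * M + m] * B[n * K + k];
            C_ref[n * M + m] = acc;
        }

    // Operand swap: run the row-major kernel on (B, A) with M and N exchanged.
    // No data is copied or transposed; only the interpretation changes.
    gemm_rowmajor(B.data(), A.data(), C.data(), /*M=*/N, /*N=*/M, /*K=*/K);

    for (int i = 0; i < M * N; ++i) assert(C[i] == C_ref[i]);
    std::printf("operand-swap result matches column-major reference\n");
    return 0;
}
```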
**Describe the bug** I modified the threadblock/warp tile shapes and the output datatype in https://github.com/NVIDIA/cutlass/blob/main/test/unit/gemm/device/gemm_s8t_s8n_s32t_tensor_op_s32_sm80.cu, and found that some shapes cause the tests to fail. I changed the ElementOutput to cutlass::half_t and...
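For context, a sketch approximating the kind of modification described: the device-level GEMM from that test (int8 row-major A, int8 column-major B, tensor-op on SM80) with the output element changed to cutlass::half_t. The exact epilogue and tile shapes in the real test file may differ, so the template arguments below are illustrative:

```cpp
// Approximation of the GEMM type in the unit test, with ElementOutput changed
// from int32_t to cutlass::half_t. The tile shapes are examples of the kind a
// reader might experiment with; they are not guaranteed to be valid choices.
#include <cutlass/gemm/device/gemm.h>
#include <cutlass/epilogue/thread/linear_combination.h>

using ElementOutput      = cutlass::half_t;   // modified from int32_t
using ElementAccumulator = int32_t;

using Gemm = cutlass::gemm::device::Gemm<
    int8_t, cutlass::layout::RowMajor,          // A (s8t)
    int8_t, cutlass::layout::ColumnMajor,       // B (s8n)
    ElementOutput, cutlass::layout::RowMajor,   // C / D
    ElementAccumulator,
    cutlass::arch::OpClassTensorOp,
    cutlass::arch::Sm80,
    cutlass::gemm::GemmShape<128, 128, 64>,     // threadblock tile (example)
    cutlass::gemm::GemmShape<64, 64, 64>,       // warp tile (example)
    cutlass::gemm::GemmShape<16, 8, 32>,        // int8 instruction shape on SM80
    cutlass::epilogue::thread::LinearCombination<
        ElementOutput,
        128 / cutlass::sizeof_bits<ElementOutput>::value,  // epilogue vector width
        ElementAccumulator,
        float>>;
```

Since changing the output element also changes the epilogue's vectorization width, one thing worth checking for a failing shape is whether the tile combination still satisfies the kernel's constraints, e.g. via Gemm::can_implement() at runtime.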