[QST] Is there any INT8 GEMM with INT8 alpha and beta?
What is your question?
```cpp
// Type aliases assumed by the instantiation below.
using ElementInputA = int8_t;
using ElementInputB = int8_t;
using ElementOutput = int8_t;
using ElementAccumulator = int32_t;
using LayoutInputA = cutlass::layout::RowMajor;
using LayoutInputB = cutlass::layout::ColumnMajor;
using LayoutOutput = cutlass::layout::RowMajor;

using Gemm = cutlass::gemm::device::Gemm<
    ElementInputA, LayoutInputA, ElementInputB, LayoutInputB,
    ElementOutput, LayoutOutput, ElementAccumulator,
    cutlass::arch::OpClassTensorOp, cutlass::arch::Sm75>;

cutlass::gemm::GemmCoord problem_size(M, N, K);

cutlass::TensorRef<ElementInputA, LayoutInputA> input_ref(input.data_ptr<int8_t>(), LayoutInputA::packed(input_size));
cutlass::TensorRef<ElementInputB, LayoutInputB> weight_ref(weight.data_ptr<int8_t>(), LayoutInputB::packed(weight_size));
cutlass::TensorRef<ElementOutput, LayoutOutput> out_ref(out.data_ptr<int8_t>(), LayoutOutput::packed(output_size));

typename Gemm::Arguments arguments{
    problem_size,   // <- problem size of the matrix multiplication
    input_ref,      // <- reference to matrix A on device
    weight_ref,     // <- reference to matrix B on device
    out_ref,        // <- reference to matrix C on device (source term scaled by beta)
    out_ref,        // <- reference to matrix D on device (destination)
    {alpha, beta},  // <- epilogue scalars
    1};             // <- number of split-K slices
```
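For context, a minimal sketch of how such an `arguments` object is typically launched with the device-level GEMM; the error handling here is just a placeholder:

```cpp
// Instantiate the GEMM operator and launch it with the arguments built above.
Gemm gemm_op;

// Optional: check that this kernel can handle the problem before launching.
cutlass::Status status = gemm_op.can_implement(arguments);
if (status != cutlass::Status::kSuccess) {
  // handle unsupported configuration
}

// Initialize and run; device::Gemm's operator() does both in one call.
status = gemm_op(arguments);
if (status != cutlass::Status::kSuccess) {
  // handle launch failure
}
```

With a single split-K slice no extra workspace is needed; otherwise `gemm_op.get_workspace_size(arguments)` reports how much device memory to allocate and pass in.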
In the code above, if I set alpha and beta as INT8, I get a warning about a narrowing conversion from int to float.
Do alpha and beta have to be float? I want to set them to INT8 to increase inference speed.
Just using 8-bit alpha/beta is not going to make a performance difference.
@jhss is your question resolved?
I want to know why 8-bit int alpha/beta doesn't affect performance.
Because shaving a single load per tile from 4 bytes down to 1 byte does not change the perf at all. Changing fp32 multiplication to int8 will also not move the needle much in the grand scheme of things. What is the problem size you are most interested in?
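To make that concrete, here is a rough back-of-the-envelope sketch; the sizes are an assumption, taken from the roughly 2048 x 768 x 768 per-batch problem mentioned later in the thread:

```cpp
#include <cstdio>

int main() {
  // Assumed per-batch problem size: M = 2048, N = 768, K = 768.
  const long long M = 2048, N = 768, K = 768;
  const long long mainloop_macs = M * N * K;  // int8 multiply-accumulates in the GEMM mainloop
  const long long epilogue_ops  = M * N;      // one alpha * acc + beta * src per output element
  std::printf("mainloop MACs: %lld\n", mainloop_macs);                    // ~1.2e9
  std::printf("epilogue ops : %lld\n", epilogue_ops);                     // ~1.6e6
  std::printf("ratio        : %lld : 1\n", mainloop_macs / epilogue_ops); // = K = 768 : 1
  return 0;
}
```

The epilogue touches each output element once, so whatever type alpha and beta use, its cost is on the order of 1/K of the mainloop's work.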
Thank you for answering.
I'm looking at the smoothquant repository. They use matrix multiplications whose sizes are about (batch x 2048 x 768) * (768 x 768) in one layer. They perform the matmul with torch_int, which uses cutlass::epilogue::thread::LinearCombination at the end of the matrix multiplication. With it, they turn the INT32 accumulator into float32 as follows:

weight_scale (alpha) * accumulator + bias_scale (beta) * bias

Suppose the accumulator shape is (batch x 2048 x 768); then the weight_scale shape is (2048,), which is broadcast and multiplied with the accumulator. I'm just thinking that the (1, 2048, 1) * (batch, 2048, 768) multiplication would become faster if the type of weight_scale were changed from fp32 to int8.
Although I doubt it, you can certainly try int8 alpha/beta to see if it would help in this case. What you would have to do is modify the epilogue thread functor's ElementCompute type and then use that to construct your epilogue.
https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/thread/linear_combination.h#L68
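A minimal sketch of what that could look like, assuming the type aliases from the original snippet. Whether an int8_t ElementCompute instantiation compiles and gives acceptable numerics is something to verify; be aware that the accumulator is converted to ElementCompute before scaling, so values may saturate:

```cpp
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/device/gemm.h"

// Epilogue functor with ElementCompute = int8_t instead of the default float.
using EpilogueOp = cutlass::epilogue::thread::LinearCombination<
    ElementOutput,                                     // output element type (int8_t)
    128 / cutlass::sizeof_bits<ElementOutput>::value,  // elements per vectorized epilogue access
    ElementAccumulator,                                // accumulator type (int32_t)
    int8_t>;                                           // ElementCompute used for alpha/beta

// The epilogue is a later template parameter, so the tile shapes that the
// defaults would otherwise fill in must be spelled out explicitly.
using GemmInt8Epilogue = cutlass::gemm::device::Gemm<
    ElementInputA, LayoutInputA, ElementInputB, LayoutInputB,
    ElementOutput, LayoutOutput, ElementAccumulator,
    cutlass::arch::OpClassTensorOp, cutlass::arch::Sm75,
    cutlass::gemm::GemmShape<128, 256, 64>,  // threadblock tile
    cutlass::gemm::GemmShape<64, 64, 64>,    // warp tile
    cutlass::gemm::GemmShape<8, 8, 16>,      // tensor core instruction shape (Sm75 int8)
    EpilogueOp>;
```

The rest of the host code stays the same; only the `{alpha, beta}` scalars passed in the Arguments change type.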
@jhss is your question resolved?
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.