[QST] Is there any INT8 GEMM with INT8 alpha and beta?

Open jhss opened this issue 1 year ago • 8 comments

What is your question?

using Gemm = cutlass::gemm::device::Gemm<
      int8_t, cutlass::layout::RowMajor, int8_t, cutlass::layout::ColumnMajor,
      ElementOutput, cutlass::layout::RowMajor, ElementAccumulator,
      cutlass::arch::OpClassTensorOp, cutlass::arch::Sm75>;

cutlass::gemm::GemmCoord problem_size(M, N, K);

cutlass::TensorRef<ElementInputA, LayoutInputA> input_ref(input.data_ptr<int8_t>(), LayoutInputA::packed(input_size));
cutlass::TensorRef<ElementInputB, LayoutInputB> weight_ref(weight.data_ptr<int8_t>(), LayoutInputB::packed(weight_size));
cutlass::TensorRef<ElementOutput, LayoutOutput> out_ref(out.data_ptr<int8_t>(), LayoutOutput::packed(output_size));

typename Gemm::Arguments arguments{
      problem_size, // <- problem size of matrix multiplication
      input_ref,    // <- reference to matrix A on device
      weight_ref,   // <- reference to matrix B on device
      out_ref,      // <- reference to matrix C on device
      out_ref,      // <- reference to matrix D on device
      {alpha, beta}, 1};
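
(The type aliases are not shown above; from the int8_t pointers and the usual INT8 tensor-op configuration they are presumably something like the following, with the accumulator assumed to be int32_t:)

using ElementInputA      = int8_t;
using LayoutInputA       = cutlass::layout::RowMajor;
using ElementInputB      = int8_t;
using LayoutInputB       = cutlass::layout::ColumnMajor;
using ElementOutput      = int8_t;
using LayoutOutput       = cutlass::layout::RowMajor;
using ElementAccumulator = int32_t;   // assumed: INT8 tensor-core GEMM accumulates in int32

With this instantiation, device::Gemm falls back to its default epilogue, whose ElementCompute is float; that is why {alpha, beta} in Arguments expects floats and an int8 value triggers the narrowing warning described below.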

In the code above, if I set alpha and beta as INT8, I get a warning about a narrowing conversion from int to float.

Do alpha and beta have to be float? I want to set them as INT8 to increase inference speed.

jhss avatar Oct 25 '23 03:10 jhss

Just using 8-bit alpha/beta is not going to make a performance difference.

hwu36 avatar Oct 25 '23 15:10 hwu36

@jhss is your question resolved?

mnicely avatar Dec 05 '23 17:12 mnicely

I want to know why 8-bit int alpha/beta doesn't affect performance.

jhss avatar Dec 08 '23 01:12 jhss

Because shaving a single per-tile load from 4 bytes down to 1 byte does not change the perf at all. Changing the fp32 multiplication to int8 will also not move the needle much in the grand scheme of things. What problem size are you most interested in?

thakkarV avatar Dec 08 '23 01:12 thakkarV

Thank you for answering.

I'm looking at the smoothquant repository; it uses matrix multiplications of size about (batch x 2048 x 768) * (768 x 768) in one layer. It performs the matmul with torch_int, which uses cutlass::epilogue::thread::LinearCombination at the end of the matrix multiplication. This converts the INT32 accumulator into float32 as follows:

weight_scale (alpha) * accumulator + bias_scale (beta) * bias

Suppose the accumulator shape is (batch x 2048 x 768); then weight_scale has shape (2048,), which is broadcast and multiplied with the accumulator. I'm just thinking that the (1, 2048, 1) * (batch, 2048, 768) multiplication would become faster if the type of weight_scale were changed from fp32 to int8.
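
For concreteness, a minimal sketch of what that epilogue looks like on the CUTLASS side, assuming int8 output, int32 accumulation, and a single per-tensor alpha/beta for simplicity (the per-row case needs a different epilogue, and the names here are just for illustration):

using DequantEpilogue = cutlass::epilogue::thread::LinearCombination<
    int8_t,                                     // ElementOutput
    128 / cutlass::sizeof_bits<int8_t>::value,  // elements per vectorized epilogue access
    int32_t,                                    // ElementAccumulator from the INT8 GEMM
    float>;                                     // ElementCompute: alpha (weight_scale) and beta (bias_scale) are fp32

// D = alpha * accumulator + beta * C, where the C operand carries the bias in this usage.
DequantEpilogue::Params dequant_params(weight_scale, bias_scale);

The question is whether changing that last template argument (ElementCompute) from float to int8_t would speed things up.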

jhss avatar Dec 08 '23 02:12 jhss

Although I doubt it, you can certainly try int8 alpha/beta to see if it would help in this case. What you would have to do is modify the epilogue thread functor's ElementCompute type and then use that to construct your epilogue. https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/thread/linear_combination.h#L68
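
Concretely, a sketch of what that modification might look like, assuming int8 A/B/D, int32 accumulation, and the usual Sm75 int8 tile shapes (not tested, per the caveat above):

using Int8ComputeEpilogue = cutlass::epilogue::thread::LinearCombination<
    int8_t,                                     // ElementOutput
    128 / cutlass::sizeof_bits<int8_t>::value,  // elements per vectorized epilogue access
    int32_t,                                    // ElementAccumulator
    int8_t>;                                    // ElementCompute: alpha/beta now int8

using GemmInt8Epilogue = cutlass::gemm::device::Gemm<
    int8_t, cutlass::layout::RowMajor,          // A
    int8_t, cutlass::layout::ColumnMajor,       // B
    int8_t, cutlass::layout::RowMajor,          // C / D
    int32_t,                                    // accumulator
    cutlass::arch::OpClassTensorOp, cutlass::arch::Sm75,
    cutlass::gemm::GemmShape<128, 256, 64>,     // threadblock tile (assumed Sm75 int8 default)
    cutlass::gemm::GemmShape<64, 64, 64>,       // warp tile (assumed)
    cutlass::gemm::GemmShape<8, 8, 16>,         // Turing int8 tensor-core instruction shape
    Int8ComputeEpilogue>;                       // custom epilogue replaces the float-compute default

Arguments are then constructed exactly as before, except {alpha, beta} is now a pair of int8_t values.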

thakkarV avatar Dec 08 '23 02:12 thakkarV

@jhss is your question resolved?

mnicely avatar Jan 02 '24 15:01 mnicely

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Feb 01 '24 16:02 github-actions[bot]