
Quantized matmul with unquantized bias

Open jelmervdl opened this issue 2 years ago • 8 comments

I'm trying to use the matmul primitive to multiply two quantized matrices (A:int8 & B:int8) into a C:float32, and want to add an unquantized bias:float32 to the output. To compensate for the quantisation, I use set_output_scales with 1.f / (A_scale * B_scale).

But this output scale is also applied to the bias? To get the expected output, I need to multiply my bias by A_scale * B_scale before the operation.

I.e. it looks like oneDNN does: C:float32 = Relu(Scaling:f32 * (A:int8 * B:int8 + Bias:float32))

Is there a way to do C:float32 = Relu(Scaling:f32 * (A:int8 * B:int8) + Bias:float32) instead?

What I've tried: I noticed that attributes.set_scales exists, which would allow me to set the scales for A and B individually. But this doesn't seem to be supported for matmul.

Another option I thought of was using a binary add post_op, as the documentation mentioned that these are applied after output scaling. But that doesn't seem to support dimensions specified at runtime yet? So that's also a no-go.

(To make matters more complicated, I'm using a bias vector as opposed to a matrix. This seems to work as expected though when I set the second dimension stride to 0.)

Playground: https://gist.github.com/jelmervdl/03632d158513b3f46925351cae9ad43f
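
For illustration, here is a condensed sketch of the setup described above (oneDNN v2.x API; shapes, data types and scale values are placeholders rather than the ones from the linked playground):

#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    const memory::dim M = 1, K = 1024, N = 4096;
    const float A_scale = 0.05f, B_scale = 0.02f; // hypothetical quantization scales

    memory::desc A_md({M, K}, memory::data_type::s8, memory::format_tag::ab);
    memory::desc B_md({K, N}, memory::data_type::s8, memory::format_tag::ab);
    memory::desc bias_md({1, N}, memory::data_type::f32, memory::format_tag::ab); // broadcast over M
    memory::desc C_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    primitive_attr attr;
    // v2.x applies this common output scale to (A*B + bias), which is the issue discussed here.
    attr.set_output_scales(0, {1.f / (A_scale * B_scale)});
    post_ops po;
    po.append_eltwise(1.f, algorithm::eltwise_relu, 0.f, 0.f);
    attr.set_post_ops(po);

    matmul::desc md(A_md, B_md, bias_md, C_md);
    matmul::primitive_desc pd(md, attr, eng);
    matmul prim(pd); // execution with real buffers omitted for brevity
    return 0;
}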

jelmervdl avatar Apr 28 '22 10:04 jelmervdl

Hi @jelmervdl, thank you for your question.

Is there a way to do C:float32 = Relu(Scaling:f32 * (A:int8 * B:int8) + Bias:float32) instead?

Besides the binary_add option, there is no other way. As you noticed and mentioned, runtime dimensions are not supported for the binary post-op. Applying scales to the bias is unfortunately the only way to proceed in this situation. Is there any issue (besides usability and potential overflow) that prevents you from scaling the bias?

To make matters more complicated, I'm using a bias vector as opposed to a matrix. This seems to work as expected though when I set the second dimension stride to 0

I expect the sandbox code to still work if the bias is passed with {1, 3} dims and {1, 1} strides. Thank you.
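
For illustration, a minimal sketch of the bias pre-scaling workaround described above (not from the original thread; the A_scale/B_scale names and the length-N f32 bias vector are placeholder assumptions):

#include <cstddef>
#include <vector>

// Hypothetical helper for the workaround above: pre-multiply the f32 bias by
// A_scale * B_scale so that the common output scale 1/(A_scale * B_scale),
// which oneDNN v2.x applies to (A*B + bias), restores the bias to its original value.
std::vector<float> scale_bias(const std::vector<float> &bias,
                              float A_scale, float B_scale) {
    std::vector<float> scaled(bias.size());
    for (std::size_t i = 0; i < bias.size(); ++i)
        scaled[i] = bias[i] * A_scale * B_scale;
    return scaled;
}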

dzarukin avatar May 07 '22 02:05 dzarukin

@dzarukin we are doing 8-bit neural network inference with intermediate fp32 results. We find that the bias term compresses poorly, which is why we keep it in fp32 format. Without the ability to scale the bias, we need to issue a second call to oneDNN, which is quite a bit slower than doing everything at once. This is why my coworker asked if this functionality exists.

XapaJIaMnu avatar May 10 '22 08:05 XapaJIaMnu

Another option I thought of was using a binary add post_op, as the documentation mentioned that these are applied after output scaling. But that doesn't seem to support dimensions specified at runtime yet? So that's also a no-go.

Out of curiosity, why do you need the bias dimension to be specified at runtime? In general, one would expect the bias to be a parameter learned during training and hence to have fixed dimensions.

If you indeed have changing dimensions for the bias, I would recommend still using fixed-dimension primitives if you have a "small" number of different values for this dimension. Across several runs of the topology, the cost of primitive creation should be amortized by the oneDNN primitive cache. Using fixed-size dimensions has a few benefits (a sketch combining these points with the binary add post-op follows the list):

  1. you will get the fastest implementation available (e.g. brg:matmul)
  2. you will be able to create weights with the any format tag (format_tag::any), which allows reordering the weights ahead of time. This typically yields substantial speedups.
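
A condensed sketch of how this could look with the v2.x API (make_matmul_pd is a hypothetical helper name; shapes, data types and scale handling are placeholder assumptions, not taken from the reproducers below):

#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

// Fixed-dimension matmul with weights created with format_tag::any and the f32
// bias added via a binary_add post-op, which is applied after output scaling.
matmul::primitive_desc make_matmul_pd(const engine &eng, memory::dim M,
                                      memory::dim K, memory::dim N,
                                      float A_scale, float B_scale) {
    memory::desc A_md({M, K}, memory::data_type::u8, memory::format_tag::ab);
    memory::desc B_md({K, N}, memory::data_type::s8, memory::format_tag::any);
    memory::desc bias_md({1, N}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc C_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    primitive_attr attr;
    attr.set_output_scales(0, {1.f / (A_scale * B_scale)});
    post_ops po;
    po.append_binary(algorithm::binary_add, bias_md); // added after output scaling
    po.append_eltwise(1.f, algorithm::eltwise_relu, 0.f, 0.f);
    attr.set_post_ops(po);

    matmul::desc md(A_md, B_md, C_md); // no primitive bias; bias goes through the post-op
    return matmul::primitive_desc(md, attr, eng);
}
// At execution time the bias memory is passed under
// DNNL_ARG_ATTR_MULTIPLE_POST_OP(0) | DNNL_ARG_SRC_1, and the weights are
// reordered once to pd.weights_desc().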

Here are benchdnn reproducers to highlight the above benefits (timings on 4 CLX cores): [1]

Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,brg:avx512_core_vnni,,--matmul --mode=P --cfg=u8s8f32 --stag=ab --wtag=ab --dtag=ab --attr-oscale=common:0.1* --attr-post-ops=add:f32:per_dim_0+relu 1x1024:1024x4096:1x4096,0.00838861,0.164795,50.9033,0.175746,47.7313
perf,cpu,gemm:jit,,--matmul --mode=P --cfg=u8s8f32 --stag=ab --wtag=ab --dtag=ab --runtime_dims_masks=1:0 --attr-oscale=common:0.1* --attr-post-ops=add:f32:per_dim_0+relu 1x1024:1024x4096:1x4096,0.00838861,0.219971,38.1351,0.239368,35.0447

Both primitives use the binary add post-op followed by relu, and the bias added via the binary add post-op has a fixed size. The first line is with fixed dimensions for M, N and K. The second line is with a runtime dimension for M but fixed dimensions for K and N. Here the primitive that is not using a runtime dimension is ~1.4x faster.

[2]

Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,brg:avx512_core_vnni,,--matmul --mode=P --cfg=u8s8f32 --stag=ab --dtag=ab --attr-oscale=common:0.1* --attr-post-ops=add:f32:per_dim_0+relu 1x1024:1024x4096:1x4096,0.00838861,0.078125,107.374,0.0863561,97.1397
perf,cpu,gemm:jit,,--matmul --mode=P --cfg=u8s8f32 --stag=ab --dtag=ab --runtime_dims_masks=1:0 --attr-oscale=common:0.1* --attr-post-ops=add:f32:per_dim_0+relu 1x1024:1024x4096:1x4096,0.00838861,0.220459,38.0507,0.229443,36.5608

Same as before, except that we use the any layout for weights in both cases. Here, the primitive with fixed dimensions has blocked weights and is ~2.7x faster.

mgouicem avatar May 10 '22 11:05 mgouicem

@mgouicem the heaviest computation in our neural network inference is the output layer, which involves a multiplication of the form 1x512 * 512x32000 + 1x32000. However, the vast majority of those outputs are unlikely in NLP tasks, so we can use techniques such as lexical shortlisting to dynamically select, for every mini-batch, a subset of the very large 512x32000 parameter matrix and of the bias. As a result, at every mini-batch we end up with a pretty much unique problem size, ranging from 512x100 to 512x10000 elements, and this is why we would like to use runtime dimensions.

We could in theory only allow increments of 100 or so, but that seems inefficient, and we also have plenty of outliers.

E.g. for some models the vocabulary size reaches 96000. Pre-generating granular primitives for those models would incur a hefty startup penalty, which we would like to avoid.

XapaJIaMnu avatar May 10 '22 13:05 XapaJIaMnu

Thanks for the clarification.

Ideally, the oneDNN primitive cache should remove the need to pre-generate all possible shapes or to constrain your shapes. When the shapes are not known in advance, it should be fine to create the matmul primitive right before you run it if (see the sketch after this list):

  • most of the time, the matmul primitives you create have the same shapes (if the shape distribution of your matmul computation is uniform over [100, 10000], then this suggestion does not work out);
  • you are ok with incurring a latency penalty from time to time (a cache miss incurs the full cost of primitive creation).
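
A small sketch of this cache-reliant approach (reusing the hypothetical make_matmul_pd helper and placeholder shapes from the earlier sketch):

#include <vector>
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

// Create the primitive descriptor right before execution and rely on the
// primitive cache: repeated shortlist sizes hit the cache, so only genuinely
// new values of N pay the full creation cost.
void run_shortlisted_layers(const engine &eng, float A_scale, float B_scale) {
    const std::vector<memory::dim> shortlist_sizes = {4096, 500, 4096, 10000, 4096};
    for (memory::dim N : shortlist_sizes) {
        auto pd = make_matmul_pd(eng, /*M=*/1, /*K=*/512, N, A_scale, B_scale);
        matmul prim(pd);
        // ... reorder weights to pd.weights_desc() if needed, then execute ...
    }
}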

mgouicem avatar May 10 '22 14:05 mgouicem

Also, it seems that the performance gap between all those shapes is not that huge (when using fixed shapes). When measuring matmul for the shapes you shared on a 16-core CLX system, the time ranges from 0.08 ms for 1x512:512x32000:1x32000 to 0.02 ms for 1x512:512x500:1x500.

mgouicem avatar May 10 '22 14:05 mgouicem

As a result, at every mini-batch we end up with a pretty much unique problem size, ranging from 512x100 to 512x10000 elements, and this is why we would like to use runtime dimensions.

I'm curious to know if this kind of "dynamism" will propagate to other layers, for example if there is a softmax after this fully connected layer. Note that oneDNN doesn't even support runtime dimensions for the softmax primitive.

TaoLv avatar May 10 '22 14:05 TaoLv

@mgouicem Just to clarify, the above case where the leading dimension M is 1 is an idealised case where we care most about latency (mini-batch 1, beam size 1). In practice, that 1 is the product of the mini-batch size (which is always dynamic, depending on the sentence length for NLP tasks) and the beam search parameter (and it slowly decreases as sentences complete / hypotheses fall out of the beam, although we usually pad with zeroes to avoid it being too dynamic). Fixing the mini-batch size is not ideal since we wouldn't know the length of the input in advance in real-world situations. The bias is always 1xXXXX as it just gets broadcast.

most of the time, the matmul primitives you create have the same shapes (if the shape distribution of your matmul computation is uniform over [100, 10000], then this suggestion does not work out)

Yes, it is mostly uniformly distributed.

@TaoLv, yes, if we do a softmax (or argmax) after the output layer, that would propagate to this fully connected layer.

Looking towards the future (https://techcrunch.com/2022/03/22/microsoft-improves-its-ai-translations-with-z-code/), we are expected to use a mixture of experts, which further increases the potential combinations of matrix shapes.

XapaJIaMnu avatar May 10 '22 15:05 XapaJIaMnu

oneDNN v3.0 introduced changes to the quantization scheme that allow an unquantized bias. See the full scope of changes in the quantization and scaling RFC. You can test the new functionality on the rls-v3.0 branch.
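
For illustration, a sketch of how this could look with the v3.0 API (placeholder shapes; make_v3_matmul_pd is a hypothetical helper name):

#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

// v3.0 scheme: per-argument scales for the int8 src/weights, and an f32 bias
// that is added after those scales are applied,
// i.e. C = Relu(src_scale * wei_scale * (A * B) + bias).
matmul::primitive_desc make_v3_matmul_pd(const engine &eng, memory::dim M,
                                         memory::dim K, memory::dim N) {
    memory::desc A_md({M, K}, memory::data_type::s8, memory::format_tag::ab);
    memory::desc B_md({K, N}, memory::data_type::s8, memory::format_tag::any);
    memory::desc bias_md({1, N}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc C_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    primitive_attr attr;
    attr.set_scales_mask(DNNL_ARG_SRC, 0);     // common scale for A
    attr.set_scales_mask(DNNL_ARG_WEIGHTS, 0); // common scale for B
    post_ops po;
    po.append_eltwise(algorithm::eltwise_relu, 0.f, 0.f); // v3.x signature (no scale argument)
    attr.set_post_ops(po);

    return matmul::primitive_desc(eng, A_md, B_md, bias_md, C_md, attr);
}
// At execution time the actual scale values are passed as f32 memories under
// DNNL_ARG_ATTR_SCALES | DNNL_ARG_SRC and DNNL_ARG_ATTR_SCALES | DNNL_ARG_WEIGHTS.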

vpirogov avatar Nov 24 '22 00:11 vpirogov