
8 bit Winograd Convolution?

Open manojrohit opened this issue 7 years ago • 5 comments

Is it possible to implement Winograd convolution with 8-bit weights and activations? The intermediate transformations cause overflows, which result in a loss of accuracy for the overall CNN. Is anyone aware of research implementing Winograd in low-precision domains?

manojrohit avatar Oct 10 '18 11:10 manojrohit

There are a couple of ways to think about this. I will assume you are using 8-bit integers and not 8-bit floating point numbers.

For deployment, the network weights are constant, so the Winograd components can be computed offline in high precision, then quantized to 8 bits and stored. Because the Winograd components over-determine the raw weights, they actually contain more information than the raw weights.

The downside is that the Winograd components use more memory than the raw weights. The F(2x2, 3x3) filter transform expands the raw weights by a factor of 1.78X, and F(4x4, 3x3) by 4X.
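As a rough illustration of the offline step, here is a minimal numpy sketch for F(2x2, 3x3), assuming the standard filter-transform matrix G for interpolation points (0, 1, -1) and a single symmetric scale per filter; the function name and quantization scheme are only illustrative.

```python
import numpy as np

# Filter transform G for F(2x2, 3x3) (the usual choice of points 0, 1, -1).
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])

def quantize_winograd_weights(g_fp32):
    """Transform a 3x3 filter offline in fp32, then quantize the 4x4
    Winograd-domain weights to int8 with one symmetric scale."""
    U = G @ g_fp32 @ G.T                          # 4x4 Winograd components
    scale = max(np.abs(U).max() / 127.0, 1e-8)    # symmetric quantization scale
    U_q = np.clip(np.round(U / scale), -128, 127).astype(np.int8)
    return U_q, scale

g = np.random.randn(3, 3).astype(np.float32)
U_q, scale = quantize_winograd_weights(g)
print(U_q.shape)   # (4, 4): 16 int8 values per 3x3 filter, i.e. the 1.78X expansion
```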

Typically 8-bit activations are computed using full-precision multiplication and 32-bit accumulation, so that there is no precision loss during the computation. Then the 32-bit results are quantized to 8-bits before the next stage of computation.
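For concreteness, a minimal numpy sketch of the pattern just described: widen the int8 operands to int32 for the multiply-accumulate, then requantize to int8 (the multiplier value here is just an assumption).

```python
import numpy as np

# int8 activations and weights, multiplied at full precision, accumulated in
# int32, then requantized to int8 for the next stage.
a_q = np.random.randint(-128, 128, size=16, dtype=np.int8)
w_q = np.random.randint(-128, 128, size=16, dtype=np.int8)

acc = np.dot(a_q.astype(np.int32), w_q.astype(np.int32))   # 32-bit accumulator
out_mult = 0.05                                            # assumed combined requantization multiplier
out_q = np.clip(np.round(acc * out_mult), -128, 127).astype(np.int8)
```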

So you could also apply the Winograd transform to the 32-bit activations before quantizing to 8 bits. If you were to do this, you would probably fuse the multiplication stage, inverse Winograd transform, bias, activation, forward Winograd transform, and quantization stages into a single operation.
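A per-tile sketch of that fused sequence for F(2x2, 3x3) could look like the following (numpy, with hypothetical per-tensor scales; assembling the next layer's 4x4 input tiles from neighbouring 2x2 outputs is glossed over, which is exactly the bookkeeping a real kernel has to handle).

```python
import numpy as np

# Winograd matrices for F(2x2, 3x3).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.int32)
AT = np.array([[1,  1,  1,  0],
               [0,  1, -1, -1]], dtype=np.int32)

def fused_tile(V_q, U_q, bias, in_scale, w_scale, out_scale):
    """One tile of the fused pipeline: int8 Winograd-domain multiply with
    int32 accumulation over input channels, inverse transform, bias, ReLU,
    then forward transform and requantization of the result to int8.
    V_q, U_q: (C, 4, 4) int8 input and weight components; the scales are
    hypothetical per-tensor quantization parameters."""
    # Elementwise multiply in the Winograd domain, accumulate over channels.
    M = (V_q.astype(np.int32) * U_q.astype(np.int32)).sum(axis=0)   # (4, 4) int32
    # Inverse transform to a 2x2 spatial output, dequantize, add bias, ReLU.
    y = np.maximum((AT @ M @ AT.T) * (in_scale * w_scale) + bias, 0.0)
    # Placeholder: a real kernel assembles the next 4x4 input tile from
    # neighbouring 2x2 outputs; here we just zero-pad to stay self-contained.
    x_tile = np.pad(y, 1)
    # Forward transform for the next layer, then quantize to int8.
    V_next = BT @ x_tile @ BT.T
    return np.clip(np.round(V_next / out_scale), -128, 127).astype(np.int8)

C = 16
V_q = np.random.randint(-128, 128, size=(C, 4, 4), dtype=np.int8)
U_q = np.random.randint(-128, 128, size=(C, 4, 4), dtype=np.int8)
print(fused_tile(V_q, U_q, 0.1, 0.02, 0.01, 0.05).dtype)   # int8
```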

The downside of this approach is that activations are stored in the winograd domain, which represents an expansion of the raw activations. The smaller the tile size, the bigger the expansion. F(2x2,3x3) expands raw activations by 4X, F(4x4,3x3) by 2.25X.

Another possibility is to quantize the activations to even less than 8-bit precision, so that when you perform the Winograd transform, the result uses no more than 8-bits. This probably works well in some applications at least, as there are research results showing accurate classification using low-precision activations.
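One way to estimate how many bits you have to give up is to bound the worst-case growth of the input transform. The sketch below does that for standard B^T matrices of F(2x2, 3x3) and F(4x4, 3x3) (one common choice of interpolation points); the L1-norm bound is a worst case, so typical data will lose less.

```python
import numpy as np

# Input-transform matrices B^T for F(2x2, 3x3) and F(4x4, 3x3)
# (a standard choice of interpolation points).
BT_F2 = np.array([[1,  0, -1,  0],
                  [0,  1,  1,  0],
                  [0, -1,  1,  0],
                  [0,  1,  0, -1]], dtype=np.float64)
BT_F4 = np.array([[4,  0, -5,  0, 1, 0],
                  [0, -4, -4,  1, 1, 0],
                  [0,  4, -4, -1, 1, 0],
                  [0, -2, -1,  2, 1, 0],
                  [0,  2, -1, -2, 1, 0],
                  [0,  4,  0, -5, 0, 1]], dtype=np.float64)

def extra_bits_2d(BT):
    """Worst-case bit growth of the 2-D input transform B^T d B: component
    (i, j) is bounded by max|d| times the product of the L1 norms of rows
    i and j, so the worst case is (max row L1 norm)^2."""
    growth_1d = np.abs(BT).sum(axis=1).max()
    return int(np.ceil(np.log2(growth_1d ** 2)))

print(extra_bits_2d(BT_F2))   # 2 -> give up ~2 bits of input precision to stay in int8
print(extra_bits_2d(BT_F4))   # 7 -> far too much growth to keep 8-bit inputs in int8
```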

Another possibility is to use a 1-D Winograd transform, call it F(2x1, 3x3) or F(4x1, 3x3). This effectively turns the 2-D direct convolution into a 1-D direct convolution nested inside of a 1-D Winograd transform. The arithmetic complexity reduction is smaller, but so are the precision loss and the activation and weight expansion. The computational intensity is also higher, because the multiplications can be computed as matrix multiplications nested inside of a 1-D direct convolution, and this might map to tensor-core-style arithmetic better than even 2-D direct convolution does. Finally, the 1-D Winograd transforms have even better data locality than the 2-D Winograd transforms, which are in turn better than the large-tile FFT method.
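To make the nesting concrete, here is a toy fp32 sketch (quantization omitted, helper names purely illustrative) of a 3x3 'valid' convolution computed as 1-D Winograd F(2,3) along the rows, summed directly over the three filter rows.

```python
import numpy as np

# 1-D Winograd F(2,3) matrices (points 0, 1, -1).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]])
AT = np.array([[1,  1,  1,  0],
               [0,  1, -1, -1]], dtype=np.float64)

def conv1d_f23(x, g):
    """Valid 1-D correlation of x with a length-3 filter g, two outputs per
    F(2,3) tile. Assumes len(x) - 2 is even."""
    n_out = len(x) - 2
    u = G @ g                              # Winograd-domain filter, length 4
    y = np.zeros(n_out)
    for t in range(0, n_out, 2):
        v = BT @ x[t:t + 4]                # input transform of a 4-tap tile
        y[t:t + 2] = AT @ (u * v)          # multiply and inverse transform
    return y

def conv2d_nested(x, g):
    """3x3 'valid' correlation: 1-D Winograd along the rows nested inside a
    direct summation over the three filter rows."""
    H, W = x.shape
    out = np.zeros((H - 2, W - 2))
    for r in range(3):                     # direct convolution over filter rows
        for i in range(H - 2):
            out[i] += conv1d_f23(x[i + r], g[r])
    return out

# Check against a direct 3x3 correlation on a small example.
x = np.random.randn(6, 6)
g = np.random.randn(3, 3)
ref = np.array([[(x[i:i+3, j:j+3] * g).sum() for j in range(4)] for i in range(4)])
print(np.allclose(conv2d_nested(x, g), ref))   # True
```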

As an aside, I would like to point out that the effect of locality on the minimum workspace size is missing from recent analyses of fast algorithms for convnets, even though our original publication exploited Winograd locality to fit the entire working set in the GPU's small shared memory space. Obviously small-tile convolutions make possible instruction schedules that have fewer cache misses than large-tile (FFT) convolution algorithms do.

I hope this gives you some ideas!

andravin avatar Oct 10 '18 17:10 andravin

@andravin @manojrohit Thank you very much for your advice. I have implemented int8 Winograd F(2,3) on the ARM platform. In my implementation, it is faster than fp32 Winograd F(6,3) in most cases, and it is part of an open-source project : ) ncnn int8 pr naive c code

BUG1989 avatar Feb 27 '19 02:02 BUG1989

Thank you for sharing the code @BUG1989. Any comments about accuracy degradation?

manojrohit avatar Feb 27 '19 08:02 manojrohit

@manojrohit In my project, the int8 Winograd F(2,3) has the same accuracy as the original int8 conv3x3s1.

BUG1989 avatar Feb 27 '19 08:02 BUG1989

Another thing to try with int8 winograd is to quantize each of the winograd components separately.

This might be especially helpful when the input to the convolutional layer is the output of a ReLU activation. In that case, the input is nonnegative, so the winograd component with input transform [0,1,1,0] is also nonnegative. The other winograd components, with transforms [0,1,-1,0], [1,0,-1,0], and [0,-1,0,1], are signed with expected mean of zero (you might have a sign flip in any of these components depending on how you compute the transform).

You probably capture an extra bit of dynamic range if you map the [0,1,1,0] components to an unsigned int8 with range [0,255] and the other components to a signed int8 with range [-128,127]. You just have to be careful to scale the components appropriately when performing the inverse transform.
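Here is a small numpy sketch of that idea for the 1-D F(2,3) input transform, with per-component scales chosen from the data; the layout and helper name are hypothetical, and a real kernel would keep separate uint8/int8 buffers and fold the scales into the inverse transform.

```python
import numpy as np

BT = np.array([[1,  0, -1,  0],    # signed component
               [0,  1,  1,  0],    # nonnegative when the input is post-ReLU
               [0, -1,  1,  0],    # signed component
               [0,  1,  0, -1]],   # signed component
              dtype=np.float64)

def quantize_components(d_tiles):
    """Quantize each F(2,3) input-transform component with its own scale:
    uint8 range [0, 255] for the nonnegative [0,1,1,0] component, int8
    range [-128, 127] for the signed ones. d_tiles: (N, 4) post-ReLU tiles."""
    V = d_tiles @ BT.T                      # (N, 4) Winograd components
    q = np.empty_like(V, dtype=np.int16)    # int16 only to hold both ranges here
    scales = np.empty(4)
    for c in range(4):
        if c == 1:                          # [0,1,1,0]: nonnegative, use [0, 255]
            scales[c] = max(V[:, c].max() / 255.0, 1e-8)
            q[:, c] = np.clip(np.round(V[:, c] / scales[c]), 0, 255)
        else:                               # signed, use [-128, 127]
            scales[c] = max(np.abs(V[:, c]).max() / 127.0, 1e-8)
            q[:, c] = np.clip(np.round(V[:, c] / scales[c]), -128, 127)
    return q, scales                        # fold scales into the inverse transform

tiles = np.maximum(np.random.randn(1000, 4), 0.0)   # ReLU output: nonnegative
q, scales = quantize_components(tiles)
print(q[:, 1].min() >= 0)                           # True: the unsigned component gets extra headroom
```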

andravin avatar Feb 28 '19 19:02 andravin