CLBlast
INT8 version of GEMM?
Hi
I am looking for an INT8 version of GEMM in OpenCL. If I am correct, CLBlast does not yet support it. Please correct me if I am wrong and comment on the usage (perhaps a sample app, etc.).
Supposing an INT8 variant is not yet present in CLBlast, have you come across any other work that you would recommend? I did run into this repo https://github.com/strin/gemm-android & then ARM's compute library https://github.com/ARM-software/ComputeLibrary/blob/master/src/core/CL/cl_kernels/gemm.cl
My goal is to extend my project https://github.com/sat8/YoloOCLInference to support INT8 models during inference. I have gathered a few initial details on how to go about quantization from TensorFlow https://www.tensorflow.org/performance/quantization and would like to implement it in my project, but I am in need of an INT8 version of GEMM. TensorFlow refers to https://github.com/google/gemmlowp, which is a CPU-only, NEON-optimized GEMM library.
Any thoughts or comments would be appreciated.
I haven't done the research on INT8 yet, so I don't know of any other GEMM implementations with INT8.
Nevertheless, I think INT8 is an interesting topic for CLBlast. Having tackled FP16 already, I'd be willing to spend time on implementing such a feature, but I don't think it's easy: many things will have to change on both the host and device side when going from floating-point to fixed-point. Also, what kind of hardware would you run this on? Hardware with native INT8 support? Does ARM Mali support this (given that it's in ARM's compute library)? Or do they pack 4 values together in a 32-bit integer? I'll have to read up on this topic a bit more in order to give you a proper answer.
Thanks for the response.
"Or do they pack 4 values together in a 32-bit integer?" I think this may be true. You may want to check out https://github.com/google/gemmlowp/blob/master/doc/quantization.md & the reference code https://github.com/google/gemmlowp/blob/master/doc/quantization_example.cc
In the TensorFlow documentation, they highlight the range for mapping float to unsigned char, based on experimentation.
If my understanding is correct, INT8 is not a special datatype; rather, it's just an unsigned char value. Of course, with multiplication & other math ops, a bit depth of more than 8 may be required for the output. For example, GEMM on INT8 inputs may produce a 32-bit unsigned int.
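To make that mapping concrete, here is a minimal C sketch of the affine quantization scheme described in the gemmlowp doc linked above (real_value = scale * (quantized_value - zero_point)); the min/max range and the helper names here are only assumptions for illustration, not anything from gemmlowp itself:

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Affine quantization: real_value = scale * (quantized_value - zero_point).
 * The min/max range would come from calibration/experimentation,
 * as the TensorFlow docs describe. */
typedef struct {
    float scale;
    uint8_t zero_point;
} QuantParams;

static QuantParams choose_params(float min, float max) {
    QuantParams p;
    p.scale = (max - min) / 255.0f;
    float zp = roundf(-min / p.scale);
    if (zp < 0.0f) zp = 0.0f;
    if (zp > 255.0f) zp = 255.0f;
    p.zero_point = (uint8_t)zp;
    return p;
}

static uint8_t quantize(QuantParams p, float x) {
    float q = roundf(x / p.scale + (float)p.zero_point);
    if (q < 0.0f) q = 0.0f;
    if (q > 255.0f) q = 255.0f;
    return (uint8_t)q;
}

static float dequantize(QuantParams p, uint8_t q) {
    return p.scale * ((float)q - (float)p.zero_point);
}

int main(void) {
    QuantParams p = choose_params(-1.0f, 1.0f);  /* example range */
    uint8_t q = quantize(p, 0.5f);
    printf("0.5 -> %u -> %f\n", q, dequantize(p, q));
    return 0;
}
```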
"Also, what kind of hardware would you run this on?" I am thinking of using low-precision GEMM on an Asus Tinker Board that has a Mali™-T764 GPU, an AMD RX 580 & a GTX 1080 Ti. At this point, I am not sure of the speedup factor that INT8-based inference could produce over pure floating-point math, but I would like to validate it to know it better.
"Hardware with native INT8 support?" NVIDIA cards do seem to have instructions such as dp4a which could generate some speedup, but I am unsure about where such instructions are exposed in OpenCL on any hardware. For now, I am aiming to compare FP32 vs INT8 deep learning inference, supposing GEMM is in INT8 and I optimize my inference kernels using byte data. I would think doing so would certainly generate a speedup, as is widely claimed by almost all hardware vendors. Any hardware-native INT8 optimizations could come later in my project.
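For reference, dp4a computes a 4-way dot product of 8-bit values packed into two 32-bit words, accumulated into a 32-bit integer. Below is a plain-C emulation just to illustrate what that packed operation computes; the function name is made up and this is not a real OpenCL or CUDA built-in (how, or whether, vendors expose it in OpenCL varies):

```c
#include <stdint.h>
#include <stdio.h>

/* Emulation of a dp4a-style operation: multiply four signed 8-bit lanes
 * packed into each of two 32-bit words and accumulate into 32 bits. */
static int32_t dp4a_emulated(uint32_t a, uint32_t b, int32_t acc) {
    for (int i = 0; i < 4; ++i) {
        int8_t ai = (int8_t)((a >> (8 * i)) & 0xFF);
        int8_t bi = (int8_t)((b >> (8 * i)) & 0xFF);
        acc += (int32_t)ai * (int32_t)bi;
    }
    return acc;
}

int main(void) {
    /* Pack {1, -2, 3, 4} and {5, 6, -7, 8} into 32-bit words. */
    uint32_t a = (uint32_t)(uint8_t)1 | ((uint32_t)(uint8_t)-2 << 8) |
                 ((uint32_t)(uint8_t)3 << 16) | ((uint32_t)(uint8_t)4 << 24);
    uint32_t b = (uint32_t)(uint8_t)5 | ((uint32_t)(uint8_t)6 << 8) |
                 ((uint32_t)(uint8_t)-7 << 16) | ((uint32_t)(uint8_t)8 << 24);
    /* 1*5 + (-2)*6 + 3*(-7) + 4*8 = 4 */
    printf("%d\n", (int)dp4a_emulated(a, b, 0));
    return 0;
}
```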
@sat8 https://github.com/naibaf7/caffe has experimental INT8 kernels for both CUDA and OpenCL, if you're still interested in playing with this.
Edit: I have to mention you'll not have the greatest time performance-wise. It turns out int8 FMAD is probably going to end up being int32 FMAD on AMD cards, and the additional computations for quantization have a cost as well, especially in shared memory and register usage. I haven't seen a DP4A equivalent on either AMD or Mali.
Do you have any update on this issue now? Or a roadmap? Thanks.
No, not really. I'm not sure if I will ever work on this; other things have priority. But contributors are free to work on this, of course.
What hardware would you run it on? What use-case do you have?
Hi, I worked on a kind of miner algorithm; it needs batches of 256-by-256 int8-to-int16 matrix multiplications. For NVIDIA CUDA this is already done, but AMD OpenCL does not seem to have a solution yet.
Since you do not have plans for this, I think I will work it out myself.
Well, you could try naibaf7's implementation as mentioned above. But as he says, there is not much hardware support for INT8 multiplications, so you probably won't gain much (or will actually lose) compared to FP32.
@CNugteren Thanks for the info. Much appreciated.
INT8 GEMM is usually done as s8s8s32, e.g. int c = (int8_t)a * (int8_t)b; the inputs use int8_t and the result uses a 32-bit int.
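For clarity, here is a naive reference of that s8s8s32 convention in plain C; this is only a sketch of the semantics (row-major layout assumed), not how an optimized library kernel would implement it:

```c
#include <stdint.h>

/* s8s8s32 GEMM reference: int8_t inputs A (m x k, row-major) and
 * B (k x n, row-major), int32_t output C (m x n). Each product fits
 * in 16 bits; the 32-bit accumulator avoids overflow for reasonable k. */
void gemm_s8s8s32(int m, int n, int k,
                  const int8_t *A, const int8_t *B, int32_t *C) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            int32_t acc = 0;
            for (int p = 0; p < k; ++p) {
                acc += (int32_t)A[i * k + p] * (int32_t)B[p * n + j];
            }
            C[i * n + j] = acc;
        }
    }
}
```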