DeepShift
Wondering about CPU performance with mixed assembly
Is it possible to apply mixed assembly techniques to DeepShift to achieve a large speedup (>=5x) in inference?
I believe that to obtain a large speedup like >=5x, we need hardware that operates natively on 4-bit values (to represent the weights), instead of dealing with 8-bit values and having to do masking and bit extraction. Perhaps some embedded AI chips or embedded GPUs may have that.
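For illustration, here is a minimal sketch (with hypothetical helper names, not DeepShift's actual kernel code) of the masking and bit extraction that byte-addressable hardware forces on us when two 4-bit weights are packed into one byte:

```c
#include <stdint.h>
#include <stdio.h>

/* Two 4-bit weights packed into one uint8_t. On byte-addressable
   hardware, every weight read pays for a mask/shift to extract the
   nibble, plus a sign extension back to full register width. */
static inline int8_t unpack_lo(uint8_t packed) {
    return (int8_t)(packed << 4) >> 4;  /* low nibble, sign-extended */
}
static inline int8_t unpack_hi(uint8_t packed) {
    return (int8_t)packed >> 4;         /* high nibble, sign-extended */
}

int main(void) {
    uint8_t packed = 0xA3;  /* high nibble 0xA (-6), low nibble 0x3 (+3) */
    printf("lo=%d hi=%d\n", unpack_lo(packed), unpack_hi(packed));
    return 0;
}
```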
Having said that, there's still a lot of room to speed up the CPU and CUDA kernels we have developed: fusing convolution with elementwise operations, using JIT compilers instead of pre-built binaries, and tuning tile sizes.
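As a concrete example of the first item, here is a minimal sketch (hypothetical 1-D shapes and names, not our actual kernels) of fusing an elementwise post-op into the convolution loop, so the bias add and ReLU happen while the accumulator is still in a register instead of in a second pass over the output buffer:

```c
#include <stddef.h>

/* 1-D convolution with bias + ReLU fused into the accumulation loop. */
void conv1d_fused(const float *x, const float *w, float bias,
                  float *y, size_t n, size_t k) {
    for (size_t i = 0; i + k <= n; ++i) {
        float acc = bias;                 /* start from the bias */
        for (size_t j = 0; j < k; ++j)
            acc += x[i + j] * w[j];       /* convolution accumulation */
        y[i] = acc > 0.0f ? acc : 0.0f;   /* fused ReLU, no extra pass */
    }
}
```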
For CPU, we can look for instruction set architectures that have vector (i.e., SIMD) instructions for bitwise shifts and sign flips.
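For example, x86 CPUs with AVX2 already provide per-lane variable shifts (`vpsllvd` / `_mm256_sllv_epi32`) and sign flips (`vpsignd` / `_mm256_sign_epi32`). A sketch (hypothetical function and buffer names, assuming int32 activations and per-element shift/sign arrays) of a shift-based "multiply" applied 8 lanes at a time, with no integer multiply in the inner loop:

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

void shift_mul_avx2(const int32_t *x, const int32_t *shift,
                    const int32_t *sign, int32_t *y, size_t n) {
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256i vx = _mm256_loadu_si256((const __m256i *)(x + i));
        __m256i vs = _mm256_loadu_si256((const __m256i *)(shift + i));
        __m256i vg = _mm256_loadu_si256((const __m256i *)(sign + i));
        __m256i shifted = _mm256_sllv_epi32(vx, vs);   /* x << shift, per lane */
        __m256i out = _mm256_sign_epi32(shifted, vg);  /* negate/zero/keep by sign */
        _mm256_storeu_si256((__m256i *)(y + i), out);
    }
    for (size_t i = n & ~(size_t)7; i < n; ++i) {      /* scalar tail */
        int32_t v = x[i] << shift[i];
        y[i] = sign[i] < 0 ? -v : (sign[i] == 0 ? 0 : v);
    }
}
```

ARM NEON has analogous instructions (e.g., `vshlq_s32` takes per-lane signed shift counts), so a similar kernel should be portable to embedded CPUs as well.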