sockeye
Translation speed with int8 quantization
Hi, I see in your paper that translation with int8 quantization can be about 2x faster than with the fp32 model. However, I did not get any speed improvement when I ran translation with int8. Are there any suggestions or tutorials I can follow?
The benchmarks in the paper run a WMT17 En-De big transformer with batch size 1 on a c5.2xlarge EC2 instance. A difference in any of these dimensions (model size, batch size, or hardware) can change the relative speeds of FP32 and INT8 inference. The sockeye scripts in the arxiv_sockeye3 branch can be used to replicate the benchmarks from the paper.
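To see how much these dimensions matter on your own machine, you can measure a generic PyTorch dynamic-quantization sketch like the one below. This is not Sockeye's actual inference code path, just an illustration with a toy feed-forward stack: whether int8 wins over fp32 depends heavily on CPU instruction support (e.g. AVX-512 VNNI on c5 instances), batch size, and layer dimensions, which is why results can differ from the paper's numbers.

```python
# Generic dynamic int8 quantization timing sketch (NOT Sockeye's implementation):
# weights are stored as int8 and activations are quantized on the fly.
import time

import torch
import torch.nn as nn

# Toy stand-in for transformer feed-forward layers.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
).eval()

# Quantize all nn.Linear modules to int8 weights.
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)  # batch size 1, as in the paper's benchmark


def bench(m, n=50):
    """Average per-iteration latency in seconds."""
    with torch.inference_mode():
        m(x)  # warm-up
        start = time.perf_counter()
        for _ in range(n):
            m(x)
    return (time.perf_counter() - start) / n


print(f"fp32: {bench(model) * 1e3:.2f} ms/iter")
print(f"int8: {bench(qmodel) * 1e3:.2f} ms/iter")
```

Rerunning this with a larger batch size or on a CPU without int8 vector instructions can shrink or even invert the gap, which is the same effect described above for the full model.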