translation speed with quantization into int8

Open • ZhenYangIACAS opened this issue 1 year ago • 1 comment

Hi, your paper reports that translation with INT8 quantization can be 2 times faster than with the FP32 model. However, I did not see any speed improvement when I ran translation with INT8. Are there any suggestions or tutorials I can follow?

ZhenYangIACAS, Sep 22 '22 13:09

The benchmarks in the paper run a WMT17 En-De big transformer with batch size 1 on a c5.2xlarge EC2 instance. A different model, batch size, or hardware can change the relative speed of FP32 and INT8 inference. The Sockeye scripts in the arxiv_sockeye3 branch can be used to replicate the benchmarks from the paper.

mjdenkowski, Sep 22 '22 13:09
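
For anyone comparing the two setups locally, here is a minimal timing sketch. It is not taken from the Sockeye repository or the arxiv_sockeye3 scripts. It assumes you have wrapped each model in a function that takes one source sentence and returns its translation; the names `benchmark` and `speedup` are hypothetical helpers introduced only for this illustration. It feeds sentences one at a time, which mirrors the batch size 1 setting mentioned above.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence


def benchmark(translate: Callable[[str], str],
              sentences: Sequence[str],
              warmup: int = 2,
              repeats: int = 3) -> Dict[str, float]:
    """Time `translate` on `sentences`, one sentence per call (batch size 1).

    `translate` is a hypothetical wrapper around whichever model you are
    testing; this helper makes no assumption about how it is implemented.
    """
    if not sentences:
        raise ValueError("need at least one sentence to benchmark")

    # Warm-up passes, so one-time costs (lazy loading, caches) are not timed.
    for _ in range(warmup):
        for sentence in sentences:
            translate(sentence)

    # Timed passes: record the wall-clock time of each full pass.
    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        for sentence in sentences:
            translate(sentence)
        timings.append(time.perf_counter() - start)

    # The median is less sensitive to a single slow pass than the mean.
    median = statistics.median(timings)
    return {
        "median_seconds": median,
        "sentences_per_second": len(sentences) / median,
    }


def speedup(baseline: Dict[str, float], candidate: Dict[str, float]) -> float:
    """Return how many times faster `candidate` is than `baseline`."""
    return baseline["median_seconds"] / candidate["median_seconds"]
```

To use it, call `benchmark` once with the FP32 wrapper and once with the INT8 wrapper on the same sentences and the same machine, then pass both results to `speedup`. A value near 1.0 means the two setups run at about the same speed in your environment.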