
FP16 vs INT8 inference speed

Open jaywonchung opened this issue 2 years ago • 0 comments

Hi,

Amazing framework. Thanks a lot.

I was comparing the inference speed (in seconds) of python -m fastchat.serve.cli with and without --load-8bit. For both the Vicuna 7B and 13B models, --load-8bit is a bit more than twice as slow as fp16 inference on an NVIDIA A40 GPU (which has third-generation tensor cores).

I'm trying to understand the reason behind this. It seems like CLinear performs a lot of cast operations; could those account for this much slowdown, given that the matmul operations themselves weren't particularly slow on the A40 to begin with?
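
To check my intuition, I put together a rough standalone microbenchmark of the pattern I suspect is the bottleneck: int8 weights that are cast back to fp16 (and rescaled) before every matmul, versus a plain fp16 matmul. This is not FastChat's actual CLinear code, just a sketch of the same dequantize-then-matmul idea; the dimensions, quantization scheme, and iteration counts are arbitrary, and it assumes PyTorch with a CUDA GPU.

```python
# Rough comparison: plain fp16 matmul vs. int8 weights dequantized on the fly.
# Not FastChat's CLinear, just an approximation of the dequantize-then-matmul pattern.
import time
import torch

device = "cuda"
batch, din, dout = 8, 5120, 5120  # roughly 13B-sized hidden dims (illustrative)

x = torch.randn(batch, din, device=device, dtype=torch.float16)
w_fp16 = torch.randn(dout, din, device=device, dtype=torch.float16)

# Per-output-channel symmetric quantization of the weight to int8.
scale = w_fp16.abs().amax(dim=1, keepdim=True) / 127.0
w_int8 = torch.clamp((w_fp16 / scale).round(), -128, 127).to(torch.int8)

def bench(fn, iters=200):
    for _ in range(10):  # warmup
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

fp16_time = bench(lambda: x @ w_fp16.t())
# Dequantize every call: int8 -> fp16 cast plus per-channel scaling, then the same matmul.
int8_time = bench(lambda: x @ (w_int8.to(torch.float16) * scale).t())

print(f"fp16 matmul:         {fp16_time * 1e6:.1f} us/iter")
print(f"dequantize + matmul: {int8_time * 1e6:.1f} us/iter")
```

If the extra cast/scale work in the second case dominates on hardware with fast fp16 tensor cores, that would plausibly line up with the ~2x slowdown I'm seeing, but I may be missing something about how the 8-bit path actually works.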

Thanks.

jaywonchung · Jun 15 '23 03:06