FastChat
FP16 vs INT8 inference speed
Hi,
Amazing framework. Thanks a lot.
I was comparing the inference speed (seconds) of python -m fastchat.serve.cli with and without --load-8bit.
With both the Vicuna 7B and 13B models, --load-8bit is a bit more than twice as slow as FP16 inference on an NVIDIA A40 GPU (which has third-generation Tensor Cores).
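For reference, this is roughly how I time a single generation once a model is in memory (a minimal sketch; `model`, `tokenizer`, and the prompt are placeholders for whatever the FP16 or --load-8bit run loaded, not FastChat's own benchmarking code):

```python
import time

import torch


def time_generation(model, tokenizer, prompt, max_new_tokens=256):
    """Return wall-clock seconds for one greedy generation pass.

    `model` and `tokenizer` stand in for whatever the FP16 or
    --load-8bit run produced; only the timing pattern matters here.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    return time.perf_counter() - start
```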
I'm trying to understand the reason behind this.
It seems like CLinear would involve a lot of cast operations; could that alone account for this much slowdown, given that the matmul operations themselves weren't particularly slow on the A40 to begin with?
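For concreteness, this is my mental model of where that per-call overhead would come from. It is not FastChat's actual CLinear code, just a sketch of a dequantize-then-matmul linear layer whose int8 weights are cast back to fp16 on every forward pass:

```python
import torch
import torch.nn.functional as F


class DequantLinear(torch.nn.Module):
    """Sketch of a dequantize-then-matmul linear layer (not the real CLinear).

    Weights are stored as int8 with a per-output-channel fp16 scale, and
    every forward pass casts them back to fp16 before the matmul, which is
    the kind of extra work I suspect dominates on the A40.
    """

    def __init__(self, weight_fp16, bias=None):
        super().__init__()
        # Per-output-channel symmetric quantization to int8.
        scale = weight_fp16.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        q = torch.round(weight_fp16 / scale).clamp(-127, 127).to(torch.int8)
        self.register_buffer("weight_int8", q)
        self.register_buffer("scale", scale.to(torch.float16))
        self.bias = bias

    def forward(self, x):
        # The per-call overhead: an int8 -> fp16 cast plus a scale multiply
        # before the fp16 matmul even starts.
        weight = self.weight_int8.to(torch.float16) * self.scale
        return F.linear(x, weight, self.bias)
```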
Thanks.