llm-awq
Question about the speed of tiny-chat
Hi,
Thanks for your outstanding work. I have seen the results for tiny-chat using LLaMA2-7b.
I have reproduced LLaMA2-7b on a single A100; here are my results.
My best result is 20.89 ms/token, which does not reach the reported 12.44 ms/token, even though an A100 should be faster than an A6000. I wonder what's wrong.
@benyang0506 I have an explanation to offer. The benchmarks provided by this repository are correct, but they do not mention the CPU used to measure the latency, which plays a big part. I am not exactly sure where the overhead occurs in TinyChat, but it is confirmed to be the CPU slowing things down, partly because it runs single-threaded.
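If you want to check this on your own machine, one way is to profile a few decode steps and compare CPU time against CUDA time. This is only a generic sketch (it assumes a plain FP16 HF Transformers checkpoint and a greedy loop, not TinyChat's own kernels or benchmark script), but the CPU-vs-CUDA split tells the same story:

```python
# Rough sketch: check whether single-token decoding is CPU-bound.
# Checkpoint name and step counts are assumptions, not TinyChat's settings.
import torch
from torch.profiler import profile, ProfilerActivity
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # assumed FP16 checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).cuda().eval()

ids = tok("Hello", return_tensors="pt").input_ids.cuda()

with torch.inference_mode():
    out = model(ids, use_cache=True)  # prefill to populate the KV cache
    past, next_id = out.past_key_values, out.logits[:, -1:].argmax(-1)

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(32):  # profile 32 single-token decode steps
            out = model(next_id, past_key_values=past, use_cache=True)
            past, next_id = out.past_key_values, out.logits[:, -1:].argmax(-1)

# If "Self CPU" time per step rivals or exceeds "Self CUDA" time, the GPU is
# waiting on the Python loop and kernel launches, i.e. the CPU is the bottleneck.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```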
For example, I was able to get the following:
- RTX 4090 + Intel i9-13900K: 7.46-8.52 ms/token
- RTX 4090 + AMD EPYC 7-Series: 17.71-18.6 ms/token
- H100: 10.82 ms/token
In essence, a larger GPU will not speed up inference unless your CPU's single-thread performance can keep up.
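When comparing ms/token numbers across machines, it also helps to time the decode loop the same way everywhere. Here is a minimal timing sketch, again assuming a plain HF Transformers greedy loop with an assumed checkpoint name rather than the TinyChat benchmark itself:

```python
# Rough sketch of per-token decode latency measurement.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # assumed FP16 checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).cuda().eval()

ids = tok("The quick brown fox", return_tensors="pt").input_ids.cuda()

latencies = []
with torch.inference_mode():
    out = model(ids, use_cache=True)  # prefill to populate the KV cache
    past, next_id = out.past_key_values, out.logits[:, -1:].argmax(-1)

    for _ in range(64):  # time 64 single-token decode steps
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        out = model(next_id, past_key_values=past, use_cache=True)
        torch.cuda.synchronize()  # wait for the GPU before stopping the clock
        latencies.append(time.perf_counter() - t0)
        past, next_id = out.past_key_values, out.logits[:, -1:].argmax(-1)

latencies.sort()
print(f"median decode latency: {1000 * latencies[len(latencies) // 2]:.2f} ms/token")
```

The `torch.cuda.synchronize()` calls matter: without them you mostly measure how fast the CPU can enqueue kernels, which is exactly the part that differs between CPUs.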
EDIT: Also note that the authors have stated they are working on a new and improved version of TinyChat that should alleviate this kind of issue.