
Question about the speed of tiny-chat

Open · benyang0506 opened this issue on Aug 10 '23 · 1 comment

Hi, thanks for your outstanding work. I have seen the published TinyChat results for LLaMA-2-7B. [image: reported TinyChat benchmarks] I reproduced LLaMA-2-7B on a single A100; here are my results. [image: my benchmark results] My best result is 20.89 ms/token, which does not reach the reported 12.44 ms/token, even though an A100 should be faster than an A6000. I wonder what is wrong.
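
For reference, here is a minimal sketch of how ms/token can be measured (my own illustrative harness, not TinyChat's benchmark code; it assumes a Hugging Face-style causal LM on CUDA, greedy decoding, batch size 1):

```python
import time
import torch

@torch.inference_mode()
def ms_per_token(model, input_ids, n_tokens=128, warmup=8):
    # Prefill the prompt once so only decode steps are timed.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values
    tok = out.logits[:, -1:, :].argmax(-1)
    for i in range(warmup + n_tokens):
        if i == warmup:                      # start timing after warm-up
            torch.cuda.synchronize()
            start = time.perf_counter()
        out = model(tok, past_key_values=past, use_cache=True)
        past = out.past_key_values
        tok = out.logits[:, -1:, :].argmax(-1)  # greedy next token
    torch.cuda.synchronize()                 # include all queued GPU work
    return (time.perf_counter() - start) * 1000 / n_tokens
```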

benyang0506 commented on Aug 10 '23, 09:08

@benyang0506 I have an explanation to offer you. The benchmarks provided by this repository are correct, but they fail to mention the CPU used to measure the latency, which plays a big part. I am not exactly sure where the overhead occurs in TinyChat, but it is confirmed that the CPU is slowing things down, partly because it runs single-threaded.

For example, I was able to get the following:

  • RTX 4090 + Intel i9 13900K: 7.46–8.52 ms/token
  • RTX 4090 + AMD EPYC 7-Series: 17.71–18.6 ms/token
  • H100: 10.82 ms/token

In essence, a larger GPU will not speed up inference unless your CPU's single-threaded performance can keep up.
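
One way to see this for yourself is a standalone probe (my own sketch, not TinyChat code): launch many trivially small kernels and time them. The GPU finishes each one almost instantly, so the wall clock mostly measures the single CPU thread issuing the launches:

```python
import time
import torch

def launch_overhead_ms(n=1000):
    # Each iteration launches one tiny elementwise kernel; with this
    # little GPU work, the loop is dominated by CPU launch overhead.
    x = torch.randn(64, 64, device="cuda")
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n):
        x = x * 1.0001
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1000 / n  # ms per launch
```

Multiply that per-launch cost by the number of kernels in one decode step; if the product approaches your ms/token figure, the CPU rather than the GPU is setting the floor, which is why the two 4090 numbers above diverge so much between CPUs.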

EDIT: Also note that the authors have stated they are working on a new and improved version of TinyChat that should alleviate this kind of issue.

casper-hansen commented on Aug 11 '23, 08:08