
How to make sure the local TGI server's performance is OK

lichangW opened this issue 1 year ago · 5 comments

Feature request

Hello, I just deployed the TGI server in a Docker container on a single A100 following the docs and ran a load test with bloom-7b1, but the performance has come a long way from other inference servers, like vLLM and FasterTransformer, in the same environment and conditions. Is there something like an official performance table that a beginner like me can use to make sure the performance is OK, or detailed instructions for checking and tuning options to improve throughput? Thanks a lot!
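For reference, the kind of single-GPU deployment described above would look roughly like the Docker quickstart from the TGI docs. This is only a sketch: the image tag, port mapping, and volume path are assumptions to adapt to your setup.

```shell
# Launch TGI for bloom-7b1 on a single GPU (roughly the documented quickstart;
# adjust the image tag, port, and volume path for your environment).
model=bigscience/bloom-7b1
volume=$PWD/data   # cache the downloaded weights between restarts

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model

# Quick sanity check against the /generate endpoint once the server is up.
curl http://localhost:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 64}}'
```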

Motivation

None

Your contribution

None

lichangW avatar Jul 28 '23 07:07 lichangW

has come a long way from other inference servers

What do you mean? Is it faster or slower? I'm guessing slower, but the phrasing isn't clear to me.

Usually using text-generation-benchmark --tokenizer-name xxxx is our way of checking a given deployment.

What kind of numbers are you seeing? How are you testing?

Note: doing benchmarks in general is hard, and it's easy to reach the wrong conclusion if you don't understand what's going on.
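As a concrete starting point, an invocation along these lines can be used (a sketch only: the benchmark is typically run inside the container where the server was launched, and the batch sizes and flag combinations below are illustrative; check text-generation-benchmark --help for the exact options):

```shell
# Exec into the running TGI container and benchmark the already-launched model.
# Batch sizes, sequence length, and decode length below are illustrative values.
docker exec -it <tgi-container-name> \
    text-generation-benchmark \
    --tokenizer-name bigscience/bloom-7b1 \
    --sequence-length 512 \
    --decode-length 1024 \
    --batch-size 1 --batch-size 8 --batch-size 32
```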

Narsil avatar Jul 28 '23 10:07 Narsil

Thanks for the reply. Yes, it's much slower than the others when we test on a single A100 environment with the same dataset and load-test script. I also tested with text-generation-benchmark --tokenizer-name bigscience/bloom-7b1: [screenshot of benchmark output]

And with text-generation-benchmark --tokenizer-name bigscience/bloom-7b1 --decode-length 1024: [screenshot of benchmark output]

Any suggestions would be appreciated, thanks in advance!

lichangW avatar Jul 31 '23 03:07 lichangW

@Narsil Does text-generation-benchmark also test the continuous batching mentioned in the router, given that we can only set the decode length and sequence length?

ZhaiFeiyue avatar Sep 19 '23 07:09 ZhaiFeiyue

It doesn't test it per se, since when continuous batching is active many things can be happening at the same time.

But every performance number is dominated by the number of tokens in the decode phase, so this is really what you should be looking at.
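As a back-of-the-envelope illustration (the timings here are purely hypothetical, not measured numbers): if prefilling a 512-token prompt takes on the order of 0.1 s and each decode step takes roughly 15 ms per token, then for a 1024-token decode

$$
T \approx T_{\text{prefill}} + N_{\text{decode}} \cdot t_{\text{token}} \approx 0.1\,\text{s} + 1024 \times 0.015\,\text{s} \approx 15.5\,\text{s},
$$

so the decode phase accounts for over 99% of the end-to-end latency, which is why the number of decode tokens is the main thing to compare across runs.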

Narsil avatar Sep 19 '23 07:09 Narsil

@Narsil thanks

ZhaiFeiyue avatar Sep 19 '23 07:09 ZhaiFeiyue

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Apr 20 '24 01:04 github-actions[bot]