text-generation-inference
How to make sure the local TGI server's performance is OK
Feature request
Hello, I just deployed the TGI server in a Docker container on a single A100, following the docs, and ran a load test with bloom-7b1, but the performance has come a long way from other inference servers, like vLLM and FasterTransformer, under the same environment and conditions. So, is there something like an official performance table that a beginner like me can use to confirm the performance is OK, or are there detailed instructions for checking and setting options to improve throughput? Thanks a lot!
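Roughly what I ran, following the quickstart from the docs (the image tag, port mapping, and volume path below are illustrative, not my exact command):

```shell
# Assumed deployment, along the lines of the standard docker quickstart from the TGI docs.
model=bigscience/bloom-7b1
volume=$PWD/data   # where the model weights get cached

docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model
```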
Motivation
None
Your contribution
None
has come a long way from other inference servers
What do you mean? Is it faster or slower? I'm guessing slower, but the phrasing isn't clear to me.
Usually, running text-generation-benchmark --tokenizer-name xxxx is our way of checking a given deployment.
What kind of numbers are you seeing? How are you testing?
Note: doing benchmarks in general is hard, and it's easy to reach a wrong conclusion if you don't understand what's going on.
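For instance, something along these lines; the benchmark tool ships inside the TGI container and talks to the already-running shard, so run it from inside the container. The --batch-size flag and its repetition are from memory and may differ in your version, so check --help:

```shell
# Assumed invocation of the built-in benchmark against the running shard.
# --sequence-length / --decode-length are the two knobs discussed in this thread;
# the repeated --batch-size values here are illustrative.
docker exec -it <tgi-container> \
    text-generation-benchmark \
    --tokenizer-name bigscience/bloom-7b1 \
    --sequence-length 512 \
    --decode-length 128 \
    --batch-size 1 --batch-size 8 --batch-size 32
```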
Thanks for the reply.
Yes, it's much slower than the others when we test on a single A100 environment with the same dataset and load-test script.
I also tested with text-generation-benchmark --tokenizer-name bigscience/bloom-7b1:
And with text-generation-benchmark --tokenizer-name bigscience/bloom-7b1 --decode-length 1024:
Please share any suggestions you have, thanks in advance!
@Narsil Does text-generation-benchmark also test the continuous batching mentioned in the router, since we can only set Decode Length and Sequence Length?
It doesn't test it per se, since when continuous batching is active many things can be happening at the same time.
But every performance number is dominated by the number of tokens in the decode
phase, so that is really what you should be looking at.
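One quick way to see that for yourself, independent of the benchmark tool, is to hit the server's /generate endpoint with different max_new_tokens values and compare latencies; the port below assumes the -p 8080:80 mapping from the deployment sketch earlier in the thread:

```shell
# Latency grows roughly linearly with the number of decoded tokens,
# since each new token requires another forward pass through the model.
# Port 8080 is assumed from the docker -p 8080:80 mapping above.
for n in 16 64 256 1024; do
    echo "max_new_tokens=$n"
    time curl -s http://127.0.0.1:8080/generate \
        -X POST \
        -H 'Content-Type: application/json' \
        -d "{\"inputs\": \"What is deep learning?\", \"parameters\": {\"max_new_tokens\": $n}}" \
        > /dev/null
done
```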
@Narsil thanks
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.