
feat: Add load tests

Open Hugoch opened this issue 1 year ago • 4 comments

What does this PR do?

This PR adds automated load tests in CI using Grafana k6.

Two tests are performed:

  • Constant virtual users (VUs) load test: simulates a fixed pool of users making as many requests as possible against the API for 60 seconds.
  • Constant arrival rate load test: simulates a constant rate of incoming user requests, independent of the system's response rate, for 60 seconds.
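The two load profiles above map directly onto k6 scenario executors. A minimal sketch of the `options.scenarios` configuration (the VU count, rate, and scenario names here are illustrative placeholders, not the PR's actual values):

```javascript
// Sketch of k6 scenario options for the two load profiles.
// Numbers (10 VUs, 5 req/s, 50 pre-allocated VUs) are illustrative.
const options = {
  scenarios: {
    // Fixed pool of users hammering the API for 60 seconds.
    constant_vus: {
      executor: 'constant-vus',
      vus: 10,
      duration: '60s',
    },
    // Fixed arrival rate, independent of how fast responses come back.
    constant_arrival_rate: {
      executor: 'constant-arrival-rate',
      rate: 5,              // iterations started per timeUnit
      timeUnit: '1s',
      duration: '60s',
      preAllocatedVUs: 50,  // VU pool available to sustain the rate
    },
  },
};

console.log(Object.keys(options.scenarios).join(','));
```

The key difference: `constant-vus` throttles itself when the system slows down (each VU waits for its response), while `constant-arrival-rate` keeps injecting requests regardless, which is what exposes queueing behavior under overload.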

Both tests were run for 2 different kinds of inputs:

  • 5000 ShareGPT prompts randomly selected (variable token length)
  • 5000 ShareGPT prompts truncated to 500 tokens (constant token length). Token counts use llama-tokenizer
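The constant-length input set requires truncating each prompt to a fixed token budget. A rough sketch of that preparation step, using whitespace-split words as a stand-in for llama-tokenizer (the real test counts model tokens, not words):

```javascript
// Truncate a prompt to a fixed token budget. Here "tokens" are
// whitespace-split words, standing in for llama-tokenizer output.
function truncatePrompt(prompt, maxTokens) {
  const tokens = prompt.split(/\s+/).filter(Boolean);
  return tokens.slice(0, maxTokens).join(' ');
}

console.log(truncatePrompt('a b c d e', 3)); // -> "a b c"
```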

The tests compute the following metrics:

  • Inter token latency: time to generate a new output token for each user querying the system. It translates to the "speed" perceived by the end user. We aim for at least 300 words per minute (average reading speed), so ITL < 150 ms.
  • Time to First Token: time the user has to wait before seeing the first token of the answer. Lower waiting times are essential for real-time interactions, less so for offline workloads.
  • End to End latency: The overall time the system took to generate the full response to the user.
  • Throughput: the number of tokens per second the system can generate across all requests.
  • Successful requests: the number of requests the system was able to honor in the benchmark timeframe.
  • Error rate: The percentage of requests that ended up in error, as the system could not process them in time or failed to process them.
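The metrics above can all be derived from per-request timestamps and token counts. A simplified sketch (record field names and the sample values are illustrative; times are in seconds):

```javascript
// Compute benchmark metrics from per-request records.
// Each record: { start, firstToken, end, outputTokens, ok }
function computeMetrics(requests, benchDuration) {
  const ok = requests.filter(r => r.ok);
  const sum = xs => xs.reduce((a, b) => a + b, 0);
  const avg = xs => sum(xs) / xs.length;
  return {
    // Time to First Token: wait before the first output token arrives.
    ttft: avg(ok.map(r => r.firstToken - r.start)),
    // Inter Token Latency: time per token after the first one.
    itl: avg(ok.map(r => (r.end - r.firstToken) / (r.outputTokens - 1))),
    // End-to-end latency of the full response.
    e2e: avg(ok.map(r => r.end - r.start)),
    // Aggregate tokens/s over the whole benchmark window.
    throughput: sum(ok.map(r => r.outputTokens)) / benchDuration,
    successful: ok.length,
    // Share of requests that failed or timed out.
    errorRate: (requests.length - ok.length) / requests.length,
  };
}

const m = computeMetrics([
  { start: 0, firstToken: 0.5, end: 2.5, outputTokens: 21, ok: true },
  { start: 0, firstToken: 1.0, end: 1.0, outputTokens: 1, ok: false },
], 60);
console.log(m.ttft, m.itl, m.errorRate); // -> 0.5 0.1 0.5
```

Note that latency metrics are averaged over successful requests only, while error rate is computed over all requests issued in the window.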

At the end of the test, it produces charts showing, on the same plot:

  • Results for TGI at current commit
  • Results for TGI at previous commit (if any)
  • Results for TGI at last release tag (if any)

Results are added to https://github.com/huggingface/text-generation-inference/issues/2235

It relies on workflow run artifacts to gather previous results (artifacts have a 90-day TTL).

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [X] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Hugoch avatar Jul 11 '24 09:07 Hugoch

🚀 Load test results are in:

Variable length prompts

Constant length prompts

github-actions[bot] avatar Jul 12 '24 20:07 github-actions[bot]

🚀 Load test results are in:

Variable length prompts

Constant length prompts

github-actions[bot] avatar Jul 12 '24 23:07 github-actions[bot]

One small comment is that this is still quite gnarly for others to use and run on their own machines. And tbh that's also because some things (like k6-sse) don't have very good DX.

I think that's cool for now, but just good to be aware of 👍

Yeah, I agree. It would be easier to have everything containerized, but then we would need to mount the Docker socket into the container to be able to spawn the TGI Docker image from there, or do some kind of Docker-in-Docker.

Hugoch avatar Jul 17 '24 08:07 Hugoch

My reviews were not sent for at least a month, nice :(

Narsil avatar Aug 29 '24 14:08 Narsil

Closing as stale (more coming with new benchmarking tools!)

Narsil avatar Oct 01 '24 14:10 Narsil