Jack Butler
Fixes #990. I've refactored the tokenizer configuration details out of the `get_tokenizer` function. This should hopefully:

* Make it easier to add new tokenizers without changing the logic in `get_tokenizer`...
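To illustrate the shape of such a refactor (the registry name and per-model settings below are hypothetical, not the PR's actual code), the idea is to move model-specific details into a lookup table so that adding a tokenizer is just a new entry and `get_tokenizer` itself never changes:

```python
from transformers import AutoTokenizer

# Hypothetical registry: model-specific tokenizer settings live here
# instead of inside get_tokenizer, so new models are just new entries.
TOKENIZER_CONFIGS = {
    "distilgpt2": {"padding_side": "left"},
    "theblackcat102/pythia-12B-dedup-1000": {"padding_side": "left"},
}


def get_tokenizer(model_name: str):
    """Look up any model-specific settings and build the tokenizer."""
    config = TOKENIZER_CONFIGS.get(model_name, {})
    return AutoTokenizer.from_pretrained(model_name, **config)
```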
I recently tried to set up the local inference server for testing different NLG models and found the installation documentation had some missing or incomplete information. We should update this so...
Fixes #1549. I've verified that I can set up the inference-server locally using each method with a fresh Python environment.

* Updates installation option 2 with the necessary requirements.
* Pins `fastapi`...
We want to test how many users the inference-server can serve, and with what response times, on setups with different numbers and types of GPU & CPU devices. On the...
# Overview

We want to test the dockerised inference-server under different stress conditions, such as:

1. Load testing - handling many concurrent users
2. Latency testing - speed of...
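As a starting point for the load and latency cases, a minimal Locust sketch along these lines could drive concurrent simulated users against the dockerised server. The `/work` path is taken from the endpoints mentioned below, but the payload shape is an assumption for illustration, not the server's confirmed schema:

```python
from locust import HttpUser, task, between


class InferenceUser(HttpUser):
    """One simulated user hitting the inference-server."""

    # Each simulated user waits 1-3 seconds between requests.
    wait_time = between(1, 3)

    @task
    def request_work(self):
        # Placeholder request body; adjust to the real /work schema.
        self.client.post("/work", json={"prompt": "Hello, world"})
```

Running `locust -f stress_test.py --host http://localhost:8000` reports response-time percentiles and failure counts per concurrency level, which covers both the load and latency conditions above; Locust's `--master`/`--worker` mode would also map naturally onto the distributed runs described next.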
Deploy distributed tests on a standardised compute setup rather than running them locally.
Perform these distributed tests for different numbers of concurrent users and fix any logical problems that arise from the stress tests.
Expand the distributed tests to the `/work`, `/generate_stream`, and other endpoints in the application. We could potentially also look at endpoints outside of the inference-server.
For example, we can look at scenarios where a user starts multiple short conversations vs. one long continuous conversation.
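A minimal sketch of that comparison, assuming a local server and a guessed `/generate_stream` request schema (the `messages` field is a placeholder):

```python
import httpx

BASE_URL = "http://localhost:8000"  # assumed local inference-server address


def stream_reply(client: httpx.Client, messages: list[str]) -> None:
    """Send one conversation turn and consume the streamed reply."""
    # The request body is a guess at the /generate_stream schema.
    with client.stream(
        "POST", f"{BASE_URL}/generate_stream", json={"messages": messages}
    ) as response:
        for _ in response.iter_text():
            pass  # consume the stream; latency measurement hooks go here


def many_short_conversations(n: int) -> None:
    """Scenario A: n independent one-turn conversations."""
    with httpx.Client() as client:
        for i in range(n):
            stream_reply(client, [f"Short question {i}"])


def one_long_conversation(turns: int) -> None:
    """Scenario B: one conversation with an ever-growing history."""
    history: list[str] = []
    with httpx.Client() as client:
        for i in range(turns):
            history.append(f"Follow-up {i}")
            stream_reply(client, history)
```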
We want to test the performance of different models within the inference-server to understand how it scales with model size, such as:

* [distilgpt2](https://huggingface.co/distilgpt2)
* [pythia-12B](https://huggingface.co/theblackcat102/pythia-12B-dedup-1000)
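A rough way to compare the two, sketched directly against the `transformers` library (single-prompt wall-clock timing only; pythia-12B needs a large GPU, and a real benchmark would go through the inference-server instead):

```python
import time

from transformers import AutoModelForCausalLM, AutoTokenizer


def time_generation(
    model_name: str, prompt: str = "My name is Lewis and I like to"
) -> float:
    """Wall-clock time for one short greedy generation."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=32)
    return time.perf_counter() - start


# pythia-12B will only fit on machines with substantial memory.
for name in ["distilgpt2", "theblackcat102/pythia-12B-dedup-1000"]:
    print(name, f"{time_generation(name):.2f}s")
```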