TGI drops requests when 150 requests are sent continuously at a rate of 5 requests per second on 8x AMD MI300X with Llama 3.1 405B
System Info
TGI Docker Image: ghcr.io/huggingface/text-generation-inference:sha-11d7af7-rocm
Model: meta-llama/Llama-3.1-405B-Instruct
Hardware used:
- CPU: Intel® Xeon® Platinum 8470 2G, 52C/104T, 16GT/s, 105M Cache, Turbo, HT (350W) [x2]
- GPU: AMD MI300X OAM 192GB 750W [x8]
- RAM: 64GB RDIMM, 4800MT/s Dual Rank [x32]
Hardware provided by: hotaisle
Deployed using: dstack
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
Steps to reproduce
- Provision the machine using the Docker image mentioned above.
- Run `text-generation-launcher --port 8000 --num-shard 8 --sharded true --max-concurrent-requests 8192 --max-total-tokens 130000 --max-input-tokens 125000`
- Clone the benchmarking repo to get the `benchmark_serving.py` script.
- Run `pip install aiohttp`
- Run `python benchmark_serving.py --backend tgi --model meta-llama/Llama-3.1-405B-Instruct --dataset-name sonnet --sonnet-input-len 1000 --endpoint /generate_stream --dataset-path="sonnet.txt" --num-prompt=150 --request-rate=5` (a minimal client reproducing this send pattern is sketched after this list).
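For a quick sanity check without the full benchmark harness, the sketch below reproduces the same send pattern: 150 streaming requests against TGI's `/generate_stream` endpoint with Poisson inter-arrival times averaging 5 req/s (which is how `benchmark_serving.py` applies `--request-rate`), counting any non-200 response or broken stream as a dropped request. The host/port, prompt text, and `max_new_tokens` value are placeholder assumptions, not values taken from the sonnet dataset.

```python
# Minimal sketch, assuming TGI is listening on localhost:8000 and
# accepts the standard {"inputs": ..., "parameters": ...} payload.
# Prompts and max_new_tokens are placeholders, not the sonnet dataset.
import asyncio
import random

import aiohttp

ENDPOINT = "http://localhost:8000/generate_stream"  # assumed host/port
NUM_PROMPTS = 150
REQUEST_RATE = 5.0  # average requests per second


async def send_request(session: aiohttp.ClientSession, idx: int) -> bool:
    payload = {
        "inputs": f"Prompt {idx}: recite a sonnet.",  # placeholder prompt
        "parameters": {"max_new_tokens": 150},        # placeholder length
    }
    try:
        async with session.post(ENDPOINT, json=payload) as resp:
            if resp.status != 200:
                return False
            # Drain the SSE stream; a dropped request shows up here as a
            # non-200 status or a connection closed mid-stream.
            async for _ in resp.content:
                pass
            return True
    except aiohttp.ClientError:
        return False


async def main() -> None:
    # Streaming a 405B model can be slow; disable the total timeout.
    timeout = aiohttp.ClientTimeout(total=None)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        tasks = []
        for i in range(NUM_PROMPTS):
            tasks.append(asyncio.create_task(send_request(session, i)))
            # Exponential inter-arrival gaps give a Poisson arrival
            # process averaging REQUEST_RATE req/s.
            await asyncio.sleep(random.expovariate(REQUEST_RATE))
        results = await asyncio.gather(*tasks)
        print(f"successful: {sum(results)}/{NUM_PROMPTS}")


if __name__ == "__main__":
    asyncio.run(main())
```

If all requests succeed, the final line prints `successful: 150/150`; anything less indicates dropped requests at this rate.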
Expected behavior
All 150 requests should complete successfully.