TGI drops requests when 150 requests are sent continuously at a rate of 5 requests per second on 8x AMD MI300X with Llama 3.1 405B
System Info
TGI Docker Image: ghcr.io/huggingface/text-generation-inference:sha-11d7af7-rocm
Model: meta-llama/Llama-3.1-405B-Instruct
Hardware used:
- CPU: Intel® Xeon® Platinum 8470 2G, 52C/104T, 16GT/s, 105M Cache, Turbo, HT (350W) [x2]
- GPU: AMD MI300X OAM 192GB 750W [x8]
- RAM: 64GB RDIMM, 4800MT/s Dual Rank [x32]
Hardware provided by: hotaisle
Deployed using: dstack
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
Steps to reproduce
- Provision the machine using the Docker image mentioned above.
- Run `text-generation-launcher --port 8000 --num-shard 8 --sharded true --max-concurrent-requests 8192 --max-total-tokens 130000 --max-input-tokens 125000`
- Clone the benchmarking repo to get the `benchmark_serving.py` script.
- Run `pip install aiohttp`
- Run `python benchmark_serving.py --backend tgi --model meta-llama/Llama-3.1-405B-Instruct --dataset-name sonnet --sonnet-input-len 1000 --endpoint /generate_stream --dataset-path="sonnet.txt" --num-prompt=150 --request-rate=5` (a minimal client reproducing this send pattern is sketched after this list).
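For a quick sanity check without the full benchmark harness, the sketch below reproduces the same send pattern: 150 streaming requests against TGI's `/generate_stream` endpoint with Poisson inter-arrival times averaging 5 req/s (which is how `benchmark_serving.py` applies `--request-rate`), counting any non-200 response or broken stream as a dropped request. The host/port, prompt text, and `max_new_tokens` value are placeholder assumptions, not values taken from the sonnet dataset.

```python
# Minimal sketch, assuming TGI is listening on localhost:8000 and
# accepts the standard {"inputs": ..., "parameters": ...} payload.
# Prompts and max_new_tokens are placeholders, not the sonnet dataset.
import asyncio
import random

import aiohttp

ENDPOINT = "http://localhost:8000/generate_stream"  # assumed host/port
NUM_PROMPTS = 150
REQUEST_RATE = 5.0  # average requests per second


async def send_request(session: aiohttp.ClientSession, idx: int) -> bool:
    payload = {
        "inputs": f"Prompt {idx}: recite a sonnet.",  # placeholder prompt
        "parameters": {"max_new_tokens": 150},        # placeholder length
    }
    try:
        async with session.post(ENDPOINT, json=payload) as resp:
            if resp.status != 200:
                return False
            # Drain the SSE stream; a dropped request shows up here as a
            # non-200 status or a connection closed mid-stream.
            async for _ in resp.content:
                pass
            return True
    except aiohttp.ClientError:
        return False


async def main() -> None:
    # Streaming a 405B model can be slow; disable the total timeout.
    timeout = aiohttp.ClientTimeout(total=None)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        tasks = []
        for i in range(NUM_PROMPTS):
            tasks.append(asyncio.create_task(send_request(session, i)))
            # Exponential inter-arrival gaps give a Poisson arrival
            # process averaging REQUEST_RATE req/s.
            await asyncio.sleep(random.expovariate(REQUEST_RATE))
        results = await asyncio.gather(*tasks)
        print(f"successful: {sum(results)}/{NUM_PROMPTS}")


if __name__ == "__main__":
    asyncio.run(main())
```

If all requests succeed, the final line prints `successful: 150/150`; anything less indicates dropped requests at this rate.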
Expected behavior
All 150 requests should complete successfully.