
TGI drops requests when 150 requests are sent continuously at a rate of 5 requests per second on 8x AMD MI300X with Llama 3.1 405B


System Info

TGI Docker image: ghcr.io/huggingface/text-generation-inference:sha-11d7af7-rocm

Model: meta-llama/Llama-3.1-405B-Instruct

Hardware used:

  • Intel® Xeon® Platinum 8470 2G, 52C/104T, 16GT/s, 105M Cache, Turbo, HT (350W) [x2]
  • AMD MI300X GPU OAM 192GB 750W [x8]
  • 64GB RDIMM, 4800MT/s Dual Rank [x32]

Hardware provided by: hotaisle

Deployed using: dstack

Information

  • [X] Docker
  • [ ] The CLI directly

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

Steps to reproduce

  1. Provision the machine using the above-mentioned Docker image.
  2. Run: text-generation-launcher --port 8000 --num-shard 8 --sharded true --max-concurrent-requests 8192 --max-total-tokens 130000 --max-input-tokens 125000
  3. Clone the benchmarking repo to obtain the benchmarking script (see the sketch after this list for a minimal equivalent).
  4. pip install aiohttp
  5. Run: python benchmark_serving.py --backend tgi --model meta-llama/Llama-3.1-405B-Instruct --dataset-name sonnet --sonnet-input-len 1000 --endpoint /generate_stream --dataset-path="sonnet.txt" --num-prompt=150 --request-rate=5
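For reference, here is a minimal standalone load generator along the same lines. It is a sketch, not the actual benchmark_serving.py: it assumes TGI's documented /generate_stream JSON schema ({"inputs": ..., "parameters": {...}}), and TGI_URL, the prompt text, and max_new_tokens are illustrative placeholders rather than values from the original benchmark run.

```python
# minimal_load_test.py: sketch of a 5 RPS load test against TGI (not benchmark_serving.py)
import asyncio
import aiohttp

TGI_URL = "http://localhost:8000/generate_stream"  # assumption: port from step 2
NUM_PROMPTS = 150
REQUEST_RATE = 5  # requests per second

async def send_request(session: aiohttp.ClientSession, i: int, results: dict) -> None:
    # Payload follows TGI's /generate_stream schema; prompt and token count are placeholders.
    payload = {
        "inputs": f"Prompt {i}: write a short sonnet.",
        "parameters": {"max_new_tokens": 100},
    }
    try:
        async with session.post(TGI_URL, json=payload) as resp:
            # Drain the SSE stream; a dropped request shows up as a non-200
            # status or a connection error mid-stream.
            async for _ in resp.content:
                pass
            results[i] = resp.status
    except aiohttp.ClientError as exc:
        results[i] = f"error: {exc}"

async def main() -> None:
    results: dict = {}
    timeout = aiohttp.ClientTimeout(total=None)  # long generations; disable total timeout
    async with aiohttp.ClientSession(timeout=timeout) as session:
        tasks = []
        for i in range(NUM_PROMPTS):
            tasks.append(asyncio.create_task(send_request(session, i, results)))
            await asyncio.sleep(1 / REQUEST_RATE)  # pace dispatch at ~5 RPS
        await asyncio.gather(*tasks)
    ok = sum(1 for v in results.values() if v == 200)
    print(f"{ok}/{NUM_PROMPTS} requests succeeded")

if __name__ == "__main__":
    asyncio.run(main())
```

On a healthy deployment this should print 150/150; per this report, some requests are dropped instead.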

Expected behavior

All 150 requests should complete successfully.

Bihan · Oct 11 '24