[Qwen/Qwen2.5-14B-Instruct-GPTQ-Int8] Bad Responses with High Concurrent Requests

Open michaelact opened this issue 11 months ago • 3 comments

System Info

I'm using the ghcr.io/huggingface/text-generation-inference:3.0.1 container image.

Issue Description

Hi everyone!

I'm benchmarking the Qwen/Qwen2.5-14B-Instruct-GPTQ-Int8 model with multiple concurrent requests. When I send 10 concurrent requests, the responses start to contain runs of random characters, as in the example below. With 3 or 5 concurrent requests, the responses are fine.

n\nSumber:\n",
n\nSumber:\n",
n\nSumber:\n",
n\nSumber:\n",
        "<!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n\nSumber:\n",
n\nSumber:\n",
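For anyone trying to count how often this happens during a benchmark, here is a minimal sketch of a check for this kind of output. The helper name `looks_degenerate` and the threshold of 20 repeated characters are assumptions chosen for illustration; they are not part of TGI.

```python
import re

# Hypothetical heuristic: one punctuation/symbol character repeated
# 20 or more times in a row. The threshold is an arbitrary assumption.
_DEGENERATE_RUN = re.compile(r"([^\w\s])\1{19,}")


def looks_degenerate(text: str) -> bool:
    """Return True if text contains 20+ copies of the same symbol in a row."""
    return _DEGENERATE_RUN.search(text) is not None
```

Note that legitimate output such as a long Markdown rule (`-----...`) would also trip this check, so it is only a rough filter.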

Information

  • [X] Docker
  • [X] The CLI directly

Tasks

  • [X] An officially supported command
  • [X] My own modifications

Reproduction

Send 10 concurrent requests to the inference server.
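A minimal sketch of such a client, using only the Python standard library, might look like the following. It assumes the server is reachable at `http://localhost:8080` (the compose file below does not show a port mapping, so adjust as needed), that the `/generate` endpoint is used, and that the API key is passed as a Bearer token. The function names are hypothetical.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def generate(prompt, base_url="http://localhost:8080", api_key=None,
             max_new_tokens=256):
    """Send one request to the /generate endpoint and return the text.

    base_url and max_new_tokens are assumptions, not values from this report.
    """
    body = json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    req = urllib.request.Request(f"{base_url}/generate", data=body,
                                 headers=headers, method="POST")
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["generated_text"]


def run_concurrent(prompts, send=generate, concurrency=10):
    """Send all prompts with `concurrency` worker threads.

    Results are returned in the same order as the prompts.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(send, prompts))
```

Calling `run_concurrent(prompts, concurrency=10)` with ten prompts should approximate the failing case, and `concurrency=3` the working one. The `send` parameter is there so the fan-out logic can be exercised without a running server.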

docker-compose.yml

services:
  text-generation-inference-1:
    image: ghcr.io/huggingface/text-generation-inference:3.0.1
    container_name: text-generation-inference-1
    command: >
      --model-id Qwen/Qwen2.5-14B-Instruct-GPTQ-Int8
      --api-key ${TGI_API_KEY}
      --max-total-tokens 8544
      --max-input-length 7520
      --trust-remote-code
      --num-shard 1
      --sharded false
      --max-top-n-tokens 1
      --max-best-of 1
      --max-stop-sequences 1
      --validation-workers 1
      --max-concurrent-requests 512
      --json-output
    security_opt:
      - label=disable
    environment:
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    volumes:
      - ./models:/data
      - ./pip-cache:/pip-cache
    env_file:
      - .env
    ipc: 'host'
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]

Expected behavior

When comparing with the meta-llama/Meta-Llama-3.1-8B-Instruct model using similar parameters, I get normal responses even with high concurrent requests. I expect that the Qwen/Qwen2.5-14B-Instruct-GPTQ-Int8 model should also handle high concurrency well without producing random characters.

I have also confirmed that the Qwen model performs well with high concurrency in vLLM.

Could anyone suggest fixes or experiments to try so that responses stay correct under high concurrency? Any help would be greatly appreciated. Thank you!

michaelact avatar Jan 09 '25 12:01 michaelact

Same problem here. Any help would be appreciated!

hancheng19 avatar Jan 10 '25 02:01 hancheng19

I'm having the same issue with r1-distilled-qwen-7b on H100s. Weirdly, running it with version 3.0.1 works perfectly fine.

trofleb avatar Mar 17 '25 10:03 trofleb

I am having a similar issue with TGI. Inference works great when I process a single example at a time on a SageMaker endpoint, but when I send multiple requests for load testing, the responses get destroyed. My guess is that this is a batch padding issue, because when I implement batching locally with my own collate function it works perfectly.

I have tried TGI 3.0.1 and 3.1.1.
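For readers unfamiliar with the term, a collate function of the kind mentioned here could be sketched roughly as follows. This is a hypothetical illustration in plain Python, not the code used above and not how TGI batches requests internally. It left-pads token-id lists, which is a common choice for decoder-only generation, and `pad_id=0` is an assumed placeholder.

```python
def collate_left_pad(sequences, pad_id=0):
    """Left-pad token-id lists to a common length and build attention masks.

    pad_id is a placeholder; a real tokenizer defines its own padding id.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = max_len - len(seq)
        # Padding goes on the left so the real tokens end at the same position.
        input_ids.append([pad_id] * pad + list(seq))
        # 0 marks padding positions, 1 marks real tokens.
        attention_mask.append([0] * pad + [1] * len(seq))
    return input_ids, attention_mask
```

If padded positions were ever attended to (a wrong or missing mask), output that is fine for single requests could degrade once requests of different lengths are batched together, which is the kind of failure being guessed at here.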

atastats avatar Mar 20 '25 21:03 atastats