text-generation-inference
chat API doesn't support/respect `n` parameter
System Info
Docker container: ghcr.io/huggingface/text-generation-inference:3.0.0
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
Run the TGI Docker container:

```shell
model="Qwen/Qwen2.5-72B-Instruct"
volume=/media/data_drive_0  # share a volume with the Docker container to avoid downloading weights every run
token=$HF_TOKEN
shards=4

docker run --gpus all --shm-size 1g \
  -e HUGGING_FACE_HUB_TOKEN=${token} \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  -p 8084:80 \
  -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:3.0.0 \
  --model-id ${model} \
  --huggingface-hub-cache /data/.cache/huggingface/hub \
  --validation-workers ${shards} \
  --num-shard ${shards} \
  --sharded true \
  --max-input-length 32000 \
  --max-total-tokens 32768 \
  --rope-scaling dynamic \
  --rope-factor 1 \
  --cuda-memory-fraction 0.8 \
  --dtype bfloat16
```
Send a request with `"n": 3`:

```shell
curl localhost:8084/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tgi",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is deep learning?"}
    ],
    "stream": false,
    "max_tokens": 20,
    "n": 3
  }'
```

The response contains only a single entry in `choices`:

```json
{"object":"chat.completion","id":"","created":1736981209,"model":"Qwen/Qwen2.5-72B-Instruct","system_fingerprint":"3.0.0-sha-8f326c9","choices":[{"index":0,"message":{"role":"assistant","content":"Deep learning is a subset of machine learning, which in turn is a subset of artificial intelligence (AI"},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":24,"completion_tokens":20,"total_tokens":44}}
```
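The mismatch can be checked programmatically; a minimal sketch (the helper name `num_choices` and the truncated sample payload are illustrative, based on the response above):

```python
import json

def num_choices(response: dict) -> int:
    """Count the completions in an OpenAI-style chat completion response."""
    return len(response.get("choices", []))

# Truncated version of the response body shown above.
observed = json.loads("""
{
  "object": "chat.completion",
  "model": "Qwen/Qwen2.5-72B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Deep learning is a subset of machine learning..."},
      "logprobs": null,
      "finish_reason": "length"
    }
  ]
}
""")

requested_n = 3
print(f"requested n={requested_n}, got {num_choices(observed)} choice(s)")
# -> requested n=3, got 1 choice(s)
```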
Expected behavior
When `n` > 1, the chat API is expected to return `n` entries in the `choices` array, matching the OpenAI Chat Completions API. Instead, the `n` parameter is silently ignored and a single choice is returned.
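For reference, a conforming response to the request above would look roughly like this (content strings are illustrative, not actual model output):

```json
{
  "object": "chat.completion",
  "model": "Qwen/Qwen2.5-72B-Instruct",
  "choices": [
    {"index": 0, "message": {"role": "assistant", "content": "Deep learning is a subset of machine learning..."}, "finish_reason": "length"},
    {"index": 1, "message": {"role": "assistant", "content": "Deep learning refers to neural networks with many..."}, "finish_reason": "length"},
    {"index": 2, "message": {"role": "assistant", "content": "At a high level, deep learning is a family of..."}, "finish_reason": "length"}
  ]
}
```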