text-generation-inference
chat API doesn't support/respect `n` parameter
System Info
Docker container: ghcr.io/huggingface/text-generation-inference:3.0.0
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
Run the TGI Docker container:

```shell
model="Qwen/Qwen2.5-72B-Instruct"
volume=/media/data_drive_0  # share a volume with the Docker container to avoid downloading weights every run
token=$HF_TOKEN
shards=4

docker run --gpus all --shm-size 1g \
  -e HUGGING_FACE_HUB_TOKEN=${token} \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  -p 8084:80 \
  -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:3.0.0 \
  --model-id ${model} \
  --huggingface-hub-cache /data/.cache/huggingface/hub \
  --validation-workers ${shards} \
  --num-shard ${shards} \
  --sharded true \
  --max-input-length 32000 \
  --max-total-tokens 32768 \
  --rope-scaling dynamic \
  --rope-factor 1 \
  --cuda-memory-fraction 0.8 \
  --dtype bfloat16
```
Send a request with `"n": 3`:

```shell
curl localhost:8084/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tgi",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is deep learning?"}
    ],
    "stream": false,
    "max_tokens": 20,
    "n": 3
  }'
```

The response contains only a single entry in `choices`:

```json
{"object":"chat.completion","id":"","created":1736981209,"model":"Qwen/Qwen2.5-72B-Instruct","system_fingerprint":"3.0.0-sha-8f326c9","choices":[{"index":0,"message":{"role":"assistant","content":"Deep learning is a subset of machine learning, which in turn is a subset of artificial intelligence (AI"},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":24,"completion_tokens":20,"total_tokens":44}}
```
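The mismatch can be checked programmatically; a minimal sketch (the helper name `num_choices` and the truncated sample payload are illustrative, based on the response above):

```python
import json

def num_choices(response: dict) -> int:
    """Count the completions in an OpenAI-style chat completion response."""
    return len(response.get("choices", []))

# Truncated version of the response body shown above.
observed = json.loads("""
{
  "object": "chat.completion",
  "model": "Qwen/Qwen2.5-72B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Deep learning is a subset of machine learning..."},
      "logprobs": null,
      "finish_reason": "length"
    }
  ]
}
""")

requested_n = 3
print(f"requested n={requested_n}, got {num_choices(observed)} choice(s)")
# -> requested n=3, got 1 choice(s)
```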
Expected behavior
When `n` > 1, the chat API is expected to return `n` entries in the `choices` array, matching the OpenAI Chat Completions API. Instead, the `n` parameter is silently ignored and a single choice is returned.
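For reference, a conforming response to the request above would look roughly like this (content strings are illustrative, not actual model output):

```json
{
  "object": "chat.completion",
  "model": "Qwen/Qwen2.5-72B-Instruct",
  "choices": [
    {"index": 0, "message": {"role": "assistant", "content": "Deep learning is a subset of machine learning..."}, "finish_reason": "length"},
    {"index": 1, "message": {"role": "assistant", "content": "Deep learning refers to neural networks with many..."}, "finish_reason": "length"},
    {"index": 2, "message": {"role": "assistant", "content": "At a high level, deep learning is a family of..."}, "finish_reason": "length"}
  ]
}
```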