text-generation-inference
Regression in 2.4.0: Input Validation errors return code 200 and do not return the error message
System Info
System:
Linux 4.18.0-553.22.1.el8_10.x86_64 #1 SMP Wed Sep 25 09:20:43 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Rocky Linux 8.10
Hardware:
GPU: NVIDIA A100-SXM4-80GB
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7742 64-Core Processor
Stepping: 0
CPU MHz: 2250.000
CPU max MHz: 2250.0000
CPU min MHz: 1500.0000
Using the text-generation-inference Docker containers with TGI 2.4.0; the issue could not be reproduced with TGI 2.3.1. The issue was reproduced with two models: mistralai/Mistral-7B-Instruct-v0.3 and google/gemma-2b-it.
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
- Run a TGI container
# Podman
podman run --device nvidia.com/gpu=3 \
  -v /data/huggingface/hub:/data \
  -v /lib64/libcuda.so:/lib64/libcuda.so \
  --shm-size 1G \
  -p 0.0.0.0:8158:80 \
  ghcr.io/huggingface/text-generation-inference:2.4.0 \
  --model-id /data/models--mistralai--Mistral-7B-Instruct-v0.3/snapshots/0de8392dcdf23d03ad5239108107dea08ea935ca \
  --max-input-length 512 \
  --max-total-tokens 1024 \
  --max-batch-prefill-tokens 1024 \
  --env --trust-remote-code
# Equivalent using docker
docker run --gpus '"device=3"' \
  -v /data/huggingface/hub:/data \
  -v /lib64/libcuda.so:/lib64/libcuda.so \
  --shm-size 1G \
  -p 0.0.0.0:8158:80 \
  ghcr.io/huggingface/text-generation-inference:2.4.0 \
  --model-id /data/models--mistralai--Mistral-7B-Instruct-v0.3/snapshots/0de8392dcdf23d03ad5239108107dea08ea935ca \
  --max-input-length 512 \
  --max-total-tokens 1024 \
  --max-batch-prefill-tokens 1024 \
  --env --trust-remote-code
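Optionally, sanity-check that the server is up before sending requests. This is a minimal sketch assuming TGI's /health route (which returns 200 once the model is loaded) is reachable on the mapped port:
# Poll readiness; prints the HTTP status code of /health
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8158/health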
- Make a streaming chat completion request with an invalid argument that will fail validation before any generation occurs. Example: temperature = -1 or max_tokens > MAX_TOTAL_TOKENS (a variant for the latter is sketched after the request below).
curl -X 'POST' \
'http://localhost:8158/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"max_tokens": 32,
"messages": [
{
"role": "user",
"content": "What is Deep Learning?"
}
],
"model": "mistralai/Mistral-7B-Instruct-v0.2",
"stream": true,
"temperature":-1
}'
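The same failure mode can be triggered via the other invalid argument named above; a variant of the same request, assuming the --max-total-tokens 1024 launch setting from the first step:
# Alternative trigger: max_tokens (2048) exceeds --max-total-tokens (1024)
curl -X 'POST' \
  'http://localhost:8158/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "max_tokens": 2048,
  "messages": [
    {
      "role": "user",
      "content": "What is Deep Learning?"
    }
  ],
  "model": "mistralai/Mistral-7B-Instruct-v0.2",
  "stream": true
}'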
- Response code is 200 and the response body only contains
data: [DONE]
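To confirm the status code independently of the SSE body, one can ask curl to print only the response code (same payload as above):
# Print only the HTTP status code; on 2.4.0 this prints 200 despite the invalid temperature
curl -s -o /dev/null -w '%{http_code}\n' -X POST 'http://localhost:8158/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d '{"max_tokens": 32, "messages": [{"role": "user", "content": "What is Deep Learning?"}], "model": "mistralai/Mistral-7B-Instruct-v0.2", "stream": true, "temperature": -1}'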
Expected behavior
The response code should be 422 and the response body should at least contain an error message. In TGI 2.3.1, this is the body returned for the same request on the same model:
data: {"error":"Input validation error: `temperature` must be strictly positive","error_type":"validation"}
data: [DONE]
The current 2.4.0 behaviour is a regression.