text-generation-inference
Regression in 2.4.0: Input Validation errors return code 200 and do not return the error message
System Info
System:
Linux 4.18.0-553.22.1.el8_10.x86_64 #1 SMP Wed Sep 25 09:20:43 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Rocky Linux 8.10
Hardware:
GPU: NVIDIA A100-SXM4-80GB
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7742 64-Core Processor
Stepping: 0
CPU MHz: 2250.000
CPU max MHz: 2250.0000
CPU min MHz: 1500.0000
Using the text-generation-inference Docker containers with TGI 2.4.0; the issue could not be reproduced with TGI 2.3.1. The issue was reproduced with two models: mistralai/Mistral-7B-Instruct-v0.3 and google/gemma-2b-it.
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
- Run a TGI container
# Podman
podman run --device nvidia.com/gpu=3 \
  -v /data/huggingface/hub:/data \
  -v /lib64/libcuda.so:/lib64/libcuda.so \
  --shm-size 1G \
  -p 0.0.0.0:8158:80 \
  ghcr.io/huggingface/text-generation-inference:2.4.0 \
  --model-id /data/models--mistralai--Mistral-7B-Instruct-v0.3/snapshots/0de8392dcdf23d03ad5239108107dea08ea935ca \
  --max-input-length 512 \
  --max-total-tokens 1024 \
  --max-batch-prefill-tokens 1024 \
  --env --trust-remote-code
# Equivalent using docker
docker run --gpus '"device=3"' \
  -v /data/huggingface/hub:/data \
  -v /lib64/libcuda.so:/lib64/libcuda.so \
  --shm-size 1G \
  -p 0.0.0.0:8158:80 \
  ghcr.io/huggingface/text-generation-inference:2.4.0 \
  --model-id /data/models--mistralai--Mistral-7B-Instruct-v0.3/snapshots/0de8392dcdf23d03ad5239108107dea08ea935ca \
  --max-input-length 512 \
  --max-total-tokens 1024 \
  --max-batch-prefill-tokens 1024 \
  --env --trust-remote-code
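Optionally, sanity-check that the server is up before sending requests. This is a minimal sketch assuming TGI's /health route (which returns 200 once the model is loaded) is reachable on the mapped port:
# Poll readiness; prints the HTTP status code of /health
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8158/health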
- Make a streaming chat completion request with an invalid argument that will fail validation before any generation occurs. Example: temperature = -1 or max_tokens > MAX_TOTAL_TOKENS (a variant for the latter is sketched after the request below).
curl -X 'POST' \
'http://localhost:8158/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"max_tokens": 32,
"messages": [
{
"role": "user",
"content": "What is Deep Learning?"
}
],
"model": "mistralai/Mistral-7B-Instruct-v0.2",
"stream": true,
"temperature":-1
}'
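The same failure mode can be triggered via the other invalid argument named above; a variant of the same request, assuming the --max-total-tokens 1024 launch setting from the first step:
# Alternative trigger: max_tokens (2048) exceeds --max-total-tokens (1024)
curl -X 'POST' \
  'http://localhost:8158/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "max_tokens": 2048,
  "messages": [
    {
      "role": "user",
      "content": "What is Deep Learning?"
    }
  ],
  "model": "mistralai/Mistral-7B-Instruct-v0.2",
  "stream": true
}'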
- Response code is 200 and the response body only contains
data: [DONE]
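To confirm the status code independently of the SSE body, one can ask curl to print only the response code (same payload as above):
# Print only the HTTP status code; on 2.4.0 this prints 200 despite the invalid temperature
curl -s -o /dev/null -w '%{http_code}\n' -X POST 'http://localhost:8158/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d '{"max_tokens": 32, "messages": [{"role": "user", "content": "What is Deep Learning?"}], "model": "mistralai/Mistral-7B-Instruct-v0.2", "stream": true, "temperature": -1}'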
Expected behavior
The response code should be 422 and the response body should at least contain an error message. In TGI 2.3.1, this is the body returned for the same request on the same model:
data: {"error":"Input validation error: `temperature` must be strictly positive","error_type":"validation"}
data: [DONE]
The current 2.4.0 behaviour is a regression.