
Responses with unusual content.

Open ncthanhcs opened this issue 1 year ago • 5 comments

System Info

text generation inference api

Information

  • [ ] Docker
  • [ ] The CLI directly

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

I'm using the Inference API at https://api-inference.huggingface.co/v1/chat/completions with the nvidia/Llama-3.1-Nemotron-70B-Instruct-HF model.

I use the same message with the role of "user," and the model produces different results. Most of the time, the model provides normal answers, but occasionally it generates responses with strange content.
(Screenshot of one of the strange responses.)

I temporarily stopped calling the API for a short period. After that, I called the API again with the same message used previously, and the model returned a normal response.
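For reference, the call pattern looks roughly like this (a minimal sketch using the requests library; it assumes a Hugging Face token in the HF_TOKEN environment variable, and the prompt is a placeholder):

import os
import requests

API_URL = "https://api-inference.huggingface.co/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}  # assumed token location

payload = {
    "model": "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
    "messages": [{"role": "user", "content": "<the same user message every time>"}],
}

# Send the identical request several times: most responses look normal,
# but occasionally one comes back with strange content.
for i in range(10):
    resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=120)
    resp.raise_for_status()
    print(i, resp.json()["choices"][0]["message"]["content"][:200])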

Expected behavior

Is this issue caused by the model? Is there any way to prevent the model from generating such strange responses?

ncthanhcs avatar Dec 30 '24 07:12 ncthanhcs

I experienced the same behaviour with the Inference API: when there are many parallel requests, the model starts generating complete rubbish. After a restart it works normally again. For me, 32 parallel requests is the maximum before the model starts spitting out rubbish. This should not happen, of course.
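Roughly, the load that triggers it looks like this (a sketch, not the production client; the endpoint URL, model name and prompt are placeholders):

import concurrent.futures
import requests

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # placeholder endpoint URL

PAYLOAD = {
    "model": "tgi",  # placeholder model name for the OpenAI-compatible schema
    "messages": [{"role": "user", "content": "<any prompt>"}],
}

def one_request(_):
    resp = requests.post(ENDPOINT, json=PAYLOAD, timeout=300)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Around 32 requests in flight is where the corrupted output starts to appear for me.
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    for text in pool.map(one_request, range(64)):
        print(text[:120].replace("\n", " "))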

maiiabocharova avatar Jan 07 '25 16:01 maiiabocharova

I experienced the same issue with standard Llama models from Meta as well (3.1 70B Instruct and 3.3 70B Instruct). These models are hosted in our corporate infrastructure and usually receive 3–4k requests (and 2–3M input tokens) per hour, which doesn't seem like that much. In fact, I've never seen more than 5 running requests per second on each model. I'm using TGI 3.0.1 with H100 and H100 NVL GPUs.

luonist avatar Jan 09 '25 13:01 luonist

Hi, in our case we are experiencing the same issue with the Llama 3.3 70B Instruct model. Some more details: we have set up two endpoints of the same model, each with exactly the same infrastructure setup.

Runtime environment:

  • Kubernetes Cluster deployment
  • 4 A100 GPUs with 80 GB of memory each
  • 12 CPUs with 32 GB of RAM each
  • TGI version: 3.0.0

TGI Config

/info output (both endpoints return exactly the same output):

{
  "model_id": "/model_data/llama3-3-70b",
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_tokens": 131071,
  "max_total_tokens": 131072,
  "validation_workers": 4,
  "max_client_batch_size": 4,
  "router": "text-generation-router",
  "version": "3.0.0",
  "sha": "8f326c97912a20d20cddb6ba61bf1569fa9a8601",
  "docker_label": "sha-8f326c9"
}
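
A small sketch of how the two /info outputs can be compared (the endpoint URLs are placeholders):

import requests

# Placeholder URLs for the two identically configured endpoints.
ENDPOINTS = ["http://llama-endpoint-a:8080", "http://llama-endpoint-b:8080"]

info_a, info_b = (requests.get(f"{url}/info", timeout=30).json() for url in ENDPOINTS)

# Both endpoints are expected to report the same model, version and limits.
for key in sorted(set(info_a) | set(info_b)):
    a, b = info_a.get(key), info_b.get(key)
    print(f"{key}: {a}" + ("" if a == b else f"  <-- differs: {b}"))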

What we are observing

We have two endpoints with exactly the same configuration. We call both endpoints at /v1/chat/completions, filling in only the messages parameter and leaving all other parameters at their default values.

At first, both endpoints behaved normally. But after a certain point in time, one of them started to produce strange outputs (see the output examples below): a lot of nonsense, continuing until it reaches the finish reason "length", while the other produces normal outputs for the same calls (same input, same parameters, same seed).
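A sketch of the kind of A/B call that shows the divergence (the endpoint URLs and the model name are placeholders; apart from the fixed seed, parameters are left at their defaults):

import requests

ENDPOINTS = {
    "endpoint_a": "http://llama-endpoint-a:8080",  # placeholder URLs
    "endpoint_b": "http://llama-endpoint-b:8080",
}

payload = {
    "model": "tgi",  # placeholder model name for the OpenAI-compatible schema
    "messages": [{"role": "user", "content": "<the same prompt for both endpoints>"}],
    "seed": 42,  # same seed on both sides
}

for name, url in ENDPOINTS.items():
    resp = requests.post(f"{url}/v1/chat/completions", json=payload, timeout=300)
    choice = resp.json()["choices"][0]
    # The degraded endpoint rambles until it hits finish_reason == "length";
    # the healthy one returns a normal answer for the identical request.
    print(name, choice["finish_reason"], repr(choice["message"]["content"][:150]))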

Our hypothesis

One of the endpoints usually has more load (concurrent requests) than the other. Whenever it is handling concurrent requests (fewer than 30 at a time) with long contexts (a large number of input tokens, more than 40–50K), something in TGI seems to start malfunctioning, and the unexpected responses keep appearing until the pod is restarted, regardless of the load or input size of subsequent calls to the malfunctioning model.

Input/output example: two cases, one with a simple input and a weird repeating output, and another with gibberish; both are abnormal behaviours. (Two screenshots attached.)

andresC98 avatar Jan 29 '25 18:01 andresC98

Update: we have tried TGI 3.0.1 as well as 3.1.0, and changing dtype to bfloat16, and the issue persists. We suspect some kind of KV-cache staleness could be going on. The rest of our TGI env config is at its defaults, apart from these variables:

extraInferenceEnvs:
  MAX_BATCH_PREFILL_TOKENS: "4096"
  PREFILL_CHUNKING: "1"
  DTYPE: "bfloat16"

andresC98 avatar Feb 19 '25 09:02 andresC98

Any chance you could test TGI 3.1.1? We fixed two prefix caching edge cases that can lead to long-term corruption.

danieldk avatar Mar 05 '25 08:03 danieldk