Problems with the response of the OpenAI-Compatible Frontend for Triton Inference Server
Hi,
I have installed Triton with the vLLM backend and also the OpenAI-Compatible Frontend for Triton Inference Server (Beta). The model is meta-llama/Llama-3.1-8B-Instruct. Now when I call the endpoint, for example like this:
MODEL="llama-3.1-8b-instruct"
curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "'${MODEL}'",
"messages": [{"role": "user", "content": "Why is the sky blue?"}]
}' | jq
The response is:
{
  "id": "cmpl-276d1a84-a293-11ef-b088-d404e69cb4ea",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The sky appears blue because of a phenomenon called Rayleigh scattering, named after the",
        "tool_calls": null,
        "role": "assistant",
        "function_call": null
      },
      "logprobs": null
    }
  ],
  "created": 1731593837,
  "model": "vllm_model",
  "system_fingerprint": null,
  "object": "chat.completion",
  "usage": null
}
As you can see, the content is cut off. I have played with the config, but I don't know what the problem is. With Python the response is fine:
from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.175.242:9000/v1",
    api_key="EMPTY",
)

model = "vllm_model"
completion = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant.",
        },
        {"role": "user", "content": "Why is the sky blue?"},
    ],
    max_tokens=4096,
)
print(completion.choices[0].message.content)
The response here is:
The sky appears blue due to a phenomenon called Rayleigh scattering. This is a scientific explanation:
- Sunlight and the Atmosphere: When sunlight enters the Earth's atmosphere, it encounters tiny molecules of gases such as nitrogen (N2) and oxygen (O2). These molecules are much smaller than the wavelength of light.
- Scattering of Light: According to the Rayleigh scattering theory, when light travels through the atmosphere, it encounters these tiny molecules. The shorter (blue) wavelengths of light are scattered more than the longer (red) wavelengths. This scattering of light in all directions is what gives the sky its blue color.
- Blue Light Dominates: Due to the scattering effect, the blue light is distributed throughout the atmosphere, reaching our eyes from all directions. As a result, the sky appears blue. This is why we see a blue sky during the daytime.
- Time of Day and Atmospheric Conditions: The color of the sky can change depending on the time of day and atmospheric conditions. During sunrise and sunset, the light has to travel longer distances through the atmosphere, which scatters the shorter wavelengths even more, making the sky appear red or orange. On a cloudy day, the scattered light is blocked, making the sky appear gray or white.
In summary, the sky appears blue due to the scattering of sunlight by the tiny molecules in the atmosphere, with blue light being scattered more than other colors.
My model.json is:
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "disable_log_requests": true,
  "gpu_memory_utilization": 0.9,
  "enforce_eager": true,
  "tensor_parallel_size": 4,
  "max_model_len": 50000
}
I have the same problem with max_model_len set to 4096.
It would be great if someone could help me here.
Hardware: 4 NVIDIA L4 GPUs with 96 GB VRAM.
Thanks 👍
Hi @DimadonDL,
The main difference in your curl request appears to be that max_tokens is not set. I believe the current default max_tokens is 16 (example from vLLM) if left unspecified, which is probably why you're getting such a short response. Can you try setting max_tokens?
MODEL="llama-3.1-8b-instruct"
curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "'${MODEL}'",
"messages": [{"role": "user", "content": "Why is the sky blue?"}],
"max_tokens": 4096
}' | jq
Hi @rmccorm4,
Thanks for the explanation. That makes sense. Is it possible to override the default in vLLM? I have an application where I can't set max_tokens.
Thanks for your help so far. ☺️👌
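One interim workaround I'm considering (just a sketch, not tested: the port, the default value, and the assumption that injecting max_tokens is acceptable for every request are mine, and streaming responses are not handled) is a small proxy in front of the frontend that fills in max_tokens when the client leaves it out:

# Workaround sketch only -- not part of Triton or vLLM.
# Assumes the OpenAI frontend listens on localhost:9000; adjust to your setup.
import httpx
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
UPSTREAM = "http://localhost:9000"   # Triton OpenAI frontend (assumption)
DEFAULT_MAX_TOKENS = 4096            # placeholder default

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    # Inject max_tokens only when the client did not set it.
    body.setdefault("max_tokens", DEFAULT_MAX_TOKENS)
    async with httpx.AsyncClient(timeout=None) as client:
        upstream = await client.post(f"{UPSTREAM}/v1/chat/completions", json=body)
    return JSONResponse(content=upstream.json(), status_code=upstream.status_code)

The application would then point at the proxy's port instead of 9000, but a configurable default in the frontend itself would be much nicer.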
Hi y'all, I wanted to provide a little more weight towards defaulting this to something different.
First, that max_tokens parameter is deprecated in the OpenAI API. I guess it was changed to max_completion_tokens ¯\_(ツ)_/¯
But, also, when I use the same model on the base vLLM service, it does not stop producing tokens after 16 tokens. I'll dig through the code a bit later today and see if I can grok how this is happening on their end.
Ahhh, I was able to find this more easily once I looked for max_completion_tokens over in the vLLM repo. This is their default. Would it be possible to make this the default for this frontend as well?
default_max_tokens = self.max_model_len - len(engine_prompt["prompt_token_ids"])
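For comparison, a minimal sketch of what applying that default could look like on the frontend side (the names here are illustrative placeholders, not the actual Triton OpenAI frontend code):

def resolve_max_tokens(request: dict, prompt_token_ids: list, max_model_len: int) -> int:
    # Mirror vLLM's behaviour: if the client sets neither max_completion_tokens
    # nor max_tokens, fall back to the remaining context window.
    requested = request.get("max_completion_tokens") or request.get("max_tokens")
    default_max_tokens = max_model_len - len(prompt_token_ids)
    return requested if requested is not None else default_max_tokens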
Hi, does the OpenAI frontend also work when I use the Python backend for Triton Inference Server?