
tgi server :: tool_choice="auto" behaves like tool_choice="required" from OpenAI spec

mottoslo opened this issue · 7 comments

System Info

tgi version: 2.3.0
model: Meta-Llama-3-8B-Instruct

Information

  • [X] Docker
  • [ ] The CLI directly

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

0. Tool definition used for reproduction

weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a specified city with specified measure",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, always seoul"
                },
                "format": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The temperature unit to use."
                }
            },
            "required": ["location", "format"]
        }
    }
}

1. Using OpenAI with tool_choice="auto"

api_key="[OPENAI_API_KEY]"

client = OpenAI(
    api_key=api_key
)

messages = [
    {
        "role": "system",
        "content": "You're a helpful assistant! Use tools if necessary",
    },
    {
        "role": "user",
        "content": "just respond with a warm greeting"
    }
]


chat_response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=[weather_tool],
    tool_choice="auto",
    stream=False
)
print(chat_response.choices[0].message)
ChatCompletionMessage(content='Hello there! 🌞 How are you today?', refusal=None, role='assistant', function_call=None, tool_calls=None)

=> responds with a normal chat message, since the prompt does not call for a tool call

2. Using tgi with tool_choice="auto" (model = llama)

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8080/v1/",
    api_key="dummy_key"
)

messages = [
    {
        "role": "system",
        "content": "You're a helpful assistant! Use tools if necessary",
    },
    {
        "role": "user",
        "content": "just respond with a warm greeting"
    }
]


chat_response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=[weather_tool],
    tool_choice="auto",
    stream=False
)
print(chat_response.choices[0].message)
ChatCompletionMessage(content=None, refusal=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='0', function=Function(arguments={'format': 'celsius', 'location': 'Seoul'}, name='get_current_weather', description=None), type='function')])

=> calls the tool anyway, even though the prompt does not require it

3. Using OpenAI with tool_choice="required"

api_key="[OPENAI_API_KEY]"
client = OpenAI(
    api_key=api_key
)


messages = [
    {
        "role": "system",
        "content": "You're a helpful assistant! Use tools if necessary",
    },
    {
        "role": "user",
        "content": "just respond with a warm greeting"
    }
]


chat_response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=[weather_tool],
    tool_choice="required",
    stream=False
)
print(chat_response.choices[0].message)
ChatCompletionMessage(content=None, refusal=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_0ZyaXEb9hIIQbJybYNlPjRVe', function=Function(arguments='{"location": "seoul", "format": "celsius"}', name='get_current_weather'), type='function')])

=> forces a tool call, as "required" is documented to do; this matches what tgi returned for "auto" in step 2

Expected behavior

When consuming tgi, I expect the server to be able to respond both with and without a tool_call when provided with tool definitions. As of now, the application needs to know in advance whether a tool call is required before calling tgi, which in my opinion is not something LLM applications should have to do.
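For illustration, this is roughly the caller-side pattern that tool_choice="auto" is meant to enable. A minimal sketch, assuming get_current_weather is a hypothetical function implemented by the application:

import json

# One code path handles both outcomes, without knowing in advance which one
# the model will pick.
message = chat_response.choices[0].message

if message.tool_calls:
    for tool_call in message.tool_calls:
        args = tool_call.function.arguments
        if isinstance(args, str):
            # OpenAI returns arguments as a JSON string; the tgi output in
            # step 2 above shows a dict, so handle both.
            args = json.loads(args)
        if tool_call.function.name == "get_current_weather":
            result = get_current_weather(**args)
else:
    # The model chose a plain text answer. With tgi's current "auto"
    # behavior, this branch is never reached.
    print(message.content)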

I am curious whether the above behavior is intended. Someone has raised this before (https://github.com/huggingface/text-generation-inference/pull/1587#issuecomment-1979185339), but it was never addressed.

Maybe something can be done with the tool prompt, the ToolType enum, and the chat_completions logic in server.rs? If this behavior is not intended and needs fixing, I would love to give it a shot! Thank you :)
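In the meantime, one possible client-side stopgap (just a sketch of the idea, not a tgi feature; reply_in_text is an invented name) is to register an extra no-op tool that stands for "answer in plain text" and unwrap it in the application:

# Since tgi's "auto" always forces a tool call, give the model an explicit
# escape hatch and translate it back into an ordinary text reply.
no_tool = {
    "type": "function",
    "function": {
        "name": "reply_in_text",
        "description": "Use this when no other tool applies; answer the user directly.",
        "parameters": {
            "type": "object",
            "properties": {
                "answer": {"type": "string", "description": "The plain-text reply."}
            },
            "required": ["answer"]
        }
    }
}

chat_response = client.chat.completions.create(
    model="tgi",
    messages=messages,
    tools=[weather_tool, no_tool],
    tool_choice="auto",
    stream=False
)

message = chat_response.choices[0].message
if message.tool_calls and message.tool_calls[0].function.name == "reply_in_text":
    # tgi returns arguments as a dict (see the output in step 2 above)
    message_text = message.tool_calls[0].function.arguments["answer"]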

mottoslo · Sep 23 '24

Gentle ping @drbh: is this issue being handled internally? Any feedback would be great!

mottoslo · Oct 21 '24

I am running into this issue as well. I am not knowledgeable enough in Rust to deal with this, but I would very much appreciate it if you took this on, @mottoslo!

Simon-Stone · Oct 24 '24

I think handling this issue may involve (breaking) feature changes and needs to be discussed beforehand, so I am not sure where to start. However, some pull requests have been opened since (https://github.com/huggingface/text-generation-inference/pull/2645, https://github.com/huggingface/text-generation-inference/pull/2614, ...) that seem related, so I assume there is internal consensus on how things should be done?

mottoslo · Oct 25 '24

Either way, it would be a huge improvement. As it stands, we can't easily build agents on top of models deployed with TGI because of this, at least not using the Messages API. I tried manually applying the chat template and using the generate endpoint, and the model then appears to be able to choose not to use a tool. The downside is that manual chat template handling makes it much harder to integrate with existing frameworks. Being able to use TGI as a drop-in replacement for OpenAI models would be fantastic.
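For reference, that workaround looks roughly like this. A sketch that assumes a transformers version whose apply_chat_template accepts a tools argument and a model whose chat template actually renders tool definitions; the URL and generation parameters are illustrative:

import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Render the tool definitions into the prompt ourselves instead of relying
# on the Messages API and its forced tool grammar.
prompt = tokenizer.apply_chat_template(
    messages,
    tools=[weather_tool],
    add_generation_prompt=True,
    tokenize=False,
)

# Hit the raw generate endpoint; the model is now free to answer in plain
# text or to emit a tool call as ordinary tokens.
resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={"inputs": prompt, "parameters": {"max_new_tokens": 256}},
)
print(resp.json()["generated_text"])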

Simon-Stone · Oct 25 '24

#2614 has been merged into main and is part of the latest release. Has anyone already had a chance to test if this solves the issue?

Simon-Stone · Nov 02 '24

Hey guys, I've seen the PRs related to this that were in the recent release, but it doesn't look like they fixed the issue for me. I'm using HuggingFaceEndpoint wrapped inside ChatHuggingFace to do tool calling with langchain, and I'm finding that the LLM will always call the one tool and never stops to respond once it has the information it needs. So it appears to me that it's still behaving as if tool_choice='required'. Has anyone else had any success with this, or is experiencing the same?
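For context, the setup in question is roughly the following (a sketch; the endpoint URL is illustrative, and the class names are from the langchain-huggingface package):

from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint

llm = HuggingFaceEndpoint(endpoint_url="http://127.0.0.1:8080")
chat = ChatHuggingFace(llm=llm)

# bind_tools forwards the tool schema to the Messages API under the hood
chat_with_tools = chat.bind_tools([weather_tool])

result = chat_with_tools.invoke("just respond with a warm greeting")
# Observed behavior: tool_calls is always populated; the model never falls
# back to a plain text answer.
print(result.tool_calls)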

As a general side note, I find using langchain for anything absolutely horrific, but it's the easiest way out there. I would have preferred to suffer less and use the OpenAI classes from langchain for this (rather than the ChatHuggingFace and HuggingFaceEndpoint classes), but it seems that TGI's API is not yet fully aligned with OpenAI's. Cheers.

Johnno1011 · Nov 26 '24

I guess this still hasn't been fixed? There are now multiple issues that all essentially come back to this.

Simon-Stone · Mar 05 '25