Function/tool calling never resolves
Description
When using the inference client with function calling, models seem to never resolve their calls.
As we know, with the typical OpenAI pattern, the simplest function/tool call is a series of messages of various roles (system, user, assistant, tool) organized like this:
system → user ("what's the weather?") → assistant (tool_calls) → tool (result: "4ºC") → assistant (content: "it's 4ºC")
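For illustration, a minimal sketch of that message sequence as a Python list (the tool name, arguments, and result are placeholders, not from any specific API):

```python
# Minimal OpenAI-style tool-calling exchange; names and values are illustrative.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in Philadelphia?"},
    {
        # The model requests a tool call.
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "0",
            "type": "function",
            "function": {
                "name": "get_current_temperature",
                "arguments": '{"location": "Philadelphia, PA, US", "unit": "Celsius"}',
            },
        }],
    },
    {
        # The caller runs the function and reports the result.
        "role": "tool",
        "tool_call_id": "0",
        "content": "4",
    },
    # Expected final turn from the model:
    # {"role": "assistant", "content": "It's 4ºC in Philadelphia."}
]
```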
The HF docs seem to indicate this is the same pattern, although the messages have some minor differences (e.g. description: null, which never happens with OpenAI). When using the Python inference client, these tool_calls never resolve even after functions are called and their return values are included and seemingly properly referenced. Instead, they look like this:
system → user ("what's the weather?") → assistant (tool_calls) → tool (result: "4ºC") → assistant (tool_calls) …
Instead of returning a text completion, the HF inference client returns another "assistant" message specifying required tool_calls. With OpenAI, once the function calls have been satisfied and no further calls are required, the exchange resolves to a typical "assistant" message with text content.
Models used that exhibit this behavior:
- NousResearch/Hermes-3-Llama-3.1-8B
- Qwen/Qwen2.5-72B-Instruct
- meta-llama/Meta-Llama-3-8B-Instruct
It's worth noting that Mistral models also error out, complaining that a 9-character alphanumeric string is required for the tool_call_id. The models themselves don't provide such IDs, so we need to supply them ourselves, but even when doing so, the same error occurs: the 9-character identifiers are reported as missing. (e.g. mistralai/Mistral-7B-Instruct-v0.3)
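For reference, a sketch of the kind of id generation this implies (the helper name is mine, not part of any client API); even with matching ids on the assistant tool_call and the tool message, the same error comes back:

```python
import random
import string

def make_tool_call_id(length: int = 9) -> str:
    """Generate a random alphanumeric id of the length Mistral appears to require."""
    return "".join(random.choices(string.ascii_letters + string.digits, k=length))

call_id = make_tool_call_id()
# Attach the same id to the assistant message's tool_call ("id": call_id)
# and to the corresponding tool message ("tool_call_id": call_id).
```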
The JavaScript client also fails with the above errors, plus a third: "An error occurred while fetching the blob".
System Info
- macOS 15.2
- Python 3.13.1
- huggingface_hub 0.28.1
Information
- [ ] Docker
- [x] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
Gist of sample error code is here: https://gist.github.com/awmartin/c64c84fbbdc3a9f0c2ce6e5ae0dab3dc
- Provide an API token
- Run `python inference-tool-calls.py`
An unexpected message results. I expected a typical message with string content, something like, "It's 4 degrees today." Instead, it just repeats the assistant message with the original tool_call:
[ChatCompletionOutputComplete(finish_reason='stop', index=0, message=ChatCompletionOutputMessage(role='assistant', content=None, tool_calls=[ChatCompletionOutputToolCall(function=ChatCompletionOutputFunctionDefinition(arguments={'unit': 'Celsius', 'location': 'Philadelphia, PA, US'}, name='get_current_temperature', description=None), id='0', type='function')]), logprobs=None)]
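For context, the gist amounts to something along these lines (a sketch: the tool schema and wiring are reconstructed from the output above, so details may differ from the actual gist):

```python
from huggingface_hub import InferenceClient

client = InferenceClient("NousResearch/Hermes-3-Llama-3.1-8B", token="hf_...")  # placeholder token

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_temperature",
        "description": "Get the current temperature for a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City, state, country"},
                "unit": {"type": "string", "enum": ["Celsius", "Fahrenheit"]},
            },
            "required": ["location", "unit"],
        },
    },
}]

messages = [
    {"role": "system", "content": "You are a helpful weather assistant."},
    {"role": "user", "content": "What's the weather in Philadelphia?"},
]

first = client.chat_completion(messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]

# Echo the tool call back and supply its result, as in the OpenAI pattern.
messages.append({
    "role": "assistant",
    "tool_calls": [{"id": call.id, "type": "function",
                    "function": {"name": call.function.name,
                                 "arguments": call.function.arguments}}],
})
messages.append({"role": "tool", "tool_call_id": call.id, "content": "4"})

second = client.chat_completion(messages=messages, tools=tools)
print(second.choices[0].message)  # expected text content; instead, another tool_call
```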
Expected behavior
I expected a message that resolved to something similar to "It's 4 degrees Celsius today" rather than the tool_call message repeated.
I think I opened this issue in the wrong repo. Moving it to here: https://github.com/huggingface/huggingface_hub/issues/2829
Reopening as I'm more convinced this is an error with the inference API and not the clients. All the clients (HF JS, HF PY, and OpenAI) fail in the same way.
@awmartin TGI's OpenAI API compatibility is still lacking compared to vLLM.
@calycekr Thanks, I'll check it out!
My workaround for this bug(?) is to remove the "tools" definitions from the follow-up chat completion request that supplies the tool responses/return values. It seems to work for short chats for now, but I suspect there are edge cases that will fail.
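In code terms, the change is just to drop tools from the follow-up request (a sketch of the workaround, reusing the client/messages/tools names from the repro sketch earlier in the thread):

```python
# Workaround sketch: expose tools only on the first request; omit them when
# sending the tool results back so the model is pushed toward a text reply.
first = client.chat_completion(messages=messages, tools=tools)
# ...run the function, append the assistant tool_calls and the tool result...
second = client.chat_completion(messages=messages)  # note: no tools= here
```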
Relatedly, I need to do this for vision models that accept OpenAI "image_url" messages as well. When supplying an image_url, tools are triggered seemingly at random, even though the semantics of the prompt have nothing to do with the tool descriptions. This seems like another bug to report, but I'm not sure whether HF's intent is to be OpenAI-compatible, or to accept prompts, images, and tools and trigger them properly in a more general or more HF-specific sense.
I suspect this is because the input message doesn't support a tool_calls field, so the model doesn't know it already generated a tool_call response and returns a tool_call again.
https://github.com/huggingface/text-generation-inference/blob/main/router/src/lib.rs#L1180
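If that reading is correct, then an assistant turn like the one below would lose its tool_calls when TGI deserializes the request, so the chat template never sees that a call was already made (an illustration of this hypothesis, not verified against the router code; values are placeholders):

```python
# What the client sends back for the assistant turn...
sent_by_client = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "0",
        "type": "function",
        "function": {"name": "get_current_temperature",
                     "arguments": '{"location": "Philadelphia, PA, US", "unit": "Celsius"}'},
    }],
}

# ...versus what the chat template would effectively receive if tool_calls
# isn't a supported field on incoming messages:
seen_by_template = {"role": "assistant", "content": None}
```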
Further description of the problem and my workaround here. These kinds of workarounds will work for simple cases, but when multiple tool calls are required or when images should trigger a tool call, as in OpenAI, they will likely fall short.
Would it be of any help that LM Studio has implemented MLX? There is also the Anemll ANE library for working with MLX, which is MIT-licensed, and FastMLX, which has an Apache 2.0 license.
@qdrddr Thanks. I do use LM Studio and MLX models, but I'm not blocked on getting tool calling working in general; what's hindering me is getting it working as well as it does with OpenAI, specifically with HF. HF's inference API appears to be broken, as @LikeSundayLikeRain may have found.
The app I'm building isn't macOS-specific; it's web-based and intended to support OpenAI, HF, and arbitrary inference endpoints. So these suggestions may well work for local inference setups on macOS like mine, but I haven't tested tool calls on them as extensively yet.
But if the MLX implementation serves as a clue for how to help resolve this bug in HF, that's great. Tool behaviors are highly model-dependent, but this bug may hinder the correct behavior even if the model responds properly.
I am also running into this issue with the OpenAI client. Just tested with TGI 3.2.0; still the same problem: the model responds to the message containing the tool call result with just another tool call.
Is this on anyone's radar?
Same issue here. After we got excited that the tool call now returns a string, like OpenAI's API, we found that the model keeps calling the tool with the same input. As described by others, it seems that the model never gets the tool messages. It's worth mentioning that we are a step further already: the tool calls themselves work as expected now and show up correctly.
Frameworks tested: LangGraph with ChatOpenAI, Langflow with the OpenAI component
TGI version: 3.2.0
It would be helpful to have working tool calls and not have to use a workaround with max retries and extra prompts in Langflow.
We have completely switched to vLLM now, and with some additional settings Llama 3.3 tool calling works exactly as expected. I can imagine that changing the backend from TGI to vLLM could be a solution for some, then.
Yep, ran into this issue. TGI version 3.3.0 with the phi-4 model.
I switched the reply generation model to a serverless hosted model and it just worked.
@zacksiri What exactly did you do to make it work?
@Simon-Stone In my system I have the ability to run multiple LLMs.
I already tried the trick of separating tool calling from reply generation: when tool calling, supply the LLM with the list of tools, and when doing reply generation, remove all the tools. This works for maybe 1-2 turns, but for some reason, after a few turns the reply-generation LLM still asks for a tool call as described in this thread, even when I haven't provided it any tools. (I did, however, put the "Tools Used" in a context system message; maybe that's what got it confused.)
Anyway, what I ended up doing was setting it up like this:
```mermaid
graph LR
  A[message in = nebius ai studio] --> B(tool call llm = tgi hosted)
  B --> C[reply generation = nebius ai studio]
```
I guess at least until TGI solves the issue.
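In client terms, the routing boils down to two OpenAI-compatible clients where only the TGI-hosted one ever sees the tool schemas (a sketch; endpoints, keys, and model names below are placeholders):

```python
from openai import OpenAI

# Placeholders: substitute real endpoints, keys, and model names.
tool_llm = OpenAI(base_url="https://<tgi-endpoint>/v1", api_key="...")
reply_llm = OpenAI(base_url="https://<nebius-ai-studio-endpoint>/v1", api_key="...")

messages = [{"role": "user", "content": "What's the weather in Philadelphia?"}]
tools = []  # the tool schemas go here

# 1) Tool-calling turn: only the TGI-hosted model is given the tools.
tool_turn = tool_llm.chat.completions.create(
    model="tool-model", messages=messages, tools=tools
)

# ...run the requested tools and append their results to `messages`...

# 2) Reply generation: a separate model, given no tools at all, writes the answer.
reply = reply_llm.chat.completions.create(model="reply-model", messages=messages)
```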
Not sure it's the same issue, but it happens with 3.3.0 and Llama 3.3 Instruct 70B. Using 2 tools, sometimes the response from the model (for one of the tools only) is the same as the previous call.
Tool:
```json
{
  "type": "function",
  "function": {
    "name": "ask_patient",
    "description": "Ask patient a question",
    "parameters": {
      "type": "object",
      "properties": {
        "question": {
          "type": "string",
          "description": "The new question to ask"
        },
        "reasoning": {
          "type": "string",
          "description": "Clinical reasoning for asking this specific question"
        }
      },
      "required": ["question", "reasoning"]
    }
  }
}
```
Responses:
Previous:
{'role': 'assistant', 'tool_calls': [{'id': '0', 'type': 'function', 'function': {'description': None, 'name': 'ask_patient', 'arguments': '{"question":"have you noticed any discharge, odor, or irregular bleeding associated with the itch?","reasoning":"The patient presents with a mild itch in the vaginal area. To further evaluate the cause of the itch, it is essential to inquire about any associated symptoms such as discharge, odor, or irregular bleeding, which could indicate conditions like a yeast infection or bacterial vaginosis."}'}}]}
Current:
{'role': 'assistant', 'tool_calls': [{'id': '0', 'type': 'function', 'function': {'description': None, 'name': 'ask_patient', 'arguments': '{"question":"Have you noticed any discharge, odor, or irregular bleeding associated with the itch?","reasoning":"The patient is experiencing an itch in the vaginal area, which could be indicative of various conditions such as yeast infections, bacterial vaginosis, or other infections. To further narrow down the diagnosis, it's essential to inquire about any additional symptoms like discharge, odor, or irregular bleeding."}'}}]}
The question is the same in the following model generation in the tool response; the reasoning, which is the other argument, is what differs.
Is there any progress on this? Is it on the roadmap at all? Being able to use tool calling with models served by TGI would be incredibly useful.