
Server: add function calling API [need investigation]

Open ngxson opened this issue 1 year ago • 1 comments

Motivation

This subject was already brought up in https://github.com/ggerganov/llama.cpp/issues/4216, but my initial research failed.

Recently, I discovered a new line of models designed specifically for this usage: https://github.com/MeetKai/functionary

This model can decide whether to call functions (and which function to call) in a given context. The chat template looks like this:

{#v2.2#}
{% for message in messages %}
  {% if message['role'] == 'user' or message['role'] == 'system' %}
    {{ '<|from|>' + message['role'] + '\n<|recipient|>all\n<|content|>' + message['content'] + '\n' }}
  {% elif message['role'] == 'tool' %}
    {{ '<|from|>' + message['name'] + '\n<|recipient|>all\n<|content|>' + message['content'] + '\n' }}
  {% else %}
    {% set contain_content='no'%}
    {% if message['content'] is not none %}
      {{ '<|from|>assistant\n<|recipient|>all\n<|content|>' + message['content'] }}
      {% set contain_content='yes'%}
    {% endif %}
    {% if 'tool_calls' in message and message['tool_calls'] is not none %}
      {% for tool_call in message['tool_calls'] %}
        {% set prompt='<|from|>assistant\n<|recipient|>' + tool_call['function']['name'] + '\n<|content|>' + tool_call['function']['arguments'] %}
        {% if loop.index == 1 and contain_content == "no" %}
          {{ prompt }}
        {% else %}
          {{ '\n' + prompt}}
        {% endif %}
      {% endfor %}
    {% endif %}
    {{ '<|stop|>\n' }}
  {% endif %}
{% endfor %}
{% if add_generation_prompt %}
  {{ '<|from|>assistant\n<|recipient|>' }}
{% endif %}

Example:

<|from|>system
<|recipient|>all
<|content|>// Supported function definitions that should be called when necessary.
namespace functions {
// Get the current weather
type get_current_weather = (_: {
// The city and state, e.g. San Francisco, CA
location: string,
}) => any;
} // namespace functions
<|from|>system
<|recipient|>all
<|content|>A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant calls functions with appropriate input when necessary
<|from|>user
<|recipient|>all
<|content|>What is the weather for Istanbul?

Possible implementation

Since this is the only model available publicly that can do this, it's quite risky to modify llama_chat_apply_template to support it (we may end up polluting the code base).

The idea is to first keep the implementation in the server example; then, when the template becomes more mainstream, we can adopt it in llama_chat_apply_template.

Data passing in the direction from user ==> model (input direction)

  • [ ] Add a function in the server example to parse the input request and format the prompt. Attention: with function calling, we will have 2 types of system messages: one for the actual prompt ("You are a helpful assistant") and one for the function definitions.

Data passing in the direction from model ==> user (output direction)

  • [ ] Add a grammar so that the model outputs valid JSON when it is inside a function-argument message
  • [ ] Add a parser to extract function calls and their arguments and return them as JSON (see the sketch below)
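
A minimal sketch of what such a parser could look like (Python; it assumes the functionary v2.2 markers shown in the template above, and the function name is illustrative, not code from this issue):

import re

def parse_functionary_output(raw: str) -> dict:
    # Turn a raw functionary v2.2 completion into an OpenAI-style assistant
    # message with optional tool_calls. Illustrative sketch only.
    text = raw.replace("<|stop|>", "")
    # The generation prompt already ends with '<|from|>assistant\n<|recipient|>',
    # so restore the marker if the completion starts with a bare recipient name.
    if not text.lstrip().startswith("<|recipient|>"):
        text = "<|recipient|>" + text
    # Drop any '<|from|>...' lines; they only repeat the sender.
    text = re.sub(r"<\|from\|>[^\n]*\n", "", text)
    content, tool_calls = None, []
    for recipient, body in re.findall(
            r"<\|recipient\|>([^\n]+)\n<\|content\|>(.*?)(?=<\|recipient\|>|$)",
            text, flags=re.DOTALL):
        recipient, body = recipient.strip(), body.strip()
        if recipient == "all":
            content = body  # plain assistant text
        else:
            # arguments stay a raw JSON string, as in the OpenAI tool_calls format
            tool_calls.append({"type": "function",
                               "function": {"name": recipient, "arguments": body}})
    return {"role": "assistant", "content": content,
            "tool_calls": tool_calls or None}

The grammar part would then only need to be active while the model is generating a <|content|> segment that follows a function-name recipient.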

ngxson avatar Feb 19 '24 13:02 ngxson

Research on MeetKai's implementation

My python snippet: https://gist.github.com/ngxson/c477fd9fc8e0a25c52ff4aa6129dc7a1

Key things to notice:

  • This implementation accepts OpenAI tool_calls format as input: https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models

  • Then, the OpenAI schema is converted into MeetKai schema (which is more compact and human-readable): https://github.com/MeetKai/functionary/blob/main/functionary/schema.py

  • The assistant response is in the following format (simplified version; see my Python snippet for more details):

    • Only response (no function call): <|from|>assistant + <|recipient|>all + message + <|stop|>

    • Response with one or multiple function calls: <|from|>assistant + <|recipient|>all + message + multiple times (<|recipient|>{{function_name}} + arguments) + <|stop|>

    • Response with tool_calls=none: <|from|>assistant + <|recipient|>no-tool-call + message + <|stop|>

    • Additionally, it also supports a code interpreter, but it's too complicated to integrate for now: <|from|>assistant + <|recipient|>code-interpreter + ... + <|stop|>

  • MeetKai seems to have grammar-based sampling

  • Official example for the formatted prompt: https://github.com/MeetKai/functionary/blob/main/tests/prompt_test_v2.txt


Link to OAI docs for tool_calls: https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools
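
To illustrate the input direction (OpenAI tools array ==> the TypeScript-like block in the system message above), here is a rough sketch; MeetKai's actual schema.py covers far more of JSON Schema, so treat this as illustrative only:

def tools_to_typescript(tools: list) -> str:
    # Convert OpenAI-style tool definitions into the TypeScript-like
    # "namespace functions" block used in functionary system prompts.
    # Only handles flat objects with simple property types.
    lines = ["// Supported function definitions that should be called when necessary.",
             "namespace functions {"]
    for tool in tools:
        fn = tool["function"]
        lines.append("// " + fn.get("description", ""))
        lines.append("type " + fn["name"] + " = (_: {")
        params = fn.get("parameters", {})
        required = set(params.get("required", []))
        for name, prop in params.get("properties", {}).items():
            if "description" in prop:
                lines.append("// " + prop["description"])
            opt = "" if name in required else "?"
            lines.append(name + opt + ": " + prop.get("type", "any") + ",")
        lines.append("}) => any;")
    lines.append("} // namespace functions")
    return "\n".join(lines)

Feeding the get_current_weather definition from the OpenAI cookbook through this reproduces, roughly, the system message shown in the example above.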

ngxson avatar Feb 19 '24 13:02 ngxson

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Apr 08 '24 01:04 github-actions[bot]

I'm actually early and waiting on some things to finish. I thought I would be busy until mid April, but I might have some free time sooner than I thought. I mention this because, while some models are trained to use tools, I've noticed some models are smart enough to do it on their own with the right amount of solicitation.

  • https://github.com/teleprint-me/py.gpt.prompt/blob/main/docs/notebooks/llama_cpp_grammar_api.ipynb

I'm planning on implementing the proof of concept in more detail in a simplified and streamlined way.

  • https://github.com/teleprint-me/llama-cpp-client

There's also a fine-tuned Mistral model trained to do this:

  • https://huggingface.co/Trelis/Mistral-7B-Instruct-v0.2-function-calling-v3

I don't think it needs it, but it probably helps reduce the amount of context necessary to orient it.

@abetlen also has the functionary model:

  • https://huggingface.co/abetlen/functionary-7b-v1-GGUF

I was "discussing" it with the Mistral 7B v0.2 model quantized to Q4_0 and it understood exactly what I wanted, but this was only after I provided it with the appropriate context. It did surprisingly well regardless.

The only reason I really care about this is because I want the models to have a "memory" via a SQLite database (see the sketch below). It's something I've been working on for over a year because I genuinely do not like "RAG", which is really just Q&A over segmented text with language models. I never really liked it and always felt dissatisfied with it.

  • https://github.com/teleprint-me/py.gpt.prompt/blob/main/pygptprompt/function/memory.py
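
As a sketch of that idea (the tool definition and handler below are hypothetical and simplified, not the actual pygptprompt code): a SQLite-backed memory is just another function definition handed to the model, plus a local handler executed when the model calls it.

import sqlite3

remember_tool = {  # hypothetical tool definition shown to the model
    "type": "function",
    "function": {
        "name": "remember",
        "description": "Store a fact for later recall",
        "parameters": {
            "type": "object",
            "properties": {"fact": {"type": "string", "description": "The fact to store"}},
            "required": ["fact"],
        },
    },
}

def remember(fact: str, db_path: str = "memory.db") -> str:
    # Handler executed locally when the model emits a 'remember' tool call.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS memory (fact TEXT)")
    con.execute("INSERT INTO memory (fact) VALUES (?)", (fact,))
    con.commit()
    con.close()
    return "stored"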

teleprint-me avatar Apr 08 '24 02:04 teleprint-me

Please let me cast my humble vote in favour of this issue. It seems that agent capability is going to be the next big thing in LLMs. I mean, seriously, chatting and RAG are supported by literally every possible toolkit, with their simplicity and limitations, but in order to keep up with the big tech, the open-source community must move on. OK, enough talk.

My goal is to be able to run (at the very least) this: https://docs.llamaindex.ai/en/stable/examples/agent/openai_agent/ or this: https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Functions.ipynb but with the llama.cpp server as a backend, either directly or with a wrapper/adaptor. Currently it fails, obviously, with openai.InternalServerError: Error code: 500 - {'error': {'code': 500, 'message': 'Unsupported param: tools', 'type': 'server_error'}}

I have yet to explore @ngxson 's #5695 solution. It seems, though, that it is geared towards MeetKai (can anyone confirm this?), while we need a universal solution that can support the llama-3, OpenAI, etc. interfaces. To my intermediate understanding, the support boils down to a set of prompt templates appropriate for a particular model (can anyone confirm this, too?). I am particularly interested in llama-3-instruct model support.

I have found a similar solution that works with the llama.cpp server (more or less): https://github.com/Maximilian-Winter/llama-cpp-agent. Unfortunately, it is not compatible with llamaindex out of the box.

skoulik avatar Apr 30 '24 02:04 skoulik

To my intermediate understanding the support boils down to a set of prompt templates appropriate for a particular model

Yes, that's correct. Function calling is essentially just a more complicated chat template.

When I first started this PR, MeetKai's functionary was the only open-source model to implement this idea. Of course we have many new models now, but the problem is still the same as with chat templates: there is no "standard" way; each model uses its own template.

Also, because we have more visibility now (i.e. more models to see the pattern from), I'm planning to re-make all of this, maybe as a dedicated side project: a wrapper around llama.cpp's server, because it will be quite messy. Then we will see if one day we can merge it back into llama.cpp.

ngxson avatar Apr 30 '24 11:04 ngxson

Hi @ngxson , thank you for getting back. I've quickly skimmed through your commits and haven't found any mention of JSON-to-grammar conversions (https://github.com/ggerganov/llama.cpp/tree/master/grammars). (Or have I just missed it?) If this is the case, it is something worth exploring. Grammars that restrict models' output have been shown to greatly increase the quality of function-calling output (a random but relevant fact that I've learned googling around).
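
As a concrete illustration of grammar-restricted output (a sketch only: it assumes the server's /completion endpoint with its grammar parameter, and uses a toy hand-written GBNF grammar instead of an automatic JSON-schema conversion):

import requests

# Toy GBNF grammar forcing a JSON object with a single "location" string field,
# matching the get_current_weather example earlier in the thread.
GRAMMAR = r'''root   ::= "{" ws "\"location\"" ws ":" ws string ws "}"
string ::= "\"" [^"]* "\""
ws     ::= [ \t\n]*'''

resp = requests.post("http://localhost:8080/completion", json={
    # Abbreviated functionary-style prompt; a real request would include the
    # full system messages with the function definitions.
    "prompt": "<|from|>user\n<|recipient|>all\n<|content|>What is the weather for Istanbul?\n"
              "<|from|>assistant\n<|recipient|>get_current_weather\n<|content|>",
    "grammar": GRAMMAR,   # constrain sampling so the arguments are always valid JSON
    "n_predict": 64,
})
print(resp.json()["content"])   # e.g. {"location": "Istanbul"}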

skoulik avatar May 01 '24 01:05 skoulik

@ngxson @skoulik https://github.com/ggerganov/llama.cpp/pull/6389

teleprint-me avatar May 01 '24 02:05 teleprint-me

@ngxson @skoulik #6389

This seems to be it. Great!

skoulik avatar May 01 '24 06:05 skoulik

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Jun 16 '24 01:06 github-actions[bot]

These models support function calling (without fine-tuning):

  • ChatGLM3/GLM-4
  • Mistral v0.3
  • Qwen v1.5 & v2

For Qwen, function calling can be implemented outside the inference application (see the sketch below).

I have implemented these in chatllm.cpp.
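
To show what "outside the inference application" looks like in practice, here is a minimal sketch (the ReAct-style Action / Action Input markers are an assumption about the prompt format, not Qwen's exact template): the caller puts the tool descriptions into the prompt, calls a plain completion endpoint, and parses the reply itself.

import json
import re

def extract_tool_call(reply: str):
    # Pull a ReAct-style tool call out of a plain completion.
    # Returns (name, arguments) or None when the model answered directly.
    # Marker names are illustrative; each model family has its own format.
    m = re.search(r"Action:\s*(\S+)\s*Action Input:\s*(\{.*?\})", reply, re.DOTALL)
    if m is None:
        return None
    return m.group(1), json.loads(m.group(2))

print(extract_tool_call(
    "Thought: I need the weather.\n"
    "Action: get_current_weather\n"
    "Action Input: {\"location\": \"Istanbul\"}"))
# -> ('get_current_weather', {'location': 'Istanbul'})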

foldl avatar Jun 17 '24 11:06 foldl