
[WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python)


Still very rough, but sharing a draft to get early feedback on the general direction.

This is an experiment in adding grammar-constrained tool support to llama.cpp, with a simple example of running agentic code on top, and support for sandboxing unsafe tools (e.g. Python interpreter).

Instead of bloating server.cpp any further, this slaps a Python layer in front of it to handle tool calling: partly because it's hard to do this well w/o proper jinja2 support (chat templates handle tool calling peculiarly at best), and partly because it could be a way to simplify the C++ server and keep it focused on performance and security rather than on schemas and chat templates (WDYT?).
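To give an idea of the shape of that layer, here's a rough sketch (illustrative only, not the actual examples/openai code, which also turns tool schemas into grammars; the /completion endpoint and its fields are assumed to match the stock llama.cpp server):

# Rough sketch: render the chat template w/ jinja2 in Python, then forward a
# plain completion request to the C++ server.
import requests
from jinja2 import Template

LLAMA_SERVER = "http://localhost:8080"

def chat_completion(messages: list[dict], chat_template: str, grammar: str | None = None) -> str:
    prompt = Template(chat_template).render(messages=messages, add_generation_prompt=True)
    payload = {"prompt": prompt, "n_predict": 512}
    if grammar:
        payload["grammar"] = grammar  # optional GBNF grammar, e.g. constraining tool-call JSON
    response = requests.post(f"{LLAMA_SERVER}/completion", json=payload)
    response.raise_for_status()
    return response.json()["content"]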

So this PR has a long way to go, but here's what can be done with it:

Show install instructions
  • Note: To get conda, just install Miniforge (it's OSS): https://github.com/conda-forge/miniforge
  • For Docker (to sandbox unsafe tools), consider lima (containerd / OSS) or Docker Desktop
git clone https://github.com/ochafik/llama.cpp --branch agent-example --single-branch llama.cpp-agent
cd llama.cpp-agent
conda create -n agent python=3.11
conda activate agent
pip install -r examples/agent/requirements.txt

make clean && make -j server
python -m examples.agent \
    --model mixtral-8x7b-instruct-v0.1.Q8_0.gguf \
    --tools examples/agent/tools/example_math_tools.py \
    --greedy \
    --goal "What is the sum of 2535 squared and 32222000403 then multiplied by one and a half. What's a third of the result?"
Show output
🛠️  add, divide, multiply, pow
💭 First, I will calculate the square of 2535, then add it to 32222000403. After that, I will multiply the result by 1.5 and finally, I will divide the result by 3.
⚙️  pow(value=2535, power=2) -> 6426225.0
💭 Now that I have calculated the square of 2535, I will calculate the sum of 6426225 and 32222000403.
⚙️  add(a=6426225, b=32222000403) -> 32228426628
💭 Now that I have calculated the sum, I will multiply it by 1.5.
⚙️  multiply(a=32228426628, b=1.5) -> 48342639942.0
💭 Now that I have calculated the product, I will divide it by 3.
⚙️  divide(a=48342639942.0, b=3) -> 16114213314.0
➡️  The result of the calculation is 16114213314.0.
python -c "print((2535**2 + 32222000403)*1.5 / 3)"
# 16114213314.0
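For reference, tools are just plain typed Python functions whose signatures drive the JSON schemas the model is constrained to. The math tools above look roughly like this (illustrative sketch with parameter names taken from the trace; the actual examples/agent/tools/example_math_tools.py may differ):

def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b

def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

def divide(a: float, b: float) -> float:
    """Divide a by b."""
    return a / b

def pow(value: float, power: float) -> float:
    """Raise value to the given power."""
    return value ** power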
python -m examples.agent \
    --tools examples/agent/tools/fake_weather_tools.py \
    --goal "What is the weather going to be like in San Francisco and Glasgow over the next 4 days." \
    --greedy
Show output
🛠️  get_current_weather, get_n_day_weather_forecast
💭 I will first get the current weather in San Francisco, then get the 4-day weather forecast for both San Francisco and Glasgow.
⚙️  get_current_weather(location=San Francisco, format=fahrenheit) -> ...
💭 I will first get the current weather in San Francisco, then get the 4-day weather forecast for both San Francisco and Glasgow.
⚙️  get_n_day_weather_forecast(location=San Francisco, format=fahrenheit, num_days=4) -> ...
💭 I will first get the current weather in San Francisco, then get the 4-day weather forecast for both San Francisco and Glasgow.
⚙️  get_n_day_weather_forecast(location=Glasgow, format=celsius, num_days=4) -> ...
The current weather in San Francisco is sunny and 87.8F. Here is the 4-day weather forecast:

For San Francisco:
- In 1 day: Cloudy, 60.8F
- In 2 days: Sunny, 73.4F
- In 3 days: Cloudy, 62.6F

For Glasgow:
- In 1 day: Cloudy, 16C
- In 2 days: Sunny, 23C
- In 3 days: Cloudy, 17C
python -m examples.agent --std-tools --goal "Say something nice in 1 minute."
Show output
🛠️  ask_user, say_out_loud, wait_for_date, wait_for_duration
💭 Thinking about what to say in the next minute.
⚙️  say_out_loud(something="In the next minute, I'll share a kind and uplifting message. Please wait...") -> None
💭 Waiting for the specified duration.
⚙️  wait_for_duration(duration={"seconds": 60}) -> None
💭 Thinking about what to say after the waiting period.
⚙️  say_out_loud(something="Thank you for your patience. Here's a nice message for you: 'A smile is the prettiest thing you can wear. So let your smile shine through.' - Dolly Parton") -> None
➡️ "The task of saying something nice in 1 minute is complete."

Add --verbose to see what's going on, and look at examples/agent/README & examples/openai/README for more details.

Tool sandboxing

Since tools can quickly become unsafe (don't want a rogue AI poking at your files), I've added a simple script to sandbox tools. It wraps a Python module as a REST server inside a Docker container exposing its port, and since it's using FastAPI it gives a neat OpenAPI schema that can be consumed by the agent code.
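Conceptually the Python side of that wrapper boils down to something like this (a simplified, hypothetical sketch; the actual script also builds the Docker image, installs requirements, etc.):

# Hypothetical sketch: expose every public function of a tools module as a POST
# endpoint and let FastAPI publish the OpenAPI schema the agent later consumes.
import importlib
import inspect
import os

import uvicorn
from fastapi import FastAPI

def serve_tools(module_name: str, port: int) -> None:
    app = FastAPI()
    module = importlib.import_module(module_name)
    for name, fn in inspect.getmembers(module, inspect.isfunction):
        if not name.startswith("_"):
            # Each tool becomes POST /<name>; its signature drives the schema.
            app.post(f"/{name}")(fn)
    uvicorn.run(app, host="0.0.0.0", port=port)

if __name__ == "__main__":
    serve_tools("examples.agent.tools.unsafe_python_tools", int(os.environ.get("PORT", "9999")))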

Run this in a separate terminal to get a sandboxed python interpreter (DATA_DIR will contain any files created by Python programs):

# Note: with limactl, the default sandbox location ~/.llama.cpp/sandbox won't be writable
# (see https://github.com/lima-vm/lima/discussions/393)
# export DATA_DIR=/tmp/lima/llama.cpp/sandbox
PORT=9999 examples/agent/run_sandboxed_tools.sh \
        examples/agent/tools/unsafe_python_tools.py

# INFO: using DATA_DIR: /Users/ochafik/.llama.cpp/sandbox
# ...
# INFO:     Uvicorn running on http://0.0.0.0:9999 (Press CTRL+C to quit)

Then tell the agent to discover tools at the new endpoint:

python -m examples.agent \
    --tools http://localhost:9999 \
    --goal "Whats cos(123) / 23 * 12.6 ?"
Show output
💭 To calculate the given mathematical expression, I will use the 'execute_python' tool.
⚙️  execute_python(source="import math\nresult = math.cos(123) / 23 * 12.6") -> {'result': -0.4864525314920599}
➡️ "-0.4864525314920599"
python -m examples.agent \
    --tools http://localhost:9999 \
    --goal "Create a file with 100k spaces"
Show output
💭 To accomplish the task of creating a file with 100k spaces, I need to consider how to generate and save this content. I will use the 'execute_python' tool to create a Python script that generates the spaces and writes them to a file.
⚙️  execute_python(source="with open('spaces_file.txt', 'w') as f:\n\tf.write(' ' * 100000)\n") -> {}
➡️ "File 'spaces_file.txt' has been created with 100,000 spaces."
wc -c ~/.llama.cpp/sandbox/spaces_file.txt
# 100000 ~/.llama.cpp/sandbox/spaces_file.txt
python -m examples.agent \
    --tools http://localhost:9999 \
    --goal "Write and run a program with a syntax error, then fix it"
Show output
💭 I need to create a Python program with a syntax error, run it, and then correct the error.
⚙️  execute_python(source="def hello:\\n    print('Hello, world')\\nheelo()") -> Error: expected '(' (<string>, line 1)
💭 I see that there is a syntax error in the code due to a typo. I need to correct the typo and run the code again.
⚙️  execute_python(source="def hello():\\n    print('Hello, world')\\nhello()") -> Error: unexpected character after line continuation character (<string>, line 1)
💭 It seems there is another syntax error in the corrected code. I need to investigate the issue and run the code again.
⚙️  execute_python(source="def hello():\n    print('Hello, world')\nhello()") -> {}
➡️ "Hello, world"

Everybody gets tool calling support!

Some models have been explicitly fine-tuned for tool usage (e.g. Functionary, with tentative support in https://github.com/ggerganov/llama.cpp/pull/5695, or Hermes 2 Pro Mistral 7B, which has a nice repo about it).

Other models don't officially support tool calling, at least not in their OSS releases... (Mixtral 👀)

But since https://github.com/ggerganov/llama.cpp/pull/5978, all can be coerced into sticking to a specific JSON schema.

This example supports the following tool prompting strategies in examples/openai/prompting.py (see dizzying combos of outputs):

  • --style=thoughtful_steps: the default unless a Functionary template is detected.

    Constrains the output to JSON matching the following TypeScript-style signature (advertised to the model as a JSON schema), which fully constrains all of the function arguments:

      {
        thought_about_next_step_only: string,
        next_step: (
          {result: T} |
          {
            tool_calls: (
              {name: "function1", arguments: {"arg1": ..., "arg2":...}} |
              {name: "function1", arguments: {"arg1": ..., "arg2":...}} |
              ...
            )[]
          }
        )
      }
      // Where T is the output JSON schema from the --format flag, or 'any'
    

    It seems quite important to give the model some space to think before it even decides whether it already has the final output or needs extra steps (a plain thought field might work just as well, YMMV). Note that by default only one tool call is allowed; for models that support parallel tool calling, you can pass --parallel-calls (Functionary handles this well, but Mixtral-instruct tends to hallucinate).

  • --style=functionary_v2: besides using the proper template, this formats the tool signatures as TypeScript and deals with interesting edge cases (TODO: check whether this is the only template that expects a function call's arguments to be a JSON string rather than a JSON object).

  • --style=short / long: announces the tools in a <tool>...schemas...</tool> block in the system prompt, and uses a less constrained output grammar that allows mixing free text with <tool_call>{json}</tool_call> inserts.

    Since there is no negative lookahead (nor a reluctant repetition modifier), I found it hard to write a grammar that allows "any text not containing <tool_call>, then maybe a <tool_call>". I settled for something a bit brittle (content := [^<] | "<" [^t<] | "<t" [^o<]); suggestions welcome!

  • --style=mixtral: OK, now it gets weird. Mixtral works well w/ --style=thoughtful_steps (I just had to collapse system and tool messages into user messages, as its chat template is very restrictive), but when prompted w/ "You have these tools <tools>{json schemas}</tools>", it spontaneously calls tools with the semi-standard syntax also used by Hermes... except with spurious underscore escapes 🤔

    Imma tell you what i'm doin'
    <tool\_call>
    {"arguments": ..., "name": "my\_weirdly\_escaped\_function\_name"}
    </tool\_call>
    

    So in the mixtral style I just unescape the underscores and we get a tool-calling Mixtral (the style is otherwise much like long / short, and would also benefit from more grammar features).
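    The cleanup itself is trivial; conceptually something like this (hypothetical sketch, names are illustrative):

      # Hypothetical sketch: undo the spurious escapes, then parse the
      # semi-standard <tool_call> JSON blocks.
      import json
      import re

      def parse_mixtral_tool_calls(text: str) -> list[dict]:
          text = text.replace("\\_", "_")
          return [
              json.loads(block)
              for block in re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
          ]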

TODOs

  • [x] Auto discover tools exposed by an OpenAPI endpoint
  • [x] Fix / mitigate occasional "lost child process" issue
  • [x] Add a Python code tool + make the sandbox work
  • [ ] Turn the Python tool into a notebook / capture stdout / handle errors properly
  • [ ] Wait for spawned servers to be healthy
  • [ ] Add model URL / HF loading support
  • [ ] Send separate PR w/ (cleaner) JSON number length cap (context: I was asking Mixtral to divide by 3 and it multiplied by... 0.33333333333333...) → https://github.com/ggerganov/llama.cpp/pull/6555
  • [ ] Send separate PR w/ gguf utils (kv reading, lazy kv map, reader example update)
  • [ ] Get some general approval & integrate feedback
  • [ ] Measure token overhead of schemas & compress tool call syntax if possible (TS signatures borrowed from Functionary already help)
  • [ ] Stream all the things
  • [ ] Finish unit tests (wrap grammar parsing tester from python?)
  • [ ] Open discussion re/ grammar features: reluctant modifier maybe? Is it actually needed?
  • [ ] Server integration tests
  • [ ] Prepare code for review (doc strings)

ochafik avatar Mar 29 '24 20:03 ochafik

Thanks for the effort to bring this nice feature :1st_place_medal:. Please mind pushing commits to your fork first, as it triggers a lot of CI runs on the main repo.

phymbert avatar Mar 30 '24 06:03 phymbert

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 551 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8485.01ms p(95)=20500.73ms fails=, finish reason: stop=492 truncated=59
  • Prompt processing (pp): avg=93.53tk/s p(95)=407.01tk/s
  • Token generation (tg): avg=32.83tk/s p(95)=48.52tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=agent-example commit=2b2127c2a31d3456fd639b6a15bbb6b271920fea

Charts omitted (title: "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 551 iterations"): prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing

github-actions[bot] avatar Apr 09 '24 10:04 github-actions[bot]

Please mind pushing commits to your fork first, as it triggers a lot of CI runs on the main repo.

@phymbert sorry for the CI noise again today, wanted to get the PR in good working order. Branch agent-example-tmp will get most of the updates going forward (I might also experiment with skip-checks:true).

ochafik avatar Apr 10 '24 08:04 ochafik

@phymbert sorry for the CI noise again today, wanted to get the PR in good working order.

Please forgive my comment; I am sure you do your best, and I personally pushed 60+ commits in the last 4 days ;) and now the CI cancels concurrent jobs. Good luck!

phymbert avatar Apr 10 '24 09:04 phymbert

This is a bit off-topic, but I noticed in your example call:

python -m examples.agent \
    --model mixtral-8x7b-instruct-v0.1.Q8_0.gguf \
    --tools examples/agent/tools/example_math_tools.py \
    --greedy \

Is there a particular reason you're using greedy sampling? If so, I think there is an opportunity for speedup when using grammars and greedy sampling, but I wasn't sure how frequently greedy sampling was used, so I haven't chased it down yet.

HanClinto avatar Apr 30 '24 17:04 HanClinto

Good day @ochafik, I know you are busy with important things, but still. I've been experimenting with your agent-example branch in an attempt to pair llama.cpp with the llamaindex agentic API. It almost works as expected. The only glitch I've found so far is that llamaindex barks at me because it expects function arguments to be a string (JSON-escaped and converted to a string?) while your agent returns them as a pure JSON object. Is this supposed to be configurable? Or is it just a WIP? An omission?

According to the official OpenAI API, they produce a string too; see here, for instance: https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models

Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_ujD1NwPxzeOSCbgw2NOabOin', function=Function(arguments='{\n "location": "Glasgow, Scotland",\n "format": "celsius",\n "num_days": 5\n}', name='get_n_day_weather_forecast'), type='function')]), internal_metrics=[{'cached_prompt_tokens': 128, 'total_accepted_tokens': 0, 'total_batched_tokens': 273, 'total_predicted_tokens': 0, 'total_rejected_tokens': 0, 'total_tokens_in_completion': 274, 'cached_embeddings_bytes': 0, 'cached_embeddings_n': 0, 'uncached_embeddings_bytes': 0, 'uncached_embeddings_n': 0, 'fetched_embeddings_bytes': 0, 'fetched_embeddings_n': 0, 'n_evictions': 0, 'sampling_steps': 40, 'sampling_steps_with_predictions': 0, 'batcher_ttft': 0.035738229751586914, 'batcher_initial_queue_time': 0.0007979869842529297}])
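Converting between the two conventions looks like a small shim on either side, e.g. (rough sketch, assuming the agent returns arguments as a dict; stringify_tool_call_arguments is just an illustrative name):

# Rough sketch (not actual llama.cpp or llamaindex code): stringify each tool
# call's arguments to match the OpenAI convention.
import json

def stringify_tool_call_arguments(message: dict) -> dict:
    for call in message.get("tool_calls", []):
        args = call["function"]["arguments"]
        if not isinstance(args, str):
            call["function"]["arguments"] = json.dumps(args)
    return message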

Might be something to do with extra security/agents isolation.

skoulik avatar May 06 '24 06:05 skoulik

Good day @ochafik, I know you are busy with important things, but still. I've been experimenting with your agent-example branch in an attempt to pair llama.cpp with the llamaindex agentic API. It almost works as expected. The only glitch I've found so far ... Might be something to do with extra security/agents isolation.

Found expects_stringified_function_arguments ...

skoulik avatar May 06 '24 06:05 skoulik