
Feature request: documented machine-readable output as a top tier feature

Open simonw opened this issue 2 years ago • 15 comments

I'd like to start building software on top of the llama.cpp family of tools, using JavaScript (Deno) and Python and potentially other languages too.

I don't want to mess around with C bindings for those languages. I would much rather fire up a subprocess and communicate with it from my program over stdin/stdout/stderr.

A documented input and output protocol for doing this would be amazing.

Some ideal characteristics:

  • Lets me keep the model loaded in memory/on the GPU, such that I can send it a new prompt without paying a startup cost each time
  • Supports streaming, so I can stream tokens back to my user as they become available
  • Stable and documented, so I rarely need to rewrite my calling code to adapt to new features
  • Supports "chat" mode where appropriate, though this is far less important to me personally than the other features

simonw avatar Jun 07 '23 09:06 simonw

Here's a suggested design for this. I really like newline-delimited JSON as a way of sending data into and out of a program via standard input and output - it's a really easy format to work with, since each interaction just involves reading or writing a line of text to the process.

So how about this: you start the model running using something like this:

./main -m ./models/7B/ggml-model-q4_0.bin --jsonl

Note the --jsonl option to put it in machine-readable newline-delimited JSON mode.

Then to send it a prompt you write a JSON object as a single line to standard input of the process like this:

{"prompt": "Names for a pet pelican", "temp": 0.5, "repeat-penalty": 1.5, "n": 512}

This format can be reduced to just {"prompt": "Names for a pet pelican"}\n to use default settings, and can be expanded to cover dozens of other options.

Then the results start streaming back via standard output. The stream looks like this:

{"content": "Here"}
{"content": " are"}
{"content": " some"}
{"content": " names"}
{"content": " for"}
...
{"end": true}

As you can see, each output token gets its own newline-delimited object, and a special {"end": true} indicates the end of the response.

(Alternative design: the last token in the stream could be {"content": " enjoy!", "end": true} with an extra "end" key.)

Additional information such as logit scores could be incorporated into these output objects as well, maybe controlled by extra options sent along with the prompt.
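
To make the interaction concrete, here's a rough Python sketch of what a caller would look like, assuming the --jsonl flag and the field names above existed exactly as proposed (they don't yet - this is the suggestion, not a real llama.cpp interface):

import json
import subprocess

# Hypothetical: --jsonl and the request/response fields are the proposal above,
# not an existing llama.cpp interface.
proc = subprocess.Popen(
    ["./main", "-m", "./models/7B/ggml-model-q4_0.bin", "--jsonl"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)

def run_prompt(prompt, **options):
    # One request is a single line of JSON written to stdin...
    proc.stdin.write(json.dumps({"prompt": prompt, **options}) + "\n")
    proc.stdin.flush()
    # ...and the reply streams back as one JSON object per line of stdout.
    for line in proc.stdout:
        obj = json.loads(line)
        if obj.get("end"):
            break
        yield obj["content"]

for token in run_prompt("Names for a pet pelican", temp=0.5):
    print(token, end="", flush=True)

The model stays loaded in the subprocess between calls to run_prompt(), which covers the "keep the model in memory" requirement, and the per-line objects cover streaming.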

simonw avatar Jun 07 '23 09:06 simonw

Here's an example of the kind of thing I've built using this pattern in the past: https://simonwillison.net/2020/Nov/28/datasette-ripgrep/

That runs ripgrep in a subprocess - ripgrep provides a --json mode: https://github.com/simonw/datasette-ripgrep/blob/e3b2bb937380170ea729d9ae0abfb95a7002a70e/datasette_ripgrep/__init__.py#L9
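
For reference, the calling side of that pattern looks something like this sketch (ripgrep's --json mode really does emit one JSON event per line; the exact field names below are from memory of its output format):

import json
import subprocess

# rg --json emits newline-delimited JSON events (begin/match/end/summary).
proc = subprocess.Popen(
    ["rg", "--json", "TODO", "."],
    stdout=subprocess.PIPE, text=True,
)
for line in proc.stdout:
    event = json.loads(line)
    if event["type"] == "match":
        data = event["data"]
        print(data["path"]["text"], data["lines"]["text"].rstrip())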

simonw avatar Jun 07 '23 10:06 simonw

You may be interested in the API improvements that are pending: #1570

SlyEcho avatar Jun 07 '23 10:06 SlyEcho

You may be interested in the API improvements that are pending: #1570

...sure ... and stuck now ....

mirek190 avatar Jun 07 '23 11:06 mirek190

I think a stdin/stdout mode would still be useful even with an available HTTP API server.

I also want to be able to run prompts from other CLI tools - such as https://github.com/simonw/llm - without needing to start a web server on an available port, run a prompt through it, and then stop the server again afterwards.

Any JSON-over-stdin/stdout mode should use the same JSON design as the server as much as possible, for consistency - e.g. to match this:

https://github.com/ggerganov/llama.cpp/blob/df2ecc942a824d5a11cdd6d3083915f28ab24628/examples/server/README.md?plain=1#L46-L48

simonw avatar Jun 07 '23 12:06 simonw

Yeah, sure. What we discovered is that the code is much simpler, actually, when it doesn't have to deal with reading user input from the terminal.

I don't know how hard it would be but maybe it's possible to run the server on a socket instead? Keeping the server running has the advantage that the model is kept in memory and it lowers the amount of initialization that needs to be done.

Maybe there could be a core that runs the JSON commands that both the server and main.cpp could use? For one thing, right now the sampling code is duplicated in both places.

SlyEcho avatar Jun 07 '23 12:06 SlyEcho

For things like shell scripts it would be useful to have something simpler - no need to pass the prompt via stdin, and the full output returned in one go as opposed to streaming tokens.

Maybe something like this:

./main \
  -m ./models/7B/ggml-model-q4_0.bin \
  -p "Names for a pet pelican:" \
  -n 512 \
  --json

Which would output the response to stdout like this, and then exit with a success exit code:

{"response": "Here are some names for a..."}

The goal here would be to support simple scripts - for example bash scripts - that just want to run a prompt and process the result.

Outputting as JSON is particularly useful here as it makes it easy to combine with other tools such as jq.

Because of this, I think the interactive newline-delimited JSON mode I suggested earlier should be invoked using extra options - so --json can work for the simplest case.

Maybe --jsonl or --json --interactive.
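
With a plain --json mode in place, calling it from a script would be as simple as this sketch (again, the --json flag and the "response" key are the suggestion above, not an existing llama.cpp option):

import json
import subprocess

# Hypothetical single-shot mode: one process, one prompt, one JSON object on stdout.
result = subprocess.run(
    ["./main", "-m", "./models/7B/ggml-model-q4_0.bin",
     "-p", "Names for a pet pelican:", "-n", "512", "--json"],
    capture_output=True, text=True, check=True,
)
print(json.loads(result.stdout)["response"])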

simonw avatar Jun 07 '23 12:06 simonw

Yeah, sure. What we discovered is that the code is much simpler, actually, when it doesn't have to deal with reading user input from the terminal.

That's the thing I like about newline-delimited JSON - it's the simplest possible form of interacting with a process, because each write is terminated by a newline character and any reads from the process can continue until a newline character is spotted too.
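
From the tool's side the whole protocol collapses to a read-line/write-line loop, something like this sketch (generate() here is just a stand-in for the real sampling code, and the field names are the proposed ones):

import json
import sys

def generate(prompt):
    # Stand-in for the real sampling loop; yields "tokens" one at a time.
    yield from prompt.split()

for line in sys.stdin:
    request = json.loads(line)
    for token in generate(request["prompt"]):
        print(json.dumps({"content": token}), flush=True)
    print(json.dumps({"end": True}), flush=True)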

simonw avatar Jun 07 '23 12:06 simonw

I don't know the state of it currently, but just piping stdout should work right now - it's not JSON, just plain text. There were some problems when you tried to add a stop keyword: in that case the program started running in interactive mode and took over the terminal. Maybe that's been fixed.

SlyEcho avatar Jun 07 '23 12:06 SlyEcho

Yeah, I've tried writing code against the current unstructured output.

It can work, but the calling code ends up pretty messy. More importantly, since that format isn't a documented interface there are no guarantees at all that it won't change in the future in a way that would break my scripts.

simonw avatar Jun 07 '23 13:06 simonw

Also discussed on Twitter here: https://twitter.com/simonw/status/1666382966048837632

simonw avatar Jun 07 '23 15:06 simonw

I'd also love to see this. examples/chat-persistent.sh invokes main for individual completions (rather than as a single interactive process) and resorts to hacks to extract generated text and token counts, for want of something like this.

Lets me keep the model loaded in memory/on the GPU, such that I can send it a new prompt without paying a startup cost each time

mmap should already avoid the model startup cost (although I don't know how that interacts with the GPU)? As for the prompt-processing time on longer prompts, an option now is to use --prompt-cache/--prompt-cache-all, at the cost of disk space/IO and additional bookkeeping. chat-persistent.sh uses this. At least on the M2 Max I can start generation from a warm prompt cache of arbitrary length with negligible delay.

ejones avatar Jun 07 '23 17:06 ejones

The GPU has more startup overhead, especially when loading the layers into VRAM. On my system rocBLAS also takes a few seconds to start up, but that is still experimental.

SlyEcho avatar Jun 07 '23 17:06 SlyEcho

It would be awesome to have support for multiple streams and/or logit masking via the protocol. I think a more standardised approach going beyond llama.cpp (e.g. HuggingFace, FastChat, etc.) would make a ton of sense. Happy to collaborate on this also, as we are working on very similar problems in LMQL.

lbeurerkellner avatar Jun 07 '23 19:06 lbeurerkellner

I'd also love to see this. examples/chat-persistent.sh invokes main for individual completions (rather than as a single interactive process) and resorts to hacks to extract generated text and token counts, for want of something like this.

Oh I'd missed that!

https://github.com/ggerganov/llama.cpp/blob/8fc8179919a11738910db07a800f2b176f8adf09/examples/chat-persistent.sh#L26-L27

Yeah, that's exactly the kind of hack I'd like to avoid by having a --json option.

simonw avatar Jun 08 '23 09:06 simonw

I don't know how hard it would be but maybe it's possible to run the server on a socket instead?

Adding this line:

svr.set_address_family(AF_UNIX);

And launching the server like this:

./bin/server --host /tmp/llama.socket ...

Kind of worked:

curl --silent --unix-socket /tmp/llama.socket --url ./completion --data '{
        "prompt": "### Instruction:\nWrite a simple story.\n\n### Response:\n",
        "n_predict": 20
    }' | jq '.content';
"Once upon a time, there was a little girl named Lily who lived in a small cott"

But I'm not sure if it would work on Windows.
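
On platforms that do have AF_UNIX, a client doesn't need curl either - for example, a small http.client subclass in Python can talk to the same socket (the socket path and the /completion endpoint are taken from the example above):

import http.client
import json
import socket

class UnixHTTPConnection(http.client.HTTPConnection):
    # http.client over an AF_UNIX socket instead of TCP.
    def __init__(self, path):
        super().__init__("localhost")
        self.unix_path = path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.unix_path)

conn = UnixHTTPConnection("/tmp/llama.socket")
conn.request(
    "POST", "/completion",
    body=json.dumps({"prompt": "Write a simple story.", "n_predict": 20}),
    headers={"Content-Type": "application/json"},
)
print(json.loads(conn.getresponse().read())["content"])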

anon998 avatar Jun 10 '23 05:06 anon998

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Apr 10 '24 01:04 github-actions[bot]