Feature request: documented machine-readable output as a top tier feature
I'd like to start building software on top of the llama.cpp family of tools, using JavaScript (Deno) and Python and potentially other languages too.
I don't want to mess around with C bindings for those languages. I would much rather fire up a subprocess and communicate with it from my program over stdin/stdout/stderr.
A documented input and output protocol for doing this would be amazing.
Some ideal characteristics:
- Lets me keep the model loaded in memory/on the GPU, such that I can send it a new prompt without paying a startup cost each time
- Supports streaming, so I can stream tokens back to my user as they become available
- Stable and documented, so I rarely need to rewrite my calling code to adapt to new features
- Supports "chat" mode where appropriate, though this is far less important to me personally than the other features
Here's a suggested design for this. I really like newline-delimited JSON as a way of sending data into and out of a program via standard input and output - it's an easy format to work with, since each interaction just involves reading or writing a single line of text to the process.
So how about this: you start the model running using something like this:
./main -m ./models/7B/ggml-model-q4_0.bin --jsonl
Note the --jsonl option to put it in machine readable newline-delimited JSON mode.
Then to send it a prompt you write a JSON object as a single line to standard input of the process like this:
{"prompt": "Names for a pet pelican", "temp": 0.5, "repeat-penalty": 1.5, "n": 512}
This format can be reduced to just {"prompt": "Names for a pet pelican"}\n to use default settings, and can be expanded to cover dozens of other options.
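To make this concrete, here's a rough sketch of what the calling side could look like in Python (the --jsonl flag is hypothetical - it's the feature being requested here, not something that exists today):

import json
import subprocess

# Launch main in the proposed machine-readable mode.
# Note: the --jsonl flag is hypothetical; it is the proposal above.
proc = subprocess.Popen(
    ["./main", "-m", "./models/7B/ggml-model-q4_0.bin", "--jsonl"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)

# Each request is a single newline-terminated JSON object written to stdin.
request = {"prompt": "Names for a pet pelican", "temp": 0.5, "repeat-penalty": 1.5, "n": 512}
proc.stdin.write(json.dumps(request) + "\n")
proc.stdin.flush()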
Then the results start streaming back via standard output. The stream looks like this:
{"content": "Here"}
{"content": " are"}
{"content": " some"}
{"content": " names"}
{"content": " for"}
...
{"end": true}
As you can see, each output token gets its own newline-delimited object, and a special {"end": true} indicates the end of the response.
(Alternative design: the last token in the stream could be {"content": " enjoy!", "end": true} with an extra "end" key.)
Additional information such as logit scores could be incorporated into these output objects as well, maybe controlled by extra options sent along with the prompt.
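On the reading side, a client would just consume stdout line by line until it sees the end marker - a minimal sketch, continuing the subprocess example above and assuming the proposed {"content": ...} / {"end": true} objects:

# Consume newline-delimited JSON objects from stdout until the end marker.
for line in proc.stdout:
    event = json.loads(line)
    if event.get("end"):
        break
    # Stream each token fragment straight through to the user.
    print(event.get("content", ""), end="", flush=True)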
Here's an example of the kind of thing I've built using this pattern in the past: https://simonwillison.net/2020/Nov/28/datasette-ripgrep/
That runs ripgrep in a subprocess - ripgrep provides a --json mode: https://github.com/simonw/datasette-ripgrep/blob/e3b2bb937380170ea729d9ae0abfb95a7002a70e/datasette_ripgrep/__init__.py#L9
You may be interested in the API improvements that are pending: #1570
...sure ... and stuck now ....
I think a stdin/stdout mode would still be useful even with an available HTTP API server.
I also want to be able to run prompts from other CLI tools - such as https://github.com/simonw/llm - without needing to start a web server running on an available port, run a prompt through it and then stop the server again afterwards.
Any JSON-over-stdin/stdout mode should use the same JSON design as the server does as much as possible, for consistency - e.g. to match this:
https://github.com/ggerganov/llama.cpp/blob/df2ecc942a824d5a11cdd6d3083915f28ab24628/examples/server/README.md?plain=1#L46-L48
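For reference, a call to the existing server's /completion endpoint looks roughly like this from Python - a sketch using only the standard library, assuming the default host/port and the prompt / n_predict / content fields that endpoint uses:

import json
import urllib.request

# POST a completion request to a running ./server instance (default address
# assumed; adjust if the server was started with different --host/--port).
payload = {"prompt": "Names for a pet pelican:", "n_predict": 64}
req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
print(result["content"])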
Yeah, sure. What we discovered is that the code is much simpler, actually, when it doesn't have to deal with reading user input from the terminal.
I don't know how hard it would be but maybe it's possible to run the server on a socket instead? Keeping the server running has the advantage that the model is kept in memory and it lowers the amount of initialization that needs to be done.
Maybe there could be a core that runs the JSON commands that both the server and main.cpp could use? For one thing, right now the sampling code is duplicated in both places.
For things like shell scripts it would be useful to have something simpler: no need to pass the prompt via stdin, and the full output returned in one go rather than streamed token by token.
Maybe something like this:
./main \
-m ./models/7B/ggml-model-q4_0.bin \
-p "Names for a pet pelican:" \
-n 512 \
--json
Which would output the response to stdout like this, and then exit with a success exit code:
{"response": "Here are some names for a..."}
The goal here would be to support simple scripts - for example bash scripts - that just want to run a prompt and process the result.
Outputting as JSON is particularly useful here as it makes it easy to combine with other tools such as jq.
Because of this, I think the interactive newline-delimited JSON mode I suggested earlier should be invoked using extra options - so --json can work for the simplest case.
Maybe --jsonl or --json --interactive.
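A script consuming that one-shot mode could then be as simple as this sketch (the --json flag and the "response" key are just the proposal above, not something that exists yet):

import json
import subprocess

# Run a single prompt in the proposed one-shot --json mode and parse the result.
result = subprocess.run(
    [
        "./main",
        "-m", "./models/7B/ggml-model-q4_0.bin",
        "-p", "Names for a pet pelican:",
        "-n", "512",
        "--json",
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(json.loads(result.stdout)["response"])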
Yeah, sure. What we discovered is that the code is much simpler, actually, when it doesn't have to deal with reading user input from the terminal.
That's the thing I like about newline-delimited JSON - it's the simplest possible form of interacting with a process, because each write is terminated by a newline character and any reads from the process can continue until a newline character is spotted too.
I don't know the state of it currently, but just piping stdout should work right now - it's not JSON, just plain text. There were some problems when you tried to add a stop keyword: in that case the program started running in interactive mode and took over the terminal. Maybe it's been fixed.
Yeah, I've tried writing code against the current unstructured output.
It can work, but the calling code ends up pretty messy. More importantly, since that format isn't a documented interface there are no guarantees at all that it won't change in the future in a way that would break my scripts.
Also discussed on Twitter here: https://twitter.com/simonw/status/1666382966048837632
I'd also love to see this. examples/chat-persistent.sh invokes main for individual completions (rather than as a single interactive process) and resorts to hacks to extract generated text and token counts, for want of something like this.
Lets me keep the model loaded in memory/on the GPU, such that I can send it a new prompt without paying a startup cost each time
mmap should avoid the model startup cost already (although I don't know how that interacts with the GPU)? Also for the prompt processing time on longer prompts, an option now is to use --prompt-cache/--prompt-cache-all, at the cost of disk space/io and additional bookkeeping. chat-persistent.sh uses this. At least on the M2 Max I can start generation from a warm prompt cache of arbitrary length with negligible delay.
GPU has more startup overhead, especially when loading the layers to VRAM. On my system rocBLAS takes a few seconds to start up as well, but this is still experimental.
It would be awesome to have support for multiple streams and/or logit masking via the protocol. I think a more standardised approach going beyond llama.cpp (e.g. HuggingFace, FastChat, etc.) would make a ton of sense. Happy to collaborate on this also, as we are working on very similar problems in LMQL.
I'd also love to see this.
examples/chat-persistent.sh invokes main for individual completions (rather than as a single interactive process) and resorts to hacks to extract generated text and token counts, for want of something like this.
Oh I'd missed that!
https://github.com/ggerganov/llama.cpp/blob/8fc8179919a11738910db07a800f2b176f8adf09/examples/chat-persistent.sh#L26-L27
Yeah, that's exactly the kind of hack I'd like to avoid by having a --json option.
I don't know how hard it would be but maybe it's possible to run the server on a socket instead?
Adding this line:
svr.set_address_family(AF_UNIX);
And launching the server like this:
./bin/server --host /tmp/llama.socket ...
Kind of worked:
curl --silent --unix-socket /tmp/llama.socket --url ./completion --data '{
"prompt": "### Instruction:\nWrite a simple story.\n\n### Response:\n",
"n_predict": 20
}' | jq '.content';
"Once upon a time, there was a little girl named Lily who lived in a small cott"
But I'm not sure if it would work on Windows.
This issue was closed because it has been inactive for 14 days since being marked as stale.