
Create JSON API service

Open wizd opened this issue 1 year ago • 8 comments

So we can integrate apps/UIs.

wizd avatar Mar 13 '23 10:03 wizd

Emulate the OpenAI text API, so that tons of existing apps could support llama without changes.
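To make the suggestion concrete, here is a hedged sketch of the request/response shape such a shim would emulate, based on the public OpenAI text-completions API at the time. The model name and completion text are illustrative, not real llama.cpp output.

```python
# Illustrative sketch (not llama.cpp code) of the minimal
# OpenAI-style completions contract an emulating server would accept
# and return. Field names follow the public OpenAI text API.
import json

request = {
    "model": "llama-7b",           # hypothetical model name
    "prompt": "Hello, my name is",
    "max_tokens": 16,
    "temperature": 0.8,
}

# A compatible server would answer with something shaped like this:
response = {
    "object": "text_completion",
    "choices": [{"text": " Alice and I like llamas.", "index": 0}],
}

# Existing OpenAI clients only need to read the completion text:
completion = response["choices"][0]["text"]
print(json.dumps(request))
```

An app already written against OpenAI would keep working as long as these fields are present, which is the appeal of emulating the API rather than inventing a new one.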

wizd avatar Mar 13 '23 19:03 wizd

+1 on this. People would love to have this in KoboldAI, but we have no good way of implementing it at the moment. We already have OpenAI support, so that approach would work; we also have a separate basic JSON API that just sends the desired values over JSON and handles the output string.

Whichever way works, but JSON over HTTP is going to be ideal for cross-language clients such as Python or (in-browser) JavaScript.

henk717 avatar Mar 14 '23 01:03 henk717

Sounds like the ideal structure would be to load the model into memory in interactive mode, listen for input on some port, wait for the initial prompt and reverse prompt, then post the JSON response back on that same connection. Because output arrives token by token, maybe a WebSocket implementation?

This seems like a viable option too: https://github.com/ggerganov/llama.cpp/issues/23#issuecomment-1465145365
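As a minimal sketch of the streaming idea described above, assuming a stub `generate_tokens()` in place of real llama.cpp inference, a plain stdlib HTTP handler can already stream one JSON object per token (newline-delimited JSON) so a client renders output word by word; a WebSocket version would follow the same loop.

```python
# Hypothetical sketch: generate_tokens() stands in for the model;
# the handler flushes each token as its own JSON line so the client
# can display output incrementally.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def generate_tokens(prompt):
    # Placeholder for real llama.cpp inference; yields words one at a time.
    for word in ("Hello", "from", "llama"):
        yield word

class StreamHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        self.send_response(200)
        self.send_header("Content-Type", "application/x-ndjson")
        self.end_headers()
        for token in generate_tokens(body.get("prompt", "")):
            # One JSON object per token, flushed immediately.
            self.wfile.write((json.dumps({"token": token}) + "\n").encode())
            self.wfile.flush()

# To serve:
# HTTPServer(("127.0.0.1", 8080), StreamHandler).serve_forever()
```

Chunked/NDJSON streaming keeps any HTTP client usable, while WebSockets add bidirectionality (e.g. for interrupting generation) at the cost of a heavier client.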

MLTQ avatar Mar 14 '23 20:03 MLTQ

WebSocket is an option, but would you be willing to pay whoever will host the backend?

i-am-neo avatar Mar 17 '23 15:03 i-am-neo

Hi @henk717 I've gone ahead and created https://github.com/LostRuins/llamacpp-for-kobold which emulates a KoboldAI HTTP server, allowing it to be used as a custom API endpoint from within Kobold.

I wrote my own Python ctypes bindings, and they require zero other dependencies (no Flask, no pybind11) except for llamalib.dll and Python itself. Windows binaries are included, but you can also rebuild the library from the makefile.
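The ctypes approach mentioned above follows a standard pattern; here is a hedged sketch of it. The real llamalib exports differ, so libc's `strlen` stands in to keep the example runnable with zero dependencies.

```python
# Sketch of the ctypes binding pattern: load a shared library, declare
# the C signature of an exported function, then call it directly.
# strlen is used here only as a runnable stand-in for llamalib exports.
import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declaring argtypes/restype lets ctypes convert arguments correctly,
# exactly as one would for functions exported by llamalib.dll.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"llama"))  # → 5
```

For llamalib.dll you would load that DLL instead and declare its actual exported entry points, which is why no Flask or pybind11 layer is needed.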

I also went ahead and added left square brackets to the banned tokens.

Unfortunately, it's not ideal due to a fundamental limitation in llama.cpp: generation delay scales linearly with prompt length, unlike with Hugging Face Transformers. See this discussion for details.
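A back-of-envelope illustration of that linear scaling, with made-up numbers: if each prompt token costs roughly a fixed evaluation time and tokens are processed sequentially, the delay before the first generated token grows in direct proportion to prompt length.

```python
# Illustrative only; t_eval is an assumed per-token cost, not a
# measured llama.cpp figure.
t_eval = 0.05  # seconds per prompt token (assumption)

def time_to_first_token(prompt_tokens: int, per_token: float = t_eval) -> float:
    # Linear model: every prompt token must be evaluated before
    # generation starts.
    return prompt_tokens * per_token

print(time_to_first_token(100))   # → 5.0
print(time_to_first_token(2000))  # → 100.0, 20x longer for a 20x prompt
```

This is why a chat frontend that resends the whole growing history each turn feels progressively slower without some form of prompt/state caching.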

LostRuins avatar Mar 18 '23 16:03 LostRuins

Hey guys, if anyone is looking for a working client/server implementation: I wrote a minimal realtime Go server and Python client with live inference streaming, built on top of this awesome repo. See https://github.com/avilum/llama-saas

avilum avatar Mar 19 '23 20:03 avilum

I have a proof of concept working with an existing web UI here:

https://github.com/oobabooga/text-generation-webui/pull/447

It is very unpolished, but getting somewhere.

thomasantony avatar Mar 20 '23 03:03 thomasantony

Hi there, I recently worked on C# bindings and a basic .NET Core project. There are two sample projects included (CLI and Web + API). It could easily be expanded with a more extensive JSON interface. Hope this is helpful.

https://github.com/dranger003/llama.cpp-dotnet

dranger003 avatar Apr 23 '23 22:04 dranger003