llama.cpp
Proof of concept TCP server mode
This builds on my other PR to implement a very simple TCP mode.
The new mode first loads the model, then listens for TCP connections on a port. When a connection is received, arguments are parsed using a simple protocol:
- First, the number of arguments is read, followed by a newline character.
- Then each argument is read, separated by the 0 byte.
- With this we build an argument vector, similar to what is passed to the program entry point.
- The resulting "argv" is passed to `gpt_params_parse`.

Finally, `llama_main` will be executed with the input/output streams connected to the socket.
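For illustration, here is a rough sketch (not code from this PR) of what a client could write to the socket if it follows the framing described above; details such as whether the final argument needs a trailing 0 byte may differ in the actual implementation:

```cpp
// Hypothetical client-side framing per the description above (error handling omitted).
#include <string>
#include <vector>
#include <unistd.h>  // write()

void send_args(int sockfd, const std::vector<std::string> & args) {
    // 1) number of arguments, followed by a newline
    const std::string header = std::to_string(args.size()) + "\n";
    write(sockfd, header.data(), header.size());

    // 2) each argument, followed by a 0 byte separator
    for (const std::string & arg : args) {
        write(sockfd, arg.c_str(), arg.size() + 1);  // +1 sends the terminating '\0'
    }

    // 3) from here on, the same socket carries the interactive input/output,
    //    since llama_main runs with its streams connected to this connection
}

// Example (hypothetical model path): send_args(fd, {"-m", "models/7B/ggml-model-q4_0.bin", "-i"});
```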
I've included two sample bash scripts which can be used to test the new mode. This is how it works:
- Run `./chat_tcp_server.sh` in a terminal.
- In a second terminal, run `./chat_tcp_client.sh`. This will connect to the server and start a sample chat session.
One thing to note is that this mode is only implemented for Unixes. There are two reasons for that:
- I have never written Win32 TCP code, so I'm not familiar with the API.
- There's really no advantage in implementing this for Win32 because it doesn't support `fork()`. The main advantage of using this mode is that it serves each connection in a separate process which inherits memory from the parent (so the model only has to be loaded once).
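For context, a fork-per-connection server generally looks like the sketch below; this is a generic illustration of the approach rather than the code in this branch (signal handling and child reaping are omitted):

```cpp
#include <cstdio>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

// The parent loads the model once before calling this; each accepted connection
// is served by a child process that inherits the loaded model via fork().
void serve_forever(int listen_fd) {
    for (;;) {
        int conn_fd = accept(listen_fd, nullptr, nullptr);
        if (conn_fd < 0) {
            continue;  // ignore transient accept errors in this sketch
        }
        pid_t pid = fork();
        if (pid == 0) {
            // child: wrap the socket in stdio streams, parse the argument protocol,
            // then run the chat loop on (in, out) instead of (stdin, stdout)
            FILE * in  = fdopen(conn_fd, "r");
            FILE * out = fdopen(dup(conn_fd), "w");
            // ... run the session here ...
            fclose(in);
            fclose(out);
            _exit(0);
        }
        close(conn_fd);  // parent: the child owns the connection now
    }
}
```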
While the protocol is a bit "low level", it should be easy to write a higher level API on top of this, such as a node.js web server or next.js app.
Shall we close #267 now that we have this?
We can, this already includes all changes from #267. The only reason I did it in a separate PR was to simplify reviewing.
Rebased
If anyone is looking for a working client/server implementation, I wrote a minimal Go/Python server and client with live streaming, based on this awesome repo. See https://github.com/avilum/llama-saas
@avilum One problem with your implementation is that the service spawns a new llama.cpp instance for every HTTP request, so it can take a long time to respond (the model has to be loaded every time).
I suggest you try using this branch with a different approach: start llama.cpp once, and on every HTTP request create a TCP connection to the singleton instance. This will give you a clean environment for every request without the overhead of reloading the model.
For the sake of the POC I ran the process for every prompt. I most definitely agree - I am working at the moment on loading the DLL only once (main exe), and once this is merged, I might even send the data over this TCP socket. I'll load the model and then feed it with inputs, from the Go code, over any kind of socket / IPC.
I disagree with this change. This is a large rearchitecting of the project that fundamentally changes its vision. No one will run your daemon. What value does having a TCP server mode offer, aside from fixing loading time? The issue of loading time is solved by #91 which we've implemented in the mmap branch: https://github.com/ggerganov/llama.cpp/tree/mmap It will be much easier to support win32 using mmap than it would be to support winsock.
Why is there even a need to spawn multiple processes instead of multiple threads? Threads are already built in, unlike sockets or mmap. The only truly portable thing is standard I/O, which can be redirected and easily communicated with using simple file streams that are supported by everything. Instead of changing the main implementation much at all, you could just build any extra functionality outside the main implementation as modules and communicate using these simple file streams. The main llama input would not need any sort of 'protocol' but would just listen for '\n' or EOF like it currently does, and the modules could follow that paradigm while communicating through the file streams. Am I missing something here?
Again, if more processes and the ability to share state between them is what's wanted, a more general approach would be making a C-style API with something simple like `struct state {...}`, `save_state(*state)`, `load_state(*state)`. Then any implementation could just live as a separate module and use those general functions to manipulate the state however it wishes, and this would keep the main program clean of any non-portable code.
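Something along these lines, perhaps (a hypothetical sketch of the suggestion; none of these names exist in the code base):

```cpp
// Hypothetical C-style state API (illustrative only).
struct state {
    // model handle, context / KV cache, RNG seed, sampling parameters, ...
};

// Persist the current inference state so another module or process can resume it.
int save_state(const struct state * s);

// Restore a previously saved inference state into the current process.
int load_state(struct state * s);

// A TCP module, a pipe-based module, etc. could then be built separately on top
// of these calls, keeping non-portable code out of the main program.
```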
@jart
I disagree with this change. This is a large rearchitecting of the project that fundamentally changes its vision.
That is true. It is better to keep the scope focused and make sure llama.cpp is as stable as possible.
@tarruda
While the protocol is a bit "low level", it should be easy to write a higher level API on top of this, such as a node.js web server or next.js app.
We can bundle llama with node.js in its current form. There is a library called caxa: https://github.com/leafac/caxa
It bundles any nodejs app into a single executable. As a side effect of the way it is architected, it unzips any executables from the dist folder at runtime.
So we can just place a compiled llama version in there and bundle it together. Then we can build a REST API in node that can be easily called from nextjs.
I also found a way that turns an openapi.yml file into the single source of truth for the nodejs routes. The result is that we get an interactive docs page for free that spits out curl commands to interact with the API.
Here is the repo that combines caxa and this technique to make a single-binary RPC daemon: https://github.com/spirobel/monerochan-merchant-rpc
If there is interest, I can make something like this for llama.cpp.
If we make a one-click template for a provider like DigitalOcean, people could spin up their own on-demand instances that work just like the OpenAI API.
This is a large rearchitecting of the project that fundamentally changes its vision
@jart @spirobel there's no rearchitecting here; in fact, my goal was to introduce a server/client model with minimal changes. If you don't pass the `-l` option, the app works exactly as it does now. This PR might seem like it is changing a lot, but if you review the commits individually, you will see there's not much change to existing logic.
The way this was implemented was basically by moving all the code (except for the model loading code) from the `main` function to another reusable function called `llama_main`, then replacing the direct std stream references with parameters passed to it. So the same code runs unmodified on stdin/stdout/stderr or on a TCP connection.
What value does having a TCP server mode offer, aside from fixing loading time?
There are more unexplored applications of this TCP server mode. Here are a few ideas:
- Wrap into an http/json/rest/websocket server for using in a web application. Eventually I would like to write an API that is compatible with ChatGPT's API.
- Do any kind of processing (such as preload the prompt) prior to accepting connections/requests. This would allow you to essentially respond to queries more quickly by preloading a chatbot prompt, then for each request just process user input.
- Server/client usage: you can run llama.cpp on a more powerful computer and share it with other devices on the same LAN.
@ggerganov also thinks this is a good idea: https://github.com/ggerganov/llama.cpp/pull/267#issuecomment-1474916599
I don't think this PR is mutually exclusive with the work you are doing, it is still useful to load the model faster on new process startups.
@tarruda
Do any kind of processing (such as preload the prompt) prior to accepting connections/requests. This would allow you to essentially respond to queries more quickly by preloading a chatbot prompt, then for each request just process user input.
Responsiveness and proper concurrency are always good! 😀👍 Nothing more frustrating than lagging or hung-up programs.
That being said,
Wrap into an http/json/rest/websocket server for using in a web application. Eventually I would like to write an API that is compatible with ChatGPT's API.
please take a look at the single-binary RPC that I built! I am happy to answer any questions! I genuinely believe it is better to implement this in nodejs instead of doing it in cpp. Monero also has a cpp REST RPC daemon and it is not fun to work with. It always hangs up or becomes unresponsive for long stretches. Also, the documentation is always out of date. Using openapi (not openai 😅) as a single source of truth for routing and documentation solves this issue permanently.
@anzz1 I don't see how it could work using threads. There's only one instance of the model in memory, and AFAIK the `ggml` API is not thread-safe and does not support concurrent usage (@ggerganov please correct me if I'm wrong).
implement this in nodejs instead of doing it in cpp
@spirobel If you want to implement a server mode in another program/language such as node.js and without changes to llama.cpp, there are two ways I see you can go about it:
- Spawn a single instance of llama.cpp at server startup, and synchronize access to stdin/stdout between all requests. This might work well for a single user, but if the llama.cpp process crashes, you would have to restart it. Not to mention you would not be able to customize parameters per connection/request.
- Spawn a new llama.cpp instance for each request (this is what @avilum did), which would be very slow with the way the model is loaded right now (it also requires duplicating the model in memory). I'm not familiar with the `mmap` loading solution, but assuming it makes loading the model instant, would it be able to preload the prompt and/or customize the prompt/seed per request? Preloading the prompt is not implemented right now, but it is something that can be done in the server mode I implemented here.
I agree that higher-level abstractions are better done in platforms like node.js or python, but in this case I don't think it would be possible to implement a server purely in node.js and have the same efficiency as a fork-per-connection server approach.
Now, here is the last paragraph of the PR description:
While the protocol is a bit "low level", it should be easy to write a higher level API on top of this, such as a node.js web server or next.js app.
As you can see, my goal with this PR was to provide a base server/protocol that can be wrapped in a higher-level API. In fact, I implemented the parameter-passing protocol in a way that allows reusing the existing `gpt_params_parse` function, so any new parameters added are automatically supported. The server mode is basically a "zygote" for quickly spawning new instances of llama.cpp with custom parameters (except for `n_ctx`, which has to be passed to `ggml_init`).
Technically I could have implemented a higher level abstraction such as a simple HTTP endpoint that parses json messages, but it would require me to add at least a JSON parsing library, which goes against the goals of the project ("no dependencies"). I also think we would lose some flexibility by making assumptions about the format of the API, better to have a lower level implementation that can be tailored for various purposes.
Since this server mode is meant to be wrapped in a higher level abstraction, it might be better to implement it using Unix sockets instead of TCP sockets, which I might do later after getting some feedback. This is still an experiment/POC.
@tarruda Please check this comment: https://github.com/ggerganov/llama.cpp/issues/23#issuecomment-1477080912 If the TCP server were refactored as a module that used a generic C-style way of sharing the preloaded model inside the memory of a single process, and used threads for multi-serve capability instead of fork(), I could help port this to winsock.
I’m personally uncomfortable with this because I don’t believe new C code should be exposed directly to the internet, especially when we won’t be using security validated code to handle the network requests for us. I think this project should focus on offering a robust C-based API and then leave tasks like a HTTP API to inherently safer languages.
@j-f1 Yeah, one solution to address your concern is to create a C-based API that can be used internally within the system, while using a different language such as Python/Node.js to handle external network requests.
I’m personally uncomfortable with this because I don’t believe new C code should be exposed directly to the internet
This TCP mode is not meant to be used directly; in my previous comments I've hinted that I created this as a lower-level protocol meant to be wrapped in a higher-level solution, possibly written in Node.js or Python.
Right now a loopback address is hardcoded, so if you want to use this TCP mode over a network, it is necessary to wrap it in a proxy such as `socat`.
It looks like this PR refactors the current code base too much. That's not a big problem if the changes are urgent, but that is arguable.
- First of all, there are too many things to do in this hot repo; various people want to add/fix something. More than a thousand forks have been created. The llama-rs project copied ggml as a library.
- If a PR changes the code base a lot, it may create unnecessary conflicts with pending pull requests. Perhaps it's better to break down a somewhat big PR into smaller ones, and avoid deleting/renaming existing core files.
- Perhaps it's not easy to write a production-level chat server in C++.
So, I recommend collecting more feedback before merging this PR into mainline, and doing some formal design. Please let me list several possible APIs for constructing a chat server:
- API to init the inference engine, which returns an engine object. Explicit server config is required, which can later be loaded from a config file.
- APIs to start/close/resume a conversation session, one-shot or interactive. Conceptually, an engine object manages sessions.
- API(s) for conversations. Conceptually, a session object manages talks.
- Other possible APIs, for example: server metrics.
If we could write these APIs (in C), it would be possible to build chat servers in almost any popular programming language, with protocols like HTTP, gRPC, or WebSocket. Before that, we could design and write C++ APIs on top of the current code base.
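For concreteness, such an API could be shaped roughly like the sketch below; every name here is hypothetical and merely illustrates the list above:

```cpp
// Hypothetical C API sketch for a chat server (all names illustrative).

// Engine: initialized once from explicit config, owns sessions.
typedef struct chat_engine chat_engine;
chat_engine * chat_engine_init(const char * config_path);
void          chat_engine_free(chat_engine * engine);

// Sessions: one-shot or interactive conversations managed by an engine.
typedef struct chat_session chat_session;
chat_session * chat_session_start(chat_engine * engine, const char * initial_prompt);
void           chat_session_close(chat_session * session);

// Conversation: feed user input, receive generated tokens through a callback.
int chat_session_send(chat_session * session, const char * user_input,
                      void (*on_token)(const char * token, void * user_data),
                      void * user_data);

// Other possible APIs, for example server metrics.
int chat_engine_metrics(const chat_engine * engine, char * buf, int buf_len);
```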
FYI, best regards :)
It looks like this PR refactors the current code base too much.
If you consider replacing global references (stdin/stdout/stderr) with function parameters "too much refactoring", then yes. Really, if you review the commits individually, you will see the changes to existing code are simple and actually an improvement even if the TCP server module is not merged. I had created a prior PR, #267, with just these base changes, because I considered them worthwhile in isolation.
So, I recommend collecting more feedback before merging this PR into mainline, and doing some formal design.
No one is in a rush to merge this. I split the steps into separate commits, and it is very easy for me to keep rebasing, which is what I will do.
Please let me list several possible APIs for constructing a chat server:
I appreciate the suggestion, but this is outside the scope of what I'm willing to do. I wanted to introduce networking capabilities with minimal changes to the existing code or architecture. If someone wants to make these more elaborate changes, they are free to do so in a separate PR; I will happily close this one if there's a better implementation.
Redid the commits on top of the latest C API changes. Now that the C API is implemented in llama.cpp, I've moved the program's main loop to run.cpp.
Seems like the resulting additions/removals are smaller now.
I would like this to become a standalone example in the "./examples" folder. The main.cpp example has to remain the way it is on master. Even if you have to duplicate the code from main in the tcp_server example - that is OK.
@ggerganov I'm not sure if I understand. Do you want me to copy all the code in main.cpp to tcp_server.cpp and have it become a standalone program?
I have some comments as a result of actually trying to wrap this with a node client last night:
1. I think the host bind address should be an option, because:
   - the service may be running on a private network on a distinct host from the HTTP proxy
   - the service may be running in a container (without host networking) and therefore would have its own virtual network interface, which would make this service unreachable without unnecessarily colocating a proxy inside the container

   I added a host option here.
2. There is no clear signal to the client when user input is pending. You could rely on the ANSI color codes, but only in color mode, and that seems a bit flaky.
3. There is no way for the client to reset the conversation without causing unnecessary round trips (TCP handshake).
4. There is no method for the client to stop ongoing processing without disconnecting the socket and losing the conversation state.
5. There is no clear signal to the client of how many tokens are remaining.
6. There is no reliable indication to the client of which model file is loaded, or other model metadata -- possibly some can be gleaned from debugging log messages, but this is not reliable enough for an integration.
I remediated 2, 5, and parts of 6 here. I did this by replacing the raw console output with a plaintext, line-based (IRC-like/SMTP-like) protocol: one message per line (with control characters escaped), with a message-type keyword as the first word on each line. I implemented the following keywords:
- `HELO`: Sent to the client upon initial connect.
- `FATAL (.+)`: For any unrecoverable error with a message.
- `DEBUG (.+)`: For human-readable debug messages.
- `OUTPUT (.+)`: For model outputs, sent as they become available.
- `PROMPT (.+)`: When the model begins responding to a prompt, the prompt received is echoed back to the client in this line.
- `KV ([\w\d_-]+)=(.+)`: For sending named key-value pairs to the client, for example `interactive_mode`, `remaining_tokens`, `seed`, `awaiting_prompt`, etc.
I didn't touch the input protocol used to start the model, or remediate all of the issues, because frankly my C++ isn't good enough :joy:. I don't know how to manipulate the input stream like that -- I think it might need to be handled in a separate thread and then sent to the thread executing llama_main, but that's beyond the time I have to invest in a side project, so I just did what I could.
Sample Output
>> HELO
>> KV seed=1679516412
>> PROMPT Transcript of a dialog, where the user interacts with an assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User\'s requests immediately and with precision.\n\nUser:Hello, Bob.\nBob:Hello. How may I help you today?\nUser:
>> KV prompt_tokens=68
>> DEBUG 1 -> ''
>> DEBUG 4103 -> ' Trans'
>> DEBUG 924 -> 'cript'
>> DEBUG 310 -> ' of'
>> DEBUG 263 -> ' a'
>> DEBUG 7928 -> ' dialog'
>> DEBUG 29892 -> ','
>> DEBUG 988 -> ' where'
>> DEBUG 278 -> ' the'
>> DEBUG 1404 -> ' user'
>> DEBUG 16254 -> ' interact'
>> DEBUG 29879 -> 's'
>> DEBUG 411 -> ' with'
>> DEBUG 385 -> ' an'
>> DEBUG 20255 -> ' assistant'
>> DEBUG 4257 -> ' named'
>> DEBUG 7991 -> ' Bob'
>> DEBUG 29889 -> '.'
>> DEBUG 7991 -> ' Bob'
>> DEBUG 338 -> ' is'
>> DEBUG 8444 -> ' helpful'
>> DEBUG 29892 -> ','
>> DEBUG 2924 -> ' kind'
>> DEBUG 29892 -> ','
>> DEBUG 15993 -> ' honest'
>> DEBUG 29892 -> ','
>> DEBUG 1781 -> ' good'
>> DEBUG 472 -> ' at'
>> DEBUG 5007 -> ' writing'
>> DEBUG 29892 -> ','
>> DEBUG 322 -> ' and'
>> DEBUG 2360 -> ' never'
>> DEBUG 8465 -> ' fails'
>> DEBUG 304 -> ' to'
>> DEBUG 1234 -> ' answer'
>> DEBUG 278 -> ' the'
>> DEBUG 4911 -> ' User'
>> DEBUG 29915 -> '\''
>> DEBUG 29879 -> 's'
>> DEBUG 7274 -> ' requests'
>> DEBUG 7389 -> ' immediately'
>> DEBUG 322 -> ' and'
>> DEBUG 411 -> ' with'
>> DEBUG 16716 -> ' precision'
>> DEBUG 29889 -> '.'
>> DEBUG 13 -> '\n'
>> DEBUG 13 -> '\n'
>> DEBUG 2659 -> 'User'
>> DEBUG 29901 -> ':'
>> DEBUG 10994 -> 'Hello'
>> DEBUG 29892 -> ','
>> DEBUG 7991 -> ' Bob'
>> DEBUG 29889 -> '.'
>> DEBUG 13 -> '\n'
>> DEBUG 29362 -> 'Bob'
>> DEBUG 29901 -> ':'
>> DEBUG 10994 -> 'Hello'
>> DEBUG 29889 -> '.'
>> DEBUG 1128 -> ' How'
>> DEBUG 1122 -> ' may'
>> DEBUG 306 -> ' I'
>> DEBUG 1371 -> ' help'
>> DEBUG 366 -> ' you'
>> DEBUG 9826 -> ' today'
>> DEBUG 29973 -> '?'
>> DEBUG 13 -> '\n'
>> DEBUG 2659 -> 'User'
>> DEBUG 29901 -> ':'
>> KV interactive_mode=true
>> KV reverse_prompt="User:"
>> KV temp=0.700000
>> KV top_k=40
>> KV top_p=0.500000
>> KV repeat_last_n=500
>> KV repeat_penalty=1.200000
>> OUTPUT Transcript of a dialog, where
>> OUTPUT the user interacts with an assistant named
>> OUTPUT Bob. Bob is helpful, kind,
>> OUTPUT honest, good at writing, and never
>> OUTPUT fails to answer the User\'s requests
>> OUTPUT immediately and with precision.\n\nUser
>> OUTPUT :Hello, Bob.\nBob:
>> OUTPUT Hello. How may I help you today
>> OUTPUT ?\nUser:
>> KV awaiting_prompt=true
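A client for this protocol only needs to split each line on the first space and dispatch on the keyword; a minimal sketch (not code from the linked branch) could look like this:

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Minimal illustrative dispatcher for the line-based protocol described above.
void handle_line(const std::string & line) {
    std::istringstream iss(line);
    std::string keyword;
    iss >> keyword;                   // first word is the message type

    std::string rest;
    std::getline(iss, rest);          // remainder of the line is the payload
    if (!rest.empty() && rest.front() == ' ') rest.erase(0, 1);

    if (keyword == "OUTPUT") {
        std::cout << rest;            // stream model output to the user
    } else if (keyword == "KV") {
        // e.g. "remaining_tokens=42": split on '=' and update client-side state
    } else if (keyword == "FATAL") {
        std::cerr << "server error: " << rest << "\n";
    }
    // HELO, DEBUG and PROMPT would be handled similarly
}
```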
@ggerganov These are the changes I made to `main`:
- Moved the main I/O loop out of the `main()` function into a `run()` function which can be reused.
- In the `run()` function, changed `fprintf(stderr, ...` to `fprintf(errstream, ...` where `errstream` is a `FILE *` parameter (`main()` passes `stderr`).
- In the `run()` function, changed `printf(...` to `fprintf(outstream, ...` where `outstream` is a `FILE *` parameter (`main()` passes `stdout`).
- In the `run()` function, changed `std::getline(std::cin, ...` to `std::getline(instream, ...` where `instream` is a `std::istream` parameter (`main()` passes `std::cin`).
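Put together, the resulting signature presumably looks roughly like the sketch below (this only restates the list above; the exact parameters in the branch may differ):

```cpp
#include <cstdio>
#include <iostream>
#include "utils.h"  // defines gpt_params (header name may differ after restructuring)

// Sketch: the reusable main loop takes explicit streams instead of globals, so
// main() passes std::cin/stdout/stderr while the TCP server passes streams
// backed by an accepted socket. Other parameters (model, vocab, ...) omitted.
int run(gpt_params params,
        std::istream & instream,   // was std::cin
        FILE *         outstream,  // was stdout (printf)
        FILE *         errstream); // was stderr (fprintf(stderr, ...))

int main(int argc, char ** argv) {
    gpt_params params;
    // ... gpt_params_parse(...), model loading, etc. ...
    return run(params, std::cin, stdout, stderr);
}
```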
The main.cpp example has to remain the way it is on master.
Do you mean I should revert the changes listed above? These changes are mandatory for the implementation of the tcp server.
@tarruda
@ggerganov I'm not sure if I understand. Do you want me to copy all the code in main.cpp to tcp_server.cpp and have it become a standalone program?
Yes, `main.cpp` is the example that everybody will run first when they clone the project. It has to be straightforward, demonstrating basic usage of `llama.cpp`. The `run()` abstraction is not necessary here.
The run() abstraction is not necessary here.
The `run()` abstraction is necessary if we want to share the main loop with the TCP server; it is not practical for me to copy all the code in `main` and keep duplicating changes back to a separate program whenever the main loop is updated. I will maintain this in my own private fork, then.
@ggerganov I feel that we are losing a lot here - I, for one, would love to be able to use @tarruda's fork with the API. Is there a way to add the API to this project?
@tkafka I'm still maintaining these changes in my fork, and will keep rebasing for the foreseeable future (I might even set up a script to do this semi-automatically on a daily basis). Here's the code: https://github.com/tarruda/llama.cpp/tree/tcp_server (just rebased it).
Looks like the source files will be re-structured sooner or later. https://github.com/ggerganov/llama.cpp/issues/384#issuecomment-1480276524
@mqy since I started this PR, the files have been restructured multiple times. I will just keep updating the main example to support TCP until there's a better native solution that doesn't rely on copying and pasting code.
I would have been fine with tcp_server being a separate program from main, as long as the main loop was in a reusable module (which is what I've done in `run.cpp`). Until there's a better option, I will just keep rebasing the main example.
Hi, I use your TCP fork and it's working very well for my use case. This is a very important feature that should be merged, IMO.