Refactor most code in main.cpp into a separate module (preparing to implement TCP mode)
The goal of this refactor is to allow reusing the model execution code while interacting over streams other than stdin/stdout.
In my case, I'd like to implement a simple TCP server (enabled via a command-line option) that runs llama_main for each new connection, with each connection handled in a child process via fork() (a minimal sketch of this loop follows the list below). This would bring a few benefits:
- Loading model weights can be very slow, so in TCP mode we can load them once before listening. Each new connection is handled in a forked process, which inherits the parent's memory (and therefore doesn't have to reload the model).
- We can quickly start a new context by opening a new TCP socket. New connections will also be able to specify parameters such as the seed and prompt.
- It becomes easier to wrap this into a REST/HTTP server
- It can be more convenient on a LAN where a powerful machine acts as the model server.
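
To make the intended design concrete, here is a minimal sketch of the fork()-per-connection loop I have in mind. It is only an illustration: the `llama_load_model` helper and the `llama_main` call shown in comments are assumptions about the eventual API, not code from this PR, and the port number is arbitrary.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <csignal>
#include <cstdio>

int main() {
    // Load the model once, before listen(): the slow step happens here, and
    // every forked child inherits the weights via copy-on-write.
    // llama_model model = llama_load_model("ggml-model-f16.bin"); // assumed helper

    int srv = socket(AF_INET, SOCK_STREAM, 0);
    int yes = 1;
    setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));

    sockaddr_in addr = {};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(8080);
    if (bind(srv, (sockaddr *) &addr, sizeof(addr)) != 0 || listen(srv, 16) != 0) {
        perror("bind/listen");
        return 1;
    }

    signal(SIGCHLD, SIG_IGN); // don't leave zombie children around

    for (;;) {
        int conn = accept(srv, nullptr, nullptr);
        if (conn < 0) {
            continue;
        }
        pid_t pid = fork();
        if (pid == 0) {
            // child: the socket becomes this session's input/output streams
            close(srv);
            FILE *in  = fdopen(conn, "r");
            FILE *out = fdopen(dup(conn), "w");
            // llama_main(params, model, in, out, stderr); // assumed signature
            fclose(in);
            fclose(out);
            _exit(0);
        }
        close(conn); // parent: the child owns the fd now, keep accepting
    }
}
```

With something like this running, starting a fresh context is just opening a new connection, e.g. `nc localhost 8080` from another machine on the LAN.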
If this PR is accepted, I will follow up with a PR that implements the TCP server command-line option.
This PR is simpler to review than it appears. Just look at the commits individually (most of the additions/deletions happen in the first commit, where main.cpp is simply renamed to llama.cpp).
How does this PR tie into the current active refactor here: #77?
I was not aware of that PR; I should have searched for it first. The only reason I created this PR is that I had a clear vision of how to implement a TCP server mode in llama.cpp. Honestly not sure what to do, should I close this PR?
Not my call, but you could review the other PR with your insight :)
I had a quick look and it seems that the goal in #77 is to make llama.cpp embeddable as a library, which requires modifying/refactoring more than what I do here.
This PR has no such goals and makes almost no changes to existing code. It can be summarized as:
- Most of the code in main.cpp was moved to llama.cpp. No existing functions were split; I only created llama_main, which contains most of the code of the old main function.
- llama_main now accepts the following as arguments (a rough sketch of the resulting signature follows this list):
- parsed parameters
- preloaded model
- input/output/error streams, which are used instead of the hardcoded stdin/stdout/stderr
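
For illustration only, the resulting signature might look roughly like this; the parameter names and types below are my shorthand, not copied from the diff:

```cpp
// Hypothetical sketch of the refactored entry point (illustrative names/types):
int llama_main(
    gpt_params  params,     // parameters already parsed from argv
    gpt_vocab   vocab,      // preloaded vocabulary
    llama_model model,      // preloaded weights, loaded once by the caller
    FILE *      instream,   // replaces the hardcoded stdin
    FILE *      outstream,  // replaces the hardcoded stdout
    FILE *      errstream); // replaces the hardcoded stderr
```

Because the caller supplies the streams, a TCP server (or any other front end) can pass socket-backed FILE handles instead of the process's standard streams.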
@tarruda Adding a TCP server would be awesome! Please keep doing this - for now, do it on a branch in this repo as you find best. Just invited you as a collaborator. I will review #77 very soon and merge it first. After that, we will update your changes to fit the C-style API
Closing in favor of #278