
Server Example Refactor and Improvements

Open digiwombat opened this issue 2 years ago • 103 comments

server.cpp left out a few generation parameters and also seems built to assume an un-editable context with no regens or swipes. As written, it appends the generated tokens to the end with no easy way for calling frontends to ask it to reprocess them.

Since there are relatively complex context changes happening in a lot of frontends, I added a simple "reload_ctx" flag that can be passed on generation that will cause the prompt to be reloaded. There's likely a smarter way to handle this (not caching the generated tokens or something), but this was quick and easy and will allow for easy usage with things like popular frontends for the time being.

digiwombat avatar May 23 '23 13:05 digiwombat

Hello, clearing emb_inp doesn't reset the llama context. Remnants of the previous context may remain in the kv_cache, so resetting the context this way may not yield the best results.

FSSRepo avatar May 23 '23 20:05 FSSRepo

Hello, clearing emb_inp doesn't reset the llama context. Remnants of the previous context may remain in the kv_cache, so resetting the context this way may not yield the best results.

My testing with regens and continuing the conversation as normal yielded the expected results and seemed to lead to prompt reprocessing. Admittedly it's a hacky way to handle it, but I'm not familiar enough with the underlying llama.cpp codebase to make strong decisions about clearing things out. I'm happy to work in any suggestions for a more robust, proper way to do it.

My main goal here was a quick fix, since the current setup doesn't work well (or at all beyond one gen) with frontends that allow prompt modifications via prompt-head changes, swipes, regenerating, or in-place context editing.

digiwombat avatar May 23 '23 20:05 digiwombat

Hello, clearing emb_inp doesn't reset the llama context. Remnants of the previous context may remain in the kv_cache, so resetting the context this way may not yield the best results.

KV cache does not have to be cleared, it will be overwritten by the next context.
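
Roughly what I mean, as a sketch against the current C API (the helper and the batching loop here are illustrative, not actual server.cpp code): re-evaluating the new prompt from position 0 simply writes over the stale cache entries.

#include <algorithm>
#include <vector>
#include "llama.h"

// Re-run the whole prompt with n_past starting at 0; the old KV cache contents for
// those positions are replaced as a side effect of llama_eval.
static void reeval_prompt(llama_context * ctx, const std::vector<llama_token> & tokens,
                          int n_batch, int n_threads) {
    int n_past = 0;
    for (size_t i = 0; i < tokens.size(); i += n_batch) {
        const int n_eval = std::min((int) (tokens.size() - i), n_batch);
        llama_eval(ctx, tokens.data() + i, n_eval, n_past, n_threads);
        n_past += n_eval;
    }
}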

SlyEcho avatar May 23 '23 21:05 SlyEcho

If anyone has any LoRAs handy, I would appreciate making sure the LoRAs load as expected. I copied over the flags from common.cpp, but I don't have any LoRAs on hand to test. If they're an issue, I can just revert.

digiwombat avatar May 25 '23 19:05 digiwombat

What is interactive: true supposed to do?

SlyEcho avatar May 27 '23 19:05 SlyEcho

I genuinely have no idea. I just hacked together reloading the context and called it a day. I'll investigate here in a bit. I'm also implementing some improvements that got pointed out over on the simple-proxy-for-tavern issue so I'll commit those when I'm done.

digiwombat avatar May 27 '23 19:05 digiwombat

I honestly think that a new reload_ctx parameter is not needed and this should just be the default behavior. If it is desired to reuse the old context, then the API user can send the same starting text again; loadPrompt() is, I guess, trying to look for that, although it seems broken.

SlyEcho avatar May 27 '23 20:05 SlyEcho

What is interactive: true supposed to do?

That option allows for interactions with the model; it stops generating text when it encounters a stop word, whereas when it is disabled, text generation is limited to n_predict.
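
Roughly, in pseudocode terms (a sketch of the described behavior only, not the actual server.cpp logic):

#include <string>
#include <vector>

// Sketch: with interactive enabled, generation halts as soon as a stop word shows up
// in the generated text; with it disabled, only the n_predict budget applies.
static bool keep_generating(bool interactive, int n_generated, int n_predict,
                            const std::string & generated,
                            const std::vector<std::string> & stop_words) {
    if (interactive) {
        for (const auto & w : stop_words) {
            if (generated.find(w) != std::string::npos) {
                return false;
            }
        }
        return true;
    }
    return n_generated < n_predict;
}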

FSSRepo avatar May 27 '23 21:05 FSSRepo

Okay, context resetting is now automatic any time the context doesn't match exactly. It still runs through the normal check that was already there, as I'm trying to be as non-destructive as possible; it's possible the way that function interacts with embeddings still works as expected. The benefit here is that same-context gens should start up instantly. It'd be nice to be able to do more complex cache comparisons, but that's beyond the scope for me, especially since I don't have much time to dig into it.
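
The comparison itself is nothing fancy; something along these lines (a simplified sketch with illustrative names, not the exact code I committed):

#include <vector>
#include "llama.h"

// Count how many leading tokens the new prompt shares with what is already loaded.
// If the match covers everything cached, generation can continue instantly;
// otherwise the context gets reset and the prompt is reprocessed.
static size_t common_prefix(const std::vector<llama_token> & cached,
                            const std::vector<llama_token> & prompt) {
    size_t i = 0;
    while (i < cached.size() && i < prompt.size() && cached[i] == prompt[i]) {
        i++;
    }
    return i;
}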

It works as expected for me in Silly Tavern with the simple-proxy for regens and new messages.

There were some low-hanging fruit bits I grabbed from this excellent issue, but I didn't change everything, as I'm not familiar enough with the code to say for sure that those changes are how things should work, and I am, sadly, short on time for iterating and testing. I know the stopping token ends up in the blocked words list, so some of the things mentioned might not be issues. I just don't have time to look into them. Maybe someone else can take a look.

digiwombat avatar May 27 '23 23:05 digiwombat

I also hacked a lot and deleted a lot as well. Basically, the stop keyword search does not have to be done at the token level; it is actually better and easier to do it on the generated string.
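
For example, the string-level check can be as simple as this (a sketch only, not exactly what is in my branch):

#include <string>
#include <vector>

// Return the position of the earliest stop string in the generated text,
// or std::string::npos if none matched yet and generation should continue.
static size_t find_stop(const std::string & generated,
                        const std::vector<std::string> & stop_words) {
    size_t stop_pos = std::string::npos;
    for (const auto & word : stop_words) {
        const size_t pos = generated.find(word);
        if (pos != std::string::npos && (stop_pos == std::string::npos || pos < stop_pos)) {
            stop_pos = pos;   // earliest match wins
        }
    }
    return stop_pos;
}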

Deleted:

  • exclude: text processing can be done by the client with any kind of method, Regex, whatever.
  • interactive: just too much unnecessary bloat from main.cpp terminal handling code

https://github.com/SlyEcho/llama.cpp/blob/server_refactor/examples/server/server.cpp

I'll try to rebase it on top of yours tomorrow morning.

SlyEcho avatar May 28 '23 00:05 SlyEcho

Please do. It's definitely much cleaner now with your refactor, so just launch all my stuff into space as needed. I was mostly just getting it working locally, since the server as originally committed wasn't working with the proxy; anon998 was kind enough to slap support into his project, and I wanted to be able to give it a fair shake.

Appreciate all the work from everybody. The API server is a big win for generation speeds in frontends.

digiwombat avatar May 28 '23 00:05 digiwombat

@digiwombat, I rebased my changes on top of yours, you can look them over and test.

I am gonna try to write some frontend code to test it and to serve as an example, maybe.

There are a couple things that need changing still:

  • [x] RNG seed should be a JSON parameter, not a server CLI argument; if unset, use the same logic as before.
  • [x] threads should be a server CLI argument, not JSON; you shouldn't be able to DoS the API with 100000 threads.
  • [x] batch_size also the same; it affects only how the server hardware is used.
  • [x] Need to figure out how to set the server timeouts, especially in Debug mode.
  • [x] Timeouts should be configurable from server CLI
  • [x] Timeouts should stop the llama_eval process (maybe they already do, have to check)
  • [x] Figure out a better way to handle invalid UTF-8 than an error.
  • [x] Returned JSON should have more information:
    • [x] how many tokens were generated
    • [x] input prompt
    • [x] info about the truncation that occurred
    • [x] all the sampling parameters used (seed as well)
    • [x] the exact stop keyword that was triggered

Nice to haves, maybe for the future:

  • return N alternate tokens (for last or for all)
  • allow tokens as input; I think maybe an array where numbers are token codes and strings would be tokenized. There was one model where you had to use special tokens for the assistant and human, something like: [32005, "Hello.", 32006] (a rough sketch of this follows below)
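
As a rough sketch of that mixed-array idea (nothing here is implemented; the helper name is made up, it just leans on the existing llama_tokenize and nlohmann::json calls):

#include <algorithm>
#include <string>
#include <vector>
#include "llama.h"
#include "json.hpp"   // nlohmann::json, already used by the server example

// Numbers in the array are taken as raw token ids, strings are run through the tokenizer.
static std::vector<llama_token> parse_prompt_tokens(llama_context * ctx, const nlohmann::json & prompt) {
    std::vector<llama_token> out;
    for (const auto & item : prompt) {
        if (item.is_number_integer()) {
            out.push_back(item.get<llama_token>());
        } else if (item.is_string()) {
            const std::string text = item.get<std::string>();
            std::vector<llama_token> buf(text.size() + 1);
            const int n = llama_tokenize(ctx, text.c_str(), buf.data(), (int) buf.size(), false);
            out.insert(out.end(), buf.begin(), buf.begin() + std::max(n, 0));
        }
    }
    return out;
}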

SlyEcho avatar May 28 '23 09:05 SlyEcho

I grabbed your refactor and I'll work on getting that stuff in. I'll commit the low-hanging stuff (shifting args/params to the right places) and then look into timeouts and the extra JSON data.

Is there a value in returning the input prompt considering it must necessarily come from the caller? Just for sanity checking? I think it would add a lot of overhead to token streaming. Though maybe it's worth splitting out the handling there. Or just passing it on stop. I'll take a look at it.

Edit: According to a comment in the file, the UTF-8 thing is a JSON parser issue. Not sure what the path forward is on that. The way it's written, it just kicks back an empty response when it's getting invalid UTF-8 stuff, which I think I ran into yesterday when one model just kept returning empty. Only happened once while I was testing some other things so I didn't think much of it.

digiwombat avatar May 28 '23 10:05 digiwombat

Is there a value in returning the input prompt considering it must necessarily come from the caller?

Let's think about it later, then.

UTF-8 thing is a JSON parser issue. Not sure what the path forward is on that.

I think if detected, it could be replaced with the replacement character, which in C++ is maybe "\xEF\xBF\xBD" as UTF-8.

SlyEcho avatar May 28 '23 10:05 SlyEcho

Setting timeouts looks pretty trivial. I'll assume that's best defined by the server (edit: just saw you said that in the checklist). Not sure what a good default is. Non-streaming generation could take a while, so I'll default to 600 seconds. Don't want anyone with slow hardware or doing long gens to get cut off. Let me know if that seems too high. I'm more interested in avoiding timeouts for non-streaming than the long open socket times, since I imagine the primary use will be local. I'm also setting the read and write timeouts to be the same.

Might make more sense to split them, or to just set the read timeout and keep the write timeout lower. Not sure what the httplib default is. Again, I am biasing a bit toward the local use-case, which may have more esoteric needs, though I'm always in favor of more granular control.
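
For reference, what I'm setting amounts to roughly this (a sketch assuming cpp-httplib's Server API, with the timeout values standing in for whatever CLI options we settle on):

#include <ctime>
#include "httplib.h"

// Apply one value per direction for now; arguments are seconds plus microseconds.
static void configure_timeouts(httplib::Server & svr, time_t read_timeout_s, time_t write_timeout_s) {
    svr.set_read_timeout (read_timeout_s,  0);
    svr.set_write_timeout(write_timeout_s, 0);
}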

replaced with the replacement character which is in C++ maybe "\xEF\xBF\xBD" as UTF-8.

I'll add that in. I'll switch the catch to:

json data = {
    {"content", "\uFFFD"},
    {"stop", !llama.has_next_token}};

for streaming. That should give them � instead of nothing, which should at least point to something being weird. The non-streaming version dumps out with a Bad encoding token message which seems fine for now.

digiwombat avatar May 28 '23 11:05 digiwombat

Okay, added most of the checklist stuff. Streaming requests return the extra values along with the last token when the generation is done. It's all pretty small potatoes for data use, but I think that hits a nice middle ground since things aren't apt to change in flight. Streaming requests also return the full generated text on the last token so people can do whatever they want with that. Code is simple enough that people could switch it to send on every token.

Somehow the as_loop setting got removed somewhere in there, so I added that back in to fix up streaming requests. Also, I noticed this last night, but the output never seems to stop now. It generates a reasonable reply and then usually goes into spitting out Greek or Cyrillic. Not sure if that's just the generation settings being set via simple-proxy or something else. I'd love to know if anyone can test that for me.

I think I covered everything except the truncation stuff. I'd need some advice on how to handle that, since I'm not sure where it's being captured, if at all, or what we'd want to capture. Still, it currently builds and runs, so other than the overrun it's running fine.

digiwombat avatar May 28 '23 13:05 digiwombat

Also, I noticed this last night, but the output never seems to stop now. It generates a reasonable reply and then usually goes into spitting out Greek or Cyrillic.

I haven't tested it much but it seems to stop when using a stopword or token limit.

SlyEcho avatar May 28 '23 14:05 SlyEcho

@SlyEcho

  • [x] RNG seed should be a JSON parameter, not a server CLI argument; if unset, use the same logic as before.

I had considered that before, but upon reviewing it, I realized that the seed parameter is referenced only when loading the model, not when it starts generating tokens. Perhaps I didn't analyze the sampling code properly.

struct llama_context * llama_init_from_gpt_params(const gpt_params & params) {
    auto lparams = llama_context_default_params();

    lparams.n_ctx        = params.n_ctx;
    lparams.n_gpu_layers = params.n_gpu_layers;
    lparams.seed         = params.seed; // The seed parameter is necessary when loading the model.
    lparams.f16_kv       = params.memory_f16;
    lparams.use_mmap     = params.use_mmap;
    lparams.use_mlock    = params.use_mlock;
    lparams.logits_all   = params.perplexity;
    lparams.embedding    = params.embedding;

    llama_context * lctx = llama_init_from_file(params.model.c_str(), lparams);

FSSRepo avatar May 28 '23 17:05 FSSRepo

A "ignore_eos" parameter for the completion endpoint would be cool similar to this one: https://github.com/ggerganov/llama.cpp/blob/d2c59b8ba498ab01e65203dde6fe95236d20f6e7/examples/common.cpp#L301-L302


Trying to make the AI send multibyte characters like an emoji just ends up in an error with status 500; there's no error message nor anything printed in the server.

I have seen code like this in llama-cpp-python to handle this, but I don't understand it well: https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama.py#L757-L768

anon998 avatar May 28 '23 17:05 anon998

Trying to make the AI send multibyte characters like an emoji just ends up in an error with status 500; there's no error message nor anything printed in the server.

It's a limitation of the JSON parser: https://json.nlohmann.me/features/types/#encoding

FSSRepo avatar May 28 '23 17:05 FSSRepo

A "ignore_eos" parameter for the completion endpoint would be cool similar to this one:

https://github.com/ggerganov/llama.cpp/blob/d2c59b8ba498ab01e65203dde6fe95236d20f6e7/examples/common.cpp#L301-L302

I can add this:

if (!body["ignore_eos"].is_null() && body["ignore_eos"].get<bool>())
  {
      llama.params.logit_bias[llama_token_eos()] = -INFINITY;
  }
  else
  {
      llama.params.logit_bias.erase(llama_token_eos());
  }

And it would allow for sending ignore_eos per generation.

As for the 500 error, it should now send back a Unicode replacement character. Unless the json is choking on it anyway; I'd think the try/catch would catch it.

digiwombat avatar May 28 '23 17:05 digiwombat

Also, I noticed this last night, but the output never seems to stop now. It generates a reasonable reply and then usually goes into spitting out Greek or Cyrillic.

I haven't tested it much but it seems to stop when using a stopword or token limit.

This is the proxy request that @digiwombat is sending:

{ "prompt": "### Human: Question?\n### Assistant: Response", "n_predict": 1270, "stop": [ "\n##", "\nYou:", "</s>", "\n### Human:", "\n### Assistant:" ], "as_loop": true, "interactive": true, "reload_ctx": true, "temperature": 0.72, "top_p": 0.73, "top_k": 0, "typical_p": 1, "tfs_z": 1, "repeat_penalty": 1.1, "repeat_last_n": 2048, "n_keep": -1 }

Note that it sends several strings to stop the generation. I don't know if it's the most correct way, but it's how most frontends work, and I believe the server in llama.cpp is not expecting this.

Sorry, ignore that, I just checked... llama.params.antiprompt = body["stop"].get<std::vector<std::string>>();

LiliumSancta avatar May 28 '23 17:05 LiliumSancta

Trying to make the AI send multibyte characters like an emoji just ends up in an error with status 500; there's no error message nor anything printed in the server.

It should just generate another token until the generated text ends in a complete UTF-8 sequence. I will try that later.
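
Something along these lines should do for the check (a sketch, untested): see whether the tail of the generated text is a partially emitted UTF-8 sequence and, if so, hold those bytes back and keep sampling.

#include <string>

// If the end of `text` is an unfinished multi-byte UTF-8 sequence, return how many
// of its bytes are present so far (so they can be held back); 0 means the text
// already ends on a complete character.
static size_t incomplete_utf8_suffix(const std::string & text) {
    for (size_t i = 1; i <= 4 && i <= text.size(); ++i) {
        const unsigned char c = text[text.size() - i];
        if ((c & 0xC0) == 0x80) {
            continue;                       // continuation byte, keep scanning backwards
        }
        // found the lead byte: how long should this sequence be in total?
        const size_t expected = (c & 0xE0) == 0xC0 ? 2
                              : (c & 0xF0) == 0xE0 ? 3
                              : (c & 0xF8) == 0xF0 ? 4 : 1;
        return expected > i ? i : 0;
    }
    return 0;
}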

I realized that the seed parameter is referenced only when loading the model, not when it starts generating tokens.

There is a function to set the seed later. But I should mention that I didn't get it to work deterministically: if you send the same input over and over to the API, it will generate different outputs even given the same seed. There may be more state that needs to be reset in llama.cpp; it doesn't show up with main.cpp because that never restarts generation from the beginning.
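
The function I mean is llama_set_rng_seed(); per request it could look roughly like this (a sketch only; the "seed" field name and the time-based fallback are assumptions borrowed from how common.cpp handles a negative seed):

#include <ctime>
#include "llama.h"
#include "json.hpp"

// Pull an optional "seed" from the request body and apply it per generation;
// a missing or negative seed falls back to a time-based one.
static void apply_seed(llama_context * ctx, const nlohmann::json & body) {
    int seed = body.value("seed", -1);
    if (seed < 0) {
        seed = (int) time(NULL);
    }
    llama_set_rng_seed(ctx, seed);
}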

SlyEcho avatar May 28 '23 17:05 SlyEcho

@digiwombat, maybe a quick change as well to change --memory_f32 to --memory-f32?

common.cpp actually supports both, but --memory-f32 is the one in the help text.

This was changed in #1416

SlyEcho avatar May 28 '23 17:05 SlyEcho

Well, it was apparently already done on master... :rofl:

SlyEcho avatar May 28 '23 18:05 SlyEcho

"--ctx_size" is missing from the help text. It's also called "--ctx-size" in common.cpp, although I still see the underscore one in various other places.

fprintf(stderr, "  -c N, --ctx-size N    size of the prompt context (default: %d)\n", params.n_ctx);

anon998 avatar May 28 '23 18:05 anon998

"--ctx_size" is missing

I even scrolled the list before I committed the --memory-f32 one. YouTube on the third monitor is a terrible coding assistant. lol. Thanks anon.

digiwombat avatar May 28 '23 18:05 digiwombat

The logic here seems to be wrong. It sets "has_next_token" to false when it finds the eos token but it later overwrites that value. https://github.com/ggerganov/llama.cpp/blob/6c58f64a3bef5d1306729a9b0820084b9b074aa8/examples/server/server.cpp#L236-L241

anon998 avatar May 28 '23 20:05 anon998

Yep, that should have a return, I'm guessing. Maybe there was one and a ghost deleted it. A blameless ghost. I'll throw one back in.
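
Something shaped like this, roughly (the surrounding logic here is paraphrased rather than copied from the linked lines, so the names are approximate):

// Once EOS shows up, bail out right away so the later n_remain check can no
// longer flip has_next_token back to true.
if (!embd.empty() && embd.back() == llama_token_eos()) {
    has_next_token = false;
    return result;                      // the missing early return
}
has_next_token = n_remain != 0;         // only reached when EOS was not generated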

Edit: Yep. That fixed it. Nobody look through the commit history. It's haunted.

digiwombat avatar May 28 '23 20:05 digiwombat

Ah, seems we've got a regression. Regens are broken again. They continue the context like they were doing before.

I can either put my context delete logic back in or @SlyEcho can take a look.

digiwombat avatar May 28 '23 22:05 digiwombat