llama.cpp
added implementation of DRY sampler (post-refactor)
- [x] I have read the contributing guidelines
- Self-reported review complexity:
- [ ] Low
- [x] Medium
- [ ] High
This is a continuation of the previous PR #6839 to implement DRY sampling (and I think a pretty faithful port of Koboldcpp's impressive implementation, which itself is based on @p-e-w's original design and ideas as explained here), but now overhauled to use the new sampler API. I added @l3utterfly as co-author in a recent commit of this PR.
Here are some important notes:
- I may have also fixed #8971 (or it may turn out to be more of a band-aid): by moving sampling so that it is forced to run over the entire prompt at the conclusion of prompt processing in server.cpp, I was able to get consistent sampling of the prompt, regardless of caching or whether it is the first generation or not (see the sketch after these notes).
- I also fixed a recent breaking change, though I am not sure when it was introduced. For quite a while it was possible to submit a mixed array of strings and token IDs as a singleton prompt (I have relied heavily on this feature in my own software for a long time now); I believe it was originally implemented in PR #2306. Apologies if this is considered out of scope here, and I can remove it or move it to a separate PR if need be, but to be honest this feature was the main reason I converted to using llama.cpp's server. As far as I can tell, this fix restores the feature without breaking the other prompting options, but if I missed something here, please let me know.
- I added a context size parameter to gpt_sampler_init(). If there is a smarter way to handle this, I am happy to do it in a less drastic way, but I couldn't immediately figure that out.
- The classic issue I have dealt with from day one is still here: I can't figure out how to get the vector of strings for the sequence breakers from server.cpp/common/API land over to src/llama-sampling land. I again "cheated" with a C++ section at the bottom of llama.h and made a separate function for setting the sequence breakers, apart from the regular init function. There is probably some clever way to avoid this, or a way to use a void pointer or extern that I am not as experienced with. I welcome any help or suggestions with this or any of the other issues.
- Finally, I did not implement a CLI argument for sequence breakers. If anybody wants to take a whack at that, please feel free. My thought was that it could maybe be done similarly to how the reverse/anti-prompt (otherwise known as stopping strings) or the JSON schema is handled in the CLI, but I wasn't really sure how to pull it off correctly.
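Regarding the first note above (the possible #8971 fix), the core idea is sketched below. This is an illustration only, not the PR's actual diff: `slot.prompt_tokens` and `slot.smpl` are assumed names for how server.cpp tracks a slot's prompt tokens and its sampler, and the exact placement in the code differs.

```cpp
// Illustration only (assumed names, not the PR's exact code): once prompt
// processing has finished, push every prompt token through the sampler so
// that penalty-based samplers such as DRY always see the full prompt, even
// when most of it was reused from the KV cache.
for (const llama_token token : slot.prompt_tokens) {
    gpt_sampler_accept(slot.smpl, token, /* accept_grammar = */ false);
}
```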
@compilade You had kindly offered possibly doing a code review for the previous iteration of this PR before the sampler API overhaul. If you still have availability for that, I would definitely love to get your feedback and insights on this newest iteration.
@compilade Thanks for taking a look! If it's any help, the koboldcpp version of this has a lot of extra comments that go more in-depth on the intricacies of the Z-algorithm and the clever way they implement it. Most of the core implementation here is exactly the same, except for the bits I had to change due to the ring_buffer being accessed from the end.
KoboldCPP's handling of sequence breakers is excellent, and it's great to see it ported here.
However, the Z-algorithm is actually unnecessary complexity as I had remarked in the Kobold PR. Quadratic runtime can be avoided by simply capping the maximum sequence length being matched (to 50, in the case of text-generation-webui). The penalty is exponential in the match length, so this makes no difference in practice as the value will be astronomically large regardless. This can simplify the algorithm considerably, and has been tested to perform well even in the worst-case scenario where the input consists entirely of repetitions of a single token.
Personally, I prefer simplicity over notions of algorithmic elegance if performance is demonstrably equivalent as in this case, although I accept and respect that not everyone shares that opinion.
Either way, great to see this happening! Make sure you run some basic smoke tests before merging this in, e.g. using the input from the "Demonstration" section of my original PR, to verify the system is actually doing what it is supposed to, since the implementation is fairly complex at this point.
> Personally, I prefer simplicity over notions of algorithmic elegance if performance is demonstrably equivalent as in this case, although I accept and respect that not everyone shares that opinion.
@p-e-w I really appreciate your feedback and am a fan of not just this sampler but also XTC, which I was happy to see at least two people beat me to the punch on implementing with this new sampler API (albeit with one of the submissions being made through Reddit rather than a PR).
As for the issues you raise about complexity, you bring up very fair points. I will admit that I do tend to be a sucker for "elegant" algorithms, or ones that seem particularly clever to me. Having said that, I would not have been opposed to implementing it the less complex way, but I didn't realize you weren't a fan of the Z-algorithm portion until I was already pretty far into the process and saw your comments while digging deeper into all the PRs.
> Either way, great to see this happening! Make sure you run some basic smoke tests before merging this in, e.g. using the input from the "Demonstration" section of my original PR, to verify the system is actually doing what it is supposed to, since the implementation is fairly complex at this point.
In regard to your demonstration text, that was indeed one of the first things I tested it on! For anybody else looking at this, here is the first part of it that is fed to the LLM (in this case, I fed it to Nemo):
Detective: Where were you last night at 6 PM?
Suspect: On the advice of my attorneys, I invoke my Fifth Amendment right to not answer that question.
Detective: Did you know the victim personally?
Suspect: On the advice of my attorneys, I invoke my Fifth Amendment right to not answer that question.
Detective: Do you have money problems?
Suspect: On the advice of my attorneys, I invoke my Fifth Amendment right to not answer that question.
Detective: Do you have a criminal record?
Suspect: On the advice of my attorneys, I invoke
Here is what Nemo gave me with DRY off when I ran it just now:
my Fifth Amendment right to not answer that question.
Detective: What did you do for a living?
Suspect: On the advice of my attorneys, I invoke my Fifth Amendment right to not answer that question.
Detective: Have you ever been arrested?
Suspect: On the advice of my attorneys, I invoke my Fifth Amendment right to not answer that question.
Detective: Are you married?
Suspect: On the advice of my attorneys, I invoke my Fifth Amendment right to not answer that question.
...plus 12 more successive repetitions of this pattern as I had banned EOS...
Here is what Nemo gave me with this implementation of DRY turned on (multiplier: 0.8, base: 1.75, allowed_length: 2, default sequence breakers which include ":"):
—
Detective: Wait, you're not going to invoke your Fifth Amendment rights on that question?
Suspect: No. I don't have a criminal record.
...plus many more lines, none having that pattern...
Here is some relevant debug output showing the final tokens of the prompt, as well as the penalty itself and a comparison of logits before and after:
```
109 | 72859 | Sus
110 | 2117 | pect
111 | 1058 | :
112 | 3249 | On
113 | 1278 | the
114 | 19199 | advice
115 | 1307 | of
116 | 2036 | my
117 | 101642 | attorneys
118 | 1044 | ,
119 | 1362 | I
120 | 45589 | invoke

Applied penalty 40.2121 to token 2036 ( my) (repeat length 9)

Top 10 candidates before penalties:
Token 2036 ( my): logit = 23.2354
Token 1674 (—): logit = 14.6016
Token 9283 (...\n\n): logit = 14.5702
Token 1278 ( the): logit = 14.3120
Token 38659 ( ...\n\n): logit = 13.9730
Token 2251 ( —): logit = 13.8171
Token 2880 (...): logit = 12.9741
Token 57725 ( Fifth): logit = 12.7038
Token 2247 ( .): logit = 12.5666
Token 10260 (…\n\n): logit = 12.5319

Top 10 candidates after penalties:
Token 1674 (—): logit = 14.6016
Token 9283 (...\n\n): logit = 14.5702
Token 1278 ( the): logit = 14.3120
Token 38659 ( ...\n\n): logit = 13.9730
Token 2251 ( —): logit = 13.8171
Token 2880 (...): logit = 12.9741
Token 57725 ( Fifth): logit = 12.7038
Token 2247 ( .): logit = 12.5666
Token 10260 (…\n\n): logit = 12.5319
Token 1536 ( by): logit = 12.3046
```
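As a sanity check on the logged penalty, and as an illustration of @p-e-w's earlier point that capping the matched length would change nothing in practice, here is a minimal sketch assuming the usual DRY formula `penalty = multiplier * base^(match_length - allowed_length)` with the parameters used above; the cap of 50 is the text-generation-webui value mentioned in the discussion, not something in this PR's code:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Parameters from the demonstration above.
    const double multiplier     = 0.8;
    const double base           = 1.75;
    const int    allowed_length = 2;

    // Assumed formula: penalty = multiplier * base^(match_length - allowed_length)
    for (const int match_length : {9, 50}) {
        const double penalty = multiplier * std::pow(base, match_length - allowed_length);
        std::printf("match length %2d -> penalty %g\n", match_length, penalty);
    }
    // Length 9 gives ~40.21, matching the debug log; length 50 gives ~3.7e11,
    // already astronomically larger than any logit, so capping the matched
    // length at 50 would not change the outcome in practice.
    return 0;
}
```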
@wwoodsTM
Fair enough, as long as everything works it should be fine, and since you've already implemented it I guess there's little point in changing it now. The elephant in the room with such decisions is usually maintainability, but to be frank I doubt this code will need to be touched very often.
BTW, I love the output you got with the demonstration example. It shows not only that DRY is capable of stopping repetition (which is a mathematical certainty in principle), but that the model recovers organically from the intervention, creating a natural transition to a completely different train of thought. Many people have told me that they expect DRY to just make the model paraphrase the previous output instead of repeating it verbatim, or to introduce spelling mistakes, but your output demonstrates that this usually doesn't happen with current-gen models. And it's even less likely to happen in actual conversations, because in such scenarios DRY gradually increases the penalty, rather than slamming the door as in this example, where the user has already imposed a 9-token repetition.
@compilade I went ahead and made several small fixes that hopefully should address the strict compile warnings/errors I saw from the workflows. Incidentally, I also removed a "cstring" include in sampler.cpp that I had added earlier but that is no longer needed, and also removed the redundant line in server.cpp I had commented on earlier (and resolved that comment).
Thanks again and let me know if I can do anything else to help the process.
It looks like @compilade has been showing a busy status since the weekend, and I'm guessing he has a pretty full plate at the moment. I look forward to seeing any feedback he may have, but I also wanted to extend an invitation to any other maintainers who may have a moment to take a look.
@slaren or @ggerganov, with your recent sampling-focused work/review, would either of you have a chance to take a look? I am excited about this being so close to the finish line and I have seen great results from all the testing I have done. Thank you!
There is a lot of code here that makes this change hard to review. It would be easier to do so if all the debug code was removed. Functions like llama_tokenize in llama-sampling.cpp should probably be removed as well, or moved elsewhere.
@slaren Thank you for the feedback!
- [x] I just removed all of the debugging code that had been between the `#ifdef`s.
- [x] I also moved the tokenize/detokenize functions to `llama-impl.h`.
Despite it not being a strict interface function, I left `GetOverlappingTokenSequences` in llama-sampling.cpp since it's an integral part of processing the sequence breakers. If you think I should also move it out, or rename it to make the connection to DRY more obvious, I can do that too.
About an hour ago I resolved the recent conflicts that arose from #9805. Prior to that, all of the CI workflows that @compilade triggered were showing as passing.
We now have explicit permission from @pi6am to base this version on his implementation from koboldcpp. I also added him as a co-author on earlier commits of this PR.
I believe I have addressed all of the concerns from @compilade's code review in my latest commit, but I of course welcome any and all additional feedback. @ggerganov @slaren
Thanks.
@slaren I implemented your suggestions and fixed a conflict with the latest version of master.
Continuing the conversation on the sequence breaker functions: While I had only meant these to be a "part 2" of the init process, I can see how this was not clear and how regardless of intention they don't mesh well with the design.
To address this, I came up with a way of getting rid of the sequence breaker functions altogether which I just implemented in the latest commit. This new approach uses one main init function which includes the sequence breakers as a parameter, and then I made two overloaded wrapper versions, one for the C-interface and another for test-sampling.cpp.
Does this approach seem workable?
Why not take `char **` instead and make it part of the public API?
@slaren
> Why not take `char **` instead and make it part of the public API?
So as @compilade mentioned before, I preferred the flexibility of C++ strings to allow for the possibility of NUL characters, since I wasn't sure whether there might be some contexts where NUL, the literal "sequence breaker" for C strings, might itself be wanted as a sequence breaker for DRY. I also wasn't sure whether any other weird encoding issues might crop up with C strings. But maybe the chances of this are quite remote?
Also, in my particular use case I use llama-server and pass sequence breakers through JSON, so this probably biases me toward thinking of that as the principal interface rather than the C interface.
As a side note, I just made an additional commit to address warnings treated as errors from the pedantic runs that I had missed.
I am not familiar with this sampler, and I am not sure what kinds of strings are used as sequence breakers, but it seems very unlikely that allowing NUL characters is important. If it is really necessary, it can be done by passing an additional array with the sizes of each string.
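For what it's worth, the length-array variant would be simple on the receiving side. A minimal sketch, with a hypothetical helper name (this is not code from the PR):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper illustrating the suggestion above: with an explicit
// length array, sequence breakers containing embedded NUL bytes survive the
// round trip through a C interface.
static std::vector<std::string> breakers_from_c(
        const char  ** strs,
        const size_t * lens,
        size_t         n) {
    std::vector<std::string> out;
    out.reserve(n);
    for (size_t i = 0; i < n; ++i) {
        out.emplace_back(strs[i], lens[i]); // (pointer, length) ctor keeps '\0' bytes
    }
    return out;
}
```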
@slaren
Do you feel this is a deal-breaker in its current form? My feeling is that at present it is a decent compromise to allow for keeping things in pure C++ string form where possible, such as when the breakers are submitted through JSON and tokenized in get_overlapping_token_sequences, as opposed to converting to C strings and then converting back to C++ strings every time. In the current implementation, sequence breakers would only be converted from C-strings on an as-needed basis when they are specifically submitted through the C-interface.
My opinion is that it is not worth the maintenance cost of having "secret" API functions, especially for something as minor as this. It takes two lines of code to convert a vector of std::string to a vector of const char *, and it only needs to be done once in sampling.cpp.
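For illustration, the conversion might look like the sketch below; the function name and the commented-out init call are placeholders rather than the PR's actual code:

```cpp
#include <string>
#include <vector>

// Sketch of the conversion described above: keep the breakers as std::string
// on the common/ side and hand const char * pointers to the C API at init time.
static void init_dry_from_strings(const std::vector<std::string> & breakers) {
    std::vector<const char *> c_breakers;
    c_breakers.reserve(breakers.size());
    for (const std::string & s : breakers) {
        c_breakers.push_back(s.c_str()); // pointers remain valid while `breakers` lives
    }
    // e.g. llama_sampler_init_dry(..., c_breakers.data(), c_breakers.size());
}
```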
I hear you, it’s just that C-strings have a history of sending shivers down my spine and making me curse in repetitive ways that would probably be penalized by this sampler. I was hoping to avoid converting the overlap code from kobold to use them, but honestly it shouldn’t be hard since we’re not really manipulating the strings or doing anything complicated with them, so I will get over my fear of them and make the change a bit later today when I get home.
Thanks again for your feedback.
I am not sure if I understand correctly, but it is ok to store the strings in the sampler state as std::string, there is no cost to doing that, since you need to make a copy of the strings anyway to store them (can't just steal the pointers from the application). So hopefully changing this does not require converting the overlap code.
> I am not sure if I understand correctly, but it is ok to store the strings in the sampler state as `std::string`, there is no cost to doing that, since you need to make a copy of the strings anyway to store them (can't just steal the pointers from the application). So hopefully changing this does not require converting the overlap code.
Ok, I have gone ahead and eliminated the C++ version of llama_sampler_init_dry. @slaren, as you suggested, I left the overlap function alone and basically did a std::string --> C-string --> std::string conversion process to smuggle the breakers through the C-interface.
I have implemented all suggestions so far into the latest commit, including the use of llama_tokenize_internal as well as eliminating the support for JSON-within-JSON for passing sequence breakers. I also posted a message on SillyTavern's code-contribution channel in Discord and @Cohee1207 confirmed what @ngxson also pointed out here concerning TextGen WebUI being the origin of this format. @Cohee1207 expressed openness to a llama.cpp specific conversion so that ST would be compatible with this stricter implementation.
@ggerganov I transitioned to using llama_vocab as suggested. This also allowed me to use `vocab.n_vocab` directly in the overlap function. I do not hold onto vocab in the sampler context, as it is only needed for initialization and processing of the sequence breakers, all of which I moved into the main init function itself.
I also changed the specialized llama_detokenize function I was using so that it now takes llama_vocab, moved it to llama-vocab.cpp, and restored llama-impl.h to its previous state.
Sorry for all the commits and revisions that were necessary, and I appreciate everybody's patience and feedback through this process.
I just merged the latest changes from master, including the overhaul of test-sampling.cpp from several hours ago. I also made all of the recent suggested changes, including using llama_n_ctx_train(model) and the cleaner use of vocab.
Overall, this PR is in a much leaner and cleaner state because of everybody's valuable feedback, and thank you to everyone who offered it. Please let me know if I can do anything else, but I am definitely happy with how far it's come.
The sampler implementation looks good to me. I am not sure about the changes to the server, at least that will need to be cleaned up, but I will leave that review to @ngxson and @ggerganov.
Thank you for taking a look.
As for the server code, I wanted to leave the old code commented out just to make it very clear where the changes were made, until someone confidently said it was good to go. What I can say is that moving the sampling code in server.cpp definitely seems to have resolved the major bug that was reported before and that I discussed in the original post for the PR. And just to add some information, this bug was honestly worse than originally reported: it would not even pass the full prompt on further generations and was always behind in handing tokens to the sampler system. As it stands, this process now seems to work smoothly and passes the complete prompt each time, but I am not sure about all use cases or whether it inadvertently impacts something else.
@p-e-w Thank you for all your many good catches there and your suggestion for the --dry-sequence-breaker singular CLI argument which I have implemented.
I have removed the mixed-array change, as requested by @ngxson, since that issue is out of scope for this PR.
To my knowledge, the only pending item at this point is the question above about whether it is OK to keep my band-aid fix in the PR as a bridge to a more proper solution.
@ggerganov @ngxson The separate PR #10019 for the server sampling code is up and just passed all checks. I removed all of the commented-out code from it, which left just the precise removals and additions needed. I also went ahead and resolved all the comments above to reflect the current status. Thank you.
Not sure what I did wrong with the "co-authored" part in the first force-pushes as GitHub didn't seem to want to recognize that; maybe it deals with force-pushes differently in that respect? Anyway, I removed an extra empty line that I probably should've removed originally anyway and did a normal commit to make sure that @l3utterfly and @pi6am were restored as co-authors on this PR.
@ggerganov @slaren @ngxson
Is it possible my force-pushes and commit all in quick succession threw off the tests and that's why we're getting failures?
Some of the builds have warnings as errors, it fails to build here:
examples/server/server.cpp:908:18: error: unused variable 'dry_sequence_breakers' [-Werror,-Wunused-variable]
> Some of the builds have warnings as errors, it fails to build here:
> `examples/server/server.cpp:908:18: error: unused variable 'dry_sequence_breakers' [-Werror,-Wunused-variable]`
Thank you! Just recommitted.
I thought I had my warnings and pedantic settings all cranked to the max but somehow that one didn't show for me.