Sampling interface, new samplers;
"ignore EOS" should apply -inf to the EOS logit.
New samplers:
- locally typical sampling
- tail free sampling
- frequency and presence penalty
- mirostat
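On the "ignore EOS" point above, a minimal sketch of the idea, assuming the application has a pointer to the logits and already knows the EOS token id (both are placeholders here):

```cpp
#include <cmath>

// With a logit of -inf, EOS gets probability 0 after softmax,
// so no downstream sampler can ever pick it.
void ignore_eos(float * logits, int eos_id) {
    logits[eos_id] = -INFINITY;
}
```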
Nice work!
I'll link the literature here. Feel free to complete it with more up-to-date sources.
- CTRL paper for the repetition penalty currently used in llama.cpp
- Frequency and Presence penalties
- locally typical sampling
- tail free sampling
- mirostat
I like the idea of a modular interface for sampling. It lets each example and application combine these parts into its own kitchen-sink sampling that fits its needs. Going further, the llama.h interface could be stripped down to only provide access to the logits and the vocabulary, with the sampling code moved to a separate object file. This would emphasize and guarantee the extensibility of the samplers.
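To illustrate the kind of caller-side composition this would allow, here is a minimal, hypothetical sketch of a sampler module that only needs the raw logits and the vocabulary size; the names and structs are mine, not the actual llama.h API:

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

struct candidate { int id; float logit; };

// Each sampler is a small transform over the candidate list, so every
// application can chain whichever subset it needs, in whatever order.
void apply_temperature(std::vector<candidate> & c, float temp) {
    for (auto & x : c) x.logit /= temp;
}

void apply_top_k(std::vector<candidate> & c, size_t k) {
    if (k >= c.size()) return;
    std::partial_sort(c.begin(), c.begin() + k, c.end(),
        [](const candidate & a, const candidate & b) { return a.logit > b.logit; });
    c.resize(k);
}

int sample_token(const std::vector<candidate> & c, std::mt19937 & rng) {
    // softmax over the remaining candidates (unnormalized weights are enough
    // for std::discrete_distribution)
    float max_logit = -INFINITY;
    for (const auto & x : c) max_logit = std::max(max_logit, x.logit);
    std::vector<float> w;
    for (const auto & x : c) w.push_back(std::exp(x.logit - max_logit));
    std::discrete_distribution<int> dist(w.begin(), w.end());
    return c[dist(rng)].id;
}

// Usage, given `const float * logits` and `int n_vocab` from the model:
//   std::vector<candidate> cands;
//   for (int i = 0; i < n_vocab; ++i) cands.push_back({i, logits[i]});
//   apply_top_k(cands, 40);
//   apply_temperature(cands, 0.8f);
//   int tok = sample_token(cands, rng);
```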
I am hesitant about the current implementation of repetition penalization. As an illustration, I question whether the occurrence of past newlines and punctuation should guide the sampling of the following tokens. One fix would be to weigh repetitions against a simple frequency model; however, I wasn't able to recover such frequencies from the tokenizer weights. More information can be gathered by measuring the length of the repetition that the next token would complete or interrupt. I have implemented this idea, along with an exponential decay.
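To make the exponential-decay part concrete, here is one possible sketch of a recency-weighted penalty; this is my own illustration of the general idea, not the implementation referred to above:

```cpp
#include <vector>

// The contribution of a past occurrence decays exponentially with how far
// back it lies, so a newline from long ago weighs far less than a token
// repeated just now.
void penalize_recent_repeats(float * logits, const std::vector<int> & last_tokens,
                             float penalty, float decay /* e.g. 0.95 */) {
    float w = 1.0f;
    // walk the history from most recent to oldest
    for (auto it = last_tokens.rbegin(); it != last_tokens.rend(); ++it) {
        logits[*it] -= penalty * w;  // subtract, i.e. scale the probability down
        w *= decay;
    }
}
```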
Concerning the application of the penalization, I'm not sure whether it is better to offset the logits or to scale them. Subtracting from the logit, as the "frequency and presence penalty" does, amounts to scaling the probabilities. Scaling the logits, as discussed in the CTRL paper, can be thought of as raising the probabilities to a power, but it depends on the logit = 0 point, which is not particularly meaningful. Your current implementation applies both methods successively, which seems redundant.
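Spelled out for a penalized token with logit $z_i$: subtracting a constant rescales its (unnormalized) probability, while dividing the logit raises it to a power, which is why the latter interacts with the arbitrary $z_i = 0$ point:

$$
p_i \propto e^{z_i - c} = e^{-c}\, e^{z_i} \qquad \text{(offset: multiplies the probability by the constant } e^{-c}\text{)}
$$

$$
p_i \propto e^{z_i/\theta} = \left(e^{z_i}\right)^{1/\theta} \qquad \text{(scale, as in CTRL: raises the probability to the power } 1/\theta\text{)}
$$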
I haven't found the time to read about mirostat in detail. My limited understanding is that as the number of parameters goes up, the method becomes more challenging to apply in practice. It also seems difficult to control the changing target surprise mu through feedback, especially with an auto-regressive model. On the other hand, the promise of avoiding repetitions and boredom traps without looking at past tokens is very interesting.
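For reference, my reading of the Mirostat paper is that the feedback loop amounts to the following update after each sampled token $x$, with target surprise $\tau$ and learning rate $\eta$:

$$
s = -\log_2 p(x), \qquad \mu \leftarrow \mu - \eta\,(s - \tau)
$$

where $s$ is the observed surprise of the sampled token and $\mu$ is the truncation threshold used for the next step.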
I found that it is quite difficult to evaluate the sampling algorithms. We have good starting points with your analysis, the information-theoretic formalism of the locally typical sampling and mirostat papers, and their evaluation methods. Doing such experiments takes time and effort, and large-scale human evaluations are next to impossible without a large community effort.
The CTRL paper does not mention it, but the CTRL repository in fact explicitly avoids penalizing newline tokens during sampling.
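If we wanted the same behaviour here, it would be a one-line exclusion in the penalty loop; a sketch, with the newline token id passed in from wherever the application gets it:

```cpp
#include <vector>

void penalize_except_newline(float * logits, const std::vector<int> & last_tokens,
                             float penalty, int newline_id) {
    for (int tok : last_tokens) {
        if (tok == newline_id) continue;  // never penalize newline occurrences
        logits[tok] -= penalty;
    }
}
```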
Rebased, added 2 commits since last review
Mark "ready for review" when you think it is good for merge
I do not have a Windows machine with MSVC installed, so I am not sure why it fails:
```
3: Test command: D:\a\llama.cpp\llama.cpp\build\bin\Release\test-sampling.exe
3: Working Directory: D:/a/llama.cpp/llama.cpp/build/tests
3: Test timeout computed to be: 1500
3/4 Test #3: test-sampling ....................***Exception: Numerical 0.01 sec
```
Ready for review
Very cool. I always wanted a way to blacklist tokens, like backslash.
Oh, I got it, for `\begin{code}`!
Yeah :smile: and `\end{code}`, the model often emits this before EOS or tries to dodge/end the conversation.
Already tested it, works great.
edit: it's `-l 29905-100000`, if anyone is interested.
You could write `-l 29905-inf` 😊
I have used `stof` instead of `stringstream` just to make "inf" work.
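For context, the difference in a nutshell (a small check; `std::stof` follows `strtof`, which accepts "inf", while stream extraction goes through `num_get`, which does not):

```cpp
#include <iostream>
#include <sstream>
#include <string>

int main() {
    float a = std::stof("-inf");        // parses to -infinity
    float b = 0.0f;
    std::istringstream("inf") >> b;     // extraction fails, b stays 0
    std::cout << a << " " << b << "\n"; // prints: -inf 0
}
```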
Any thoughts on removing the parameter defaults from the new sampling functions to keep llama.h compatible with C/Obj-C?
> edit: it's `-l 29905-100000`, if anyone is interested.
Could anyone please share how to get the token id, and could I pass multiple tokens at once with the `--logit-bias` flag?
@DenisSergeevitch you can supply `--verbose-prompt` ("print prompt before generation"), e.g.:

```
$ bin/main --verbose-prompt -m ../models/open_llama_7b_preview_300bt/ggml-model-q4_0.bin -p "Test prompt"
...
main: prompt: ' Test prompt'
main: number of tokens in prompt = 3
1 -> ''
5073 -> ' Test'
7593 -> ' prompt'
...
```
> pass multiple tokens at once

Yes, by passing multiple arguments, like `./main ... -l 2-inf -l 13+2 -l 228+5`.
Thanks, I have made a small uncensoring method based on this flag, works like a charm!
@ivanstepanovftw I'm working on a Rust-based implementation of these samplers and using the code you wrote as a reference. I'm crediting the llama.cpp project, but I can mention you by name in the project README as well since you wrote it (and I don't think it's really been changed much since the initial commit). I didn't want to do something like that without asking first, though.
Also, if you're unhappy with the way I'm handling this (the credits or otherwise) please let me know and hopefully we can work something out!
Link: https://github.com/KerfuffleV2/llm-samplers/
@KerfuffleV2 Sure you can! Glad that you support RWKV, looks very promising.