
grammars: x{min,max} repetition operator

Open ochafik opened this issue 4 months ago • 17 comments

Add bounded repetition operators x{n}, x{,n}, x{m,n}, x{m,} to GBNF (unified w/ + / * / ?), and update JSON schema converters to use them.

Also improved the parser test w/ support for pretty-printing expectations, for easier updates.

./main \
  -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
  --grammar 'root ::= "a"{,100}' \
  -p "A very long aaaaaa sequence: " \
  --no-display-prompt --log-disable

# aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Branch checkout instructions:
git clone https://github.com/ochafik/llama.cpp --branch grammar-reps --single-branch llama.cpp-grammar-reps
cd llama.cpp-grammar-reps
make clean && make -j LLAMA_CURL=1 main

Notes:

  • https://github.com/ggerganov/llama.cpp/pull/6467 would play nicely with this: --grammar 'root ::= .{10,100}' to generate between 10 and 100 chars.
  • While https://github.com/ggerganov/llama.cpp/pull/6555 already handles fast repetitions in JSON, ~~it should be more efficient to handle it in the grammar itself~~ (edit: no measurable performance improvement, see https://github.com/ggerganov/llama.cpp/pull/6640#issuecomment-2067445163, but the generated grammars are much simpler) (+ it benefits non-JSON use cases)
  • This makes it easier to write efficient repetition constraints, and unlocks the implementation of upcoming features such as min/max value constraints in JSON schemas

Rewrite rules

Used to be:

S* --> S' ::= S S' |
S+ --> S' ::= S S' | S
S? --> S' ::= S |

Now it's:

S{m,n} --> S S S (m times) S'(n-m)
           S'(x)   ::= S S'(x-1) |
           S'(1)   ::= S |
S{m,} -->  S S S (m times) S'
           S'     ::= S S' |
S*     --> S{0,}
S+     --> S{1,}
S?     --> S{0,1}
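
For example, a rule such as root ::= "a"{2,4} would expand under these rules to something like the following (the rule names here are illustrative; the parser generates its own):

root   ::= "a" "a" root-1
root-1 ::= "a" root-2 |
root-2 ::= "a" |

i.e. two mandatory "a"s followed by up to two optional ones.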

Which means that in practice * / ? don't change, but + does:

S*     --> S'     ::= S S' |
S+     --> S S'
           S'     ::= S S' |
S?     --> S'     ::= S |

TODO before undrafting:

  • [x] Add exact {n} repetition
  • [x] Update GBNF doc
  • [x] Add tests
  • [x] Simplify JSON schema repetitions (https://github.com/ggerganov/llama.cpp/pull/6555) to use new operator
  • [x] Benchmarked (no change for + / * / ?)

cc/ @HanClinto (thanks for casting doubts on the rules rewrite in https://github.com/ggerganov/llama.cpp/issues/4218#issuecomment-2042985661 !) cc/ @ejones

ochafik avatar Apr 12 '24 14:04 ochafik

I like the pretty print options that you've put in here. One thing that we might want to consider is porting your pretty-print functions to gbnf-validator.cpp so that people can pretty-print arbitrary grammars?

HanClinto avatar Apr 12 '24 16:04 HanClinto

cc/ @HanClinto (thanks for casting doubts on the rules rewrite in https://github.com/ggerganov/llama.cpp/issues/4218#issuecomment-2042985661 !)

haha -- for sure! If nothing else, hopefully I'm good at casting doubt. :)

Daydreaming about this problem, and I'm pondering tackling it a completely different way.

An alternate approach I was dreaming about is adding a new operator type to llama_gretype that is something like LLAMA_GRETYPE_REPEAT_N, and the value of the llama_grammar_element wouldn't hold a rule ID or a unicode code point, but instead the number of times that the previous rule should be repeated.

So something like "a"{2,5} would hydrate to something like: (essentially "a"{2} ("a"{3} | ))

{LLAMA_GRETYPE_CHAR, 'a'},
{LLAMA_GRETYPE_REPEAT_N, 2},
{LLAMA_GRETYPE_CHAR, 'a'},
{LLAMA_GRETYPE_REPEAT_N, 3},
{LLAMA_GRETYPE_ALT, 0},
{LLAMA_GRETYPE_END, 0}

Or something like that.

Then when building the stacks in llama_advance_grammar, the REPEAT_N operator would essentially function like a rule ref, except that it would hydrate the previous rule and, if the repeat .value is > 1, append another REPEAT_N operator with value - 1.

Not sure if I'm making sense or not, but what I like about this is that (hopefully) we wouldn't need to hydrate specific rules for repeat = 1000 -- we could just set .value to 1000 and let it advance a bit more intelligently.
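
A rough sketch of what that advance step could look like (everything here is hypothetical -- llama.cpp has no REPEAT_N element type today, and the real stacks hold const llama_grammar_element pointers rather than values):

#include <cstdint>
#include <vector>

// Hypothetical element types, loosely mirroring llama_gretype.
enum hyp_gretype { HYP_END, HYP_CHAR, HYP_REPEAT_N };

struct hyp_element {
    hyp_gretype type;
    uint32_t    value; // code point for HYP_CHAR, repeat count for HYP_REPEAT_N
};

// After consuming one occurrence of `repeated`, re-arm the repetition:
// instead of having hydrated N copies up front, push REPEAT_N(n - 1)
// plus one more copy of the repeated element, until the count runs out.
static void hyp_advance_repeat(std::vector<hyp_element> & stack, const hyp_element & repeated) {
    if (stack.empty() || stack.back().type != HYP_REPEAT_N) {
        return; // no pending repetition
    }
    const hyp_element rep = stack.back();
    stack.pop_back();
    if (rep.value > 1) {
        stack.push_back({HYP_REPEAT_N, rep.value - 1});
        stack.push_back(repeated); // next copy to match
    }
}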

No idea if this would work or not, but this is a very loose daydream of an idea that has been flitting around in my head for a few days, and I wanted to write it down before it escaped again. :)

I've not made any progress on doing a POC for this or anything -- I'm just wondering about the feasibility of this approach right now.

HanClinto avatar Apr 12 '24 17:04 HanClinto

llama_grammar_element wouldn't hold a rule ID or a unicode code point, but instead the number of times that the previous rule should be repeated.

@HanClinto I did wonder about something like this; I think it would work, and the stack might just need to be made of a (edit) vector of element references + the number of times each has been repeated.

I wonder how much better it would perform, maybe it would allow breaking the new 100k repetition barrier?

On my side I'm probably done for this week, but keen to explore the head set optimization route later (cf. https://github.com/ggerganov/llama.cpp/issues/4218#issuecomment-2049604659 ; incubating here, which is mostly painful refactoring & prep work, next step is to update the stack with head-plausible alternatives only when accepting a char).

Edit: half of me is hoping you'll have implemented this alternative approach by the end of the weekend, the other half is hoping my silly head set code won't have bitrotten too much with fancy new alternative logic :-p

ochafik avatar Apr 12 '24 17:04 ochafik

So something like "a"{2,5} would hydrate to something like: (essentially "a"{2} ("a"{3} | ))

{LLAMA_GRETYPE_CHAR, 'a'},
{LLAMA_GRETYPE_REPEAT_N, 2},
{LLAMA_GRETYPE_CHAR, 'a'},
{LLAMA_GRETYPE_REPEAT_N, 3},
{LLAMA_GRETYPE_ALT, 0},
{LLAMA_GRETYPE_END, 0}

@HanClinto Might want to add different types for repeat exactly N, repeat unbounded, and repeat up to N times.

ochafik avatar Apr 12 '24 17:04 ochafik

llama_grammar_element wouldn't hold a rule ID or a unicode code point, but instead the number of times that the previous rule should be repeated.

@HanClinto I did wonder about something like this, I think it would work and the stack might just need to be made of an element reference + times it's been repeated.

Okay -- I think that's probably enough of an encouragement for me to at least try taking a stab at it this weekend.

I wonder how much better it would perform, maybe it would allow breaking the new 100k repetition barrier?

Yeah, that's what I'm dreaming of! :)

On my side I'm probably done for this week, but keen to explore the head set optimization route later (cf. #4218 (comment) ; incubating here, which is mostly painful refactoring & prep work, next step is to update the stack with head-plausible alternatives only when accepting a char).

This sounds really hopeful! I don't think I'm as familiar with classical grammar optimizations as you are, but this sounds really reasonable. Looking through the code, I think that there is a LOT of possible optimization work to be done in llama_grammar_reject_candidates / llama_grammar_reject_candidates_for_stack, which I think is the sort of thing that you're talking about? Would that be akin to taking the call to llama_grammar_advance_stack that's inside llama_grammar_reject_candidates_for_stack and passing in the list of candidates, so that we only advance grammar with stacks that could match the candidates...?

I could be seriously off base, but regardless, I'm excited to see what you have in store -- especially if it's a generally proven optimization commonly applied to parsers. :)

Edit: half of me is hoping you'll have implemented this alternative approach by the end of the weekend, the other half is hoping my silly head set code won't have bitrotten too much with fancy new alternative logic :-p

haha -- fair enough. :D I'll take a stab at it and see how far I get. If I can't get something working in a weekend, it might be a deadend, so we'll see what Monday brings.

BTW, hacking on this with you has been a TON of fun -- thank you!!

HanClinto avatar Apr 12 '24 17:04 HanClinto

So something like "a"{2,5} would hydrate to something like: (essentially "a"{2} ("a"{3} | ))

{LLAMA_GRETYPE_CHAR, 'a'},
{LLAMA_GRETYPE_REPEAT_N, 2},
{LLAMA_GRETYPE_CHAR, 'a'},
{LLAMA_GRETYPE_REPEAT_N, 3},
{LLAMA_GRETYPE_ALT, 0},
{LLAMA_GRETYPE_END, 0}

@HanClinto Might want to add different types for repeat exactly N, repeat unbounded, and repeat up to N times.

The way I'm imagining it, LLAMA_GRETYPE_REPEAT_N would mean "repeat exactly N times" (with a possible special case where value == 0 means to repeat unbounded).

I was originally imagining that repeating up to N times would be handled with LLAMA_GRETYPE_REPEAT_N with value == N, and a blank alternate rule to follow (similar to how S* --> S' ::= S S' | so S{0,N} --> S' ::= S{N} |) -- but now that you say this, I realize that would mean "a"{2,5} would equate to exactly 2 copies or exactly 5 copies -- but not 3 or 4. You're right -- I might need a second token that means "up to N".
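
In that two-token scheme, "a"{2,5} might instead hydrate to something like this (LLAMA_GRETYPE_REPEAT_EXACT_N / LLAMA_GRETYPE_REPEAT_UPTO_N are hypothetical names, in the spirit of the REPEAT_N sketch above):

{LLAMA_GRETYPE_CHAR, 'a'},
{LLAMA_GRETYPE_REPEAT_EXACT_N, 2},
{LLAMA_GRETYPE_CHAR, 'a'},
{LLAMA_GRETYPE_REPEAT_UPTO_N, 3},
{LLAMA_GRETYPE_END, 0}

i.e. exactly two copies pinned by the first pair, then zero to three more from the second, covering 2 through 5 in total.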

Thank you for helping me think this through! I'm still not confident that it's going to work, but I'm increasingly confident that it would be worth exploring.

Regardless, unless I'm really off base with my understanding of what the head set optimization involves, I don't imagine that this new token would step on those toes at all, and I think that's worth exploring independently.

HanClinto avatar Apr 12 '24 17:04 HanClinto

Okay -- I think that's probably enough of an encouragement for me to at least try taking a stab at it this weekend.

Yayyy, good luck!

I don't think I'm as familiar with classical grammar optimizations as you are, but this sounds really reasonable.

I wouldn't say it's such familiar territory for me, but that one is literally the only thing I can remember from the compilers class I took at uni (also, my assignment was buggy beyond hell, so...)

Looking through the code, I think that there is a LOT of possible optimization work to be done in llama_grammar_reject_candidates / llama_grammar_reject_candidates_for_stack

(Ugh, this one and its vectors just look like a ripe fractal fruit looking deceitfully low-hanging, yet a bit elusive)

which I think is the sort of thing that you're talking about? Would that be akin to taking the call to llama_grammar_advance_stack that's inside llama_grammar_reject_candidates_for_stack and passing in the list of candidates, so that we only advance grammar with stacks that could match the candidates...?

Yes, tbh I'm still not convinced it will work, but essentially today we populate the possible next stacks before consuming chars (so we expand all possible alternatives); the idea would be to only expand alternatives that start w/ any of the candidates' first chars, and/or to reject candidates that don't start with a char in the union of the head sets of all the next things on the stack. Something 🔁 with chickens and eggs.
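
A very rough sketch of that gating (nothing here is real llama.cpp API; assume the head set of each alternative has been precomputed):

#include <cstdint>
#include <set>
#include <vector>

// Hypothetical: one alternative expansion of a rule, with the set of
// code points that can legally start it (its "head set") precomputed.
struct hyp_alt {
    std::set<uint32_t> head_set;
    int                rule_id; // which expansion to push if kept
};

// Only expand alternatives whose head set intersects the candidates'
// first chars, instead of pushing every alternative onto the stacks.
static std::vector<hyp_alt> hyp_filter_alts(
        const std::vector<hyp_alt>  & alts,
        const std::vector<uint32_t> & candidate_first_chars) {
    std::vector<hyp_alt> kept;
    for (const auto & alt : alts) {
        for (const uint32_t c : candidate_first_chars) {
            if (alt.head_set.count(c)) {
                kept.push_back(alt);
                break;
            }
        }
    }
    return kept;
}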

BTW, hacking on this with you has been a TON of fun -- thank you!!

Likewise, and thanks to you too! Looking forward to continued fun!!

ochafik avatar Apr 12 '24 18:04 ochafik

Okay -- I think that's probably enough of an encouragement for me to at least try taking a stab at it this weekend.

Yayyy, good luck!

Thanks! I'll need it... :D

Looking through the code, I think that there is a LOT of possible optimization work to be done in llama_grammar_reject_candidates / llama_grammar_reject_candidates_for_stack

(Ugh, this one and its vectors just look like a ripe fractal fruit looking deceitfully low-hanging, yet a bit elusive)

Yeah, seriously. I follow the code mentally where llama_grammar_reject_candidates calls llama_grammar_reject_candidates_for_stack, but then llama_grammar_reject_candidates_for_stack accepts the grammar and calls llama_grammar_reject_candidates again and I get all turned around. I've been looking at this code for weeks now, and it still makes my brain feel like spaghetti.

which I think is the sort of thing that you're talking about? Would that be akin to taking the call to llama_grammar_advance_stack that's inside llama_grammar_reject_candidates_for_stack and passing in the list of candidates, so that we only advance grammar with stacks that could match the candidates...?

Yes, tbh I'm still not convinced it will work, but essentially today we populate the possible next stacks before consuming chars (so we expand all possible alternatives); the idea would be to only expand alternatives that start w/ any of the candidates' first chars, and/or to reject candidates that don't start with a char in the union of the head sets of all the next things on the stack. Something 🔁 with chickens and eggs.

Yeah, this feels really solid.

I am still trying to learn more about how sampling works, but apparently sampling doesn't always take the token with the highest logit score, because it's sampled via distribution with something to do with temperature and blah blah blah?

https://github.com/ggerganov/llama.cpp/blob/master/llama.cpp#L12753

It feels like if we're doing greedy sampling (where we always take the token with the highest logit no matter what), then there's opportunity for optimization where we sort the entire candidates list by logit score (expensive, since this list is big) -- but then we pass that sorted candidates list into our reject_candidates function (which might now need to have its logic inverted to be an accept_candidates function?) to find the first token with at least one matching stack, and then as soon as it finds it, do an early exit so that we don't bother comparing all of the other candidates against every stack (which, best I can tell, is what llama_grammar_reject_candidates does right now -- at least if we're in the is_resampling pass, cf. #4306)

Sorting all candidates by their logit score is expensive, but hopefully the gains of being able to early exit (and not compare every candidate against every stack) would far outweigh that cost.
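
Roughly what I'm picturing (hypothetical names; the real llama.cpp candidate and grammar-stack types differ):

#include <algorithm>
#include <vector>

struct hyp_candidate {
    int   token;
    float logit;
};

// Stand-in for "does at least one grammar stack accept this token?";
// in llama.cpp this would consult the grammar stacks directly.
using hyp_accept_fn = bool (*)(int token);

// Sort once by logit, then return the first grammar-legal token and
// early-exit, instead of checking every candidate against every stack.
static int hyp_greedy_pick(std::vector<hyp_candidate> candidates, hyp_accept_fn accepts) {
    std::sort(candidates.begin(), candidates.end(),
              [](const hyp_candidate & a, const hyp_candidate & b) {
                  return a.logit > b.logit; // highest logit first
              });
    for (const auto & c : candidates) {
        if (accepts(c.token)) {
            return c.token;
        }
    }
    return -1; // no candidate is grammar-legal
}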

Anyways, that all requires me learning more about the various sampling methods, so I haven't fully dug into that yet. :)

BTW, hacking on this with you has been a TON of fun -- thank you!!

Likewise, and thanks to you too! Looking forward to continued fun!!

Huzzah! This is so satisfying -- it's scratching all the same itches that I get when I play things like Spacechem or Infinifactory, but shaving a few cycles off of my solutions in those puzzles doesn't mean as much as this does. We may be optimizing niche and edge cases, but I truly believe that reliable output of LLMs will be one of the keys to unlock even more productivity for democratized AI, so I get a big charge out of seeing our performance numbers tick up -- PLUS it's a ton of fun. :)

HanClinto avatar Apr 12 '24 18:04 HanClinto

(Ugh, this one and its vectors just look like a ripe fractal fruit looking deceitfully low-hanging, yet a bit elusive)

Yeah, seriously. I follow the code mentally where llama_grammar_reject_candidates calls llama_grammar_reject_candidates_for_stack, but then llama_grammar_reject_candidates_for_stack accepts the grammar and calls llama_grammar_reject_candidates again and I get all turned around. I've been looking at this code for weeks now, and it still makes my brain feel like spaghetti.

Yeah I still don’t understand this bit ((how) does rejects rejection… work?), but your stack deduping should now make it practical to follow step by step. Maybe pretty print the state and candidates at every stack change with a simple grammar, or something.

Anyways, that all requires me learning more about the various sampling methods, so I haven't fully dug into that yet. :)

Found this page to be a great intro (then the mirostat paper maybe)

It feels like if we're doing greedy sampling (where we always take the token with the highest logit no matter what), then there's opportunity for optimization where we sort the entire candidates list by logit score (expensive, since this list is big)

Expensive but maybe not so much compared to the GPU-burning rest of inference.

One idea I’ve toyed with is to restrict the number of logits actually computed in the first place. With grammar constraints (or, say, ASCII-only output constraints) only some of the tokens are acceptable, so one could do a sparse matrix multiplication at the last stage of the transformer to only compute the logits we might care about. But I'm not sure if GGML has indexed / sparse matrix mult support yet, and getting the list of acceptable tokens would need to be fast to compute in the dynamic grammar case (precomputing the valid chars / tokens for each rule / stack state would be key here).
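
Roughly the shape of it in plain C++ (purely illustrative; this ignores GGML's actual tensor API and batching):

#include <cstddef>
#include <vector>

// Compute logits only for grammar-allowed tokens by dotting the final
// hidden state with just those rows of the output matrix, instead of
// doing the full [n_vocab x n_embd] matmul.
static std::vector<float> hyp_sparse_logits(
        const std::vector<float> & hidden,         // [n_embd]
        const std::vector<float> & out_w,          // [n_vocab * n_embd], row-major
        const std::vector<int>   & allowed_tokens) // token ids the grammar accepts
{
    const size_t n_embd = hidden.size();
    std::vector<float> logits(allowed_tokens.size(), 0.0f);
    for (size_t i = 0; i < allowed_tokens.size(); ++i) {
        const float * row = out_w.data() + (size_t) allowed_tokens[i] * n_embd;
        for (size_t j = 0; j < n_embd; ++j) {
            logits[i] += row[j] * hidden[j];
        }
    }
    return logits;
}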

but then we pass that sorted candidates list into our reject_candidates function (which might now need to have its logic inverted to be an accept_candidates function?) to find the first token with at least one matching stack, and then as soon as it finds it, do an early exit so that we don't bother comparing all of the other candidates against every stack (which, best I can tell, is what llama_grammar_reject_candidates does right now -- at least if we're in the is_resampling pass (c.f. https://github.com/ggerganov/llama.cpp/pull/4306 ))

I think you’re ahead of me in understanding that code, I’ll have to dive into it again. Maybe we need a live session with sketches in Freeform 🙋‍♂️

This is so satisfying -- it's scratching all the same itches that I get when I play things like Spacechem or Infinifactory, but shaving a few cycles off of my solutions in those puzzles doesn't mean as much as this does.

Haha, I’ve got enough of a programming addiction already, managed to stay away from games but these look fun!

We may be optimizing niche and edge cases, but I truly believe that reliable output of LLMs will be one of the keys to unlock even more productivity for democratized AI, so I get a big charge out of seeing our performance numbers tick up -- PLUS it's a ton of fun. :)

Yeah performance optimization is very rewarding, but we should be attentive to diminishing returns here (hence the importance of testing on real world grammars). My goal has been to make llama.cpp agents w/ grammar-constrained tool-calling as fast as possible (https://github.com/ggerganov/llama.cpp/pull/6389) so I can automate myself away, trying to keep eyes on the prize 👀

ochafik avatar Apr 13 '24 11:04 ochafik

@HanClinto I did a first bit of benchmarking / comparison of outputs on @AlienKevin's grammar from https://github.com/ggerganov/llama.cpp/issues/4218#issuecomment-1836540046 on this PR and it seems to be ~~8% faster~~ (edit) 5% slower than master on my Mac.

Here are the instructions (download issue4218.gbnf and issue4218.txt):

( export COMMON_ARGS=(
    -mu https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
    --prompt-cache issue4218.bin
    --grammar-file issue4218.gbnf
    -f issue4218.txt
    -c 3400
  ) && \
  hyperfine --warmup 1 --runs 10 \
    -L branch grammar-reps,master \
    --setup "\
      git checkout {branch} && \
      make clean && make -j LLAMA_CURL=1 main && \
      rm -f issue4218.bin && \
      ./main ${COMMON_ARGS[*]} -n 1" \
    "BRANCH={branch} \
      ./main ${COMMON_ARGS[*]} -n 256 --prompt-cache-ro --seed 12345 --no-display-prompt" \
)

I'll try and run more benchmarks, and will revert the changes for + / * / ? if this regression is confirmed.

(Interestingly, that same grammar makes master segfault with sequences that are a bit too long; will probably investigate / report separately. Edit: seems to be caused by -ctk q4_k regardless of grammar --> https://github.com/ggerganov/llama.cpp/issues/5652)

ochafik avatar Apr 15 '24 19:04 ochafik

Great benchmarking and sleuthing, @ochafik !!

I'll try and run more benchmarks, and will revert the changes for + / * / ? if this regression is confirmed.

Yeah, it would be good to get a few more cases before we go too hard on this. I've got a brutal, pretty heavyweight SQL SELECT grammar that I'll dig up; hopefully that will help put this through its paces as well.

Really good stuff!

HanClinto avatar Apr 16 '24 02:04 HanClinto

Here are the instructions (download issue4218.gbnf and issue4218.txt):

Thank you VERY much for such a detailed benchmark repro -- I was able to run this very easily!

@HanClinto I did a first bit of benchmarking / comparison of outputs on @AlienKevin's grammar from #4218 (comment) on this PR and it seems to be ~~8% faster~~ (edit) 5% slower than master on my Mac.

I also got slightly faster performance on master than on your branch with that test benchmark:

  BRANCH=master       ./main -mu https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf --prompt-cache issue4218.bin --grammar-file issue4218.gbnf -f issue4218.txt -c 3400 -n 256 --prompt-cache-ro --seed 12345 --no-display-prompt ran
    1.09 ± 0.14 times faster than BRANCH=grammar-reps       ./main -mu https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf --prompt-cache issue4218.bin --grammar-file issue4218.gbnf -f issue4218.txt -c 3400 -n 256 --prompt-cache-ro --seed 12345 --no-display-prompt

Even so, even if we end up reverting the rewrite of *, +, and ?, the other gains you've made with the addition of the {m,n} operator are still super worth it!

HanClinto avatar Apr 16 '24 03:04 HanClinto

@HanClinto did you get a chance to fiddle w/ an alternative repetition implementation? I'm itching to try something tonight tbh :-D

ochafik avatar Apr 17 '24 18:04 ochafik

@HanClinto did you get a chance to fiddle w/ an alternative repetition implementation? I'm itching to try something tonight tbh :-D

I did not -- things got crazy for me over the weekend, and it's unlikely I'll get to it for the next few days (I'm helping coach a middle-school robotics team and they have their big competition for the next 3 days starting tomorrow).

Go for it!!

HanClinto avatar Apr 17 '24 20:04 HanClinto

I'm helping coach a middle-school robotics team and they have their big competition for the next 3 days starting tomorrow

🤞🤖🦾

ochafik avatar Apr 17 '24 21:04 ochafik

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 456 iterations 🚀

Details (for performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=10290.2ms p(95)=26528.7ms fails=, finish reason: stop=402 truncated=54
  • Prompt processing (pp): avg=119.58tk/s p(95)=574.39tk/s
  • Token generation (tg): avg=23.88tk/s p(95)=35.25tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=grammar-reps commit=2ecc2ae90020c4ab5863ab8ce090d3de44c5268f

prompt_tokens_seconds

[Chart: llamacpp:prompt_tokens_seconds -- llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 456 iterations]

predicted_tokens_seconds

[Chart: llamacpp:predicted_tokens_seconds -- llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 456 iterations]

Details

kv_cache_usage_ratio

[Chart: llamacpp:kv_cache_usage_ratio -- llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 456 iterations]

requests_processing

[Chart: llamacpp:requests_processing -- llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 456 iterations]

github-actions[bot] avatar Apr 20 '24 00:04 github-actions[bot]

@HanClinto I've taken a (permanent?) break from the alternative repetition implementation front.

Instead, I've updated the grammar parser's rewrites to match more closely the existing + / ? / * ones (and to match their performance), and I've updated the JSON converters to make use of the new syntax. Doesn't seem to make schema-constrained generation any faster, but hey, at least it's not any slower 😅 (and the resulting grammars are simpler, cf. test-json-schema-to-grammar.cpp)

hyperfine --warmup 1 --runs 10 \
    -L branch grammar-reps,master \
    --setup 'git checkout {branch} && \
             make clean && \
             make -j LLAMA_CURL=1 main' \
    'BRANCH={branch} \
        ./main \
            -mu https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf \
            -j '"'"'{"items": {"type": "number"}, "minItems": 10, "maxItems": 100}'"'"' \
            -p "JSON list of 50 integers starting from 100000" \
            --seed 12345 --no-display-prompt'

ochafik avatar Apr 20 '24 00:04 ochafik

Instead, I've updated the grammar parser's rewrites to match more closely the existing + / ? / * ones (and to match their performance), and I've updated the JSON converters to make use of the new syntax. Doesn't seem to make schema-constrained generation any faster, but hey, at least it's not any slower 😅 (and the resulting grammars are simpler, cf. test-json-schema-to-grammar.cpp)

That feels like a pretty clean approach. I'd be interested in revisiting the rewrite later, but I like that this change (currently) doesn't take too big of a bite.

This looks really good, and I'd like to see this get merged in! I'm going to do a final review and hopefully we can put a bow on this one.

HanClinto avatar Apr 24 '24 03:04 HanClinto

@HanClinto thanks for the review!

After this I'll take a deeper look at the optional whitespace in JSON (removing it seems 15% faster; I wonder if keeping some of it improves quality), and update or retire grammars/json.gbnf (or caveat it as just illustrative and recommend -j '{}' instead; every time I try to use it w/ Hermes-2-Pro-Mistral-7B.Q4_K_M it ends up in infinite newline tears :-( )

ochafik avatar Apr 25 '24 00:04 ochafik

Love this work @ochafik! I plan to take a look soon; I need to wrap my head around the parser changes and the new JSON schema converter (now in C++?).

ejones avatar Apr 26 '24 14:04 ejones

Love this work @ochafik!

@ejones thanks!!

need to wrap my head around the parser changes

Love your work on grammar sampling btw! Took me a while to wrap my own head around it, it's smart!

(Some related changes you might have missed: https://github.com/ggerganov/llama.cpp/pull/6616, https://github.com/ggerganov/llama.cpp/pull/6609, and upcoming https://github.com/ggerganov/llama.cpp/pull/6644)

and the new JSON schema converter (now in C++?).

Hehe yeah, as I was improving the schema support in https://github.com/ggerganov/llama.cpp/pull/5978 I realised I had to port the changes to JS, and thought it would be cool to have it in the server (like llama-cpp-python & Together AI do, w/ param "response_format": {"type": "json_object", "schema": ...}). There's a test that keeps the 3 versions in sync but I reckon we'll want to ditch the JS & Python versions at some point (replacing them w/ a C++ cli, once the C++ version can use libcurl to fetch remote $refs).

(other json updates: https://github.com/ggerganov/llama.cpp/pull/6555, https://github.com/ggerganov/llama.cpp/pull/6232, https://github.com/ggerganov/llama.cpp/pull/6661, https://github.com/ggerganov/llama.cpp/pull/6659)

ochafik avatar Apr 26 '24 14:04 ochafik