Clint Herron comments

44 comments by Clint Herron

Token generation is extremely slow when using 13B models on an M1 Pro with llama.cpp, but it runs at a fine speed with Dalai (which uses an older version of llama.cpp)

@ssuukk Does adding `--mlock` help in your situation, or no?

llama : speed-up grammar sampling

I've been digging into this lately, and I've been using the integration tests in #6472 to do some crude performance profiling. I've definitely seen the sort of dramatic stack expansion...

llama : speed-up grammar sampling

I'm not very familiar with the current setup of our CI performance profilers -- if I were to make improvements to the grammar engine, would those speed improvements show up...

llama : speed-up grammar sampling

@ochafik That indeed is a massive improvement! Testing your grammar against `1a2a3a4a5` gives:

```
Parsing character 0 ('1'), stack size 1
Parsing character 1 ('a'), stack size 3
Parsing character...
```

llama : speed-up grammar sampling

> @HanClinto I'd be inclined to detect some easily rewritable grammar cases on the fly _and_ explode when the grammar becomes too combinatorial (w/ a link to a "performance tips"...

llama : speed-up grammar sampling

I was able to make a pretty big speed improvement last night in the case of ambiguous alternate grammars. In the case of ambiguous grammars, the stacks are duplicated, and...

llama : speed-up grammar sampling

BTW, I spent this evening trying two different optimization experiments -- both turned up zero improvement. The first was attempting to change `reject_candidates` so that it modifies the candidates array in-place...

[WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python)

This is a bit off-topic, but I noticed in your example call:

```
python -m examples.agent \
  --model mixtral-8x7b-instruct-v0.1.Q8_0.gguf \
  --tools examples/agent/tools/example_math_tools.py \
  --greedy \
```

Is there a particular...

grammars: x{min,max} repetition operator

I like the pretty-print options that you've put in here. One thing we might want to consider is porting your pretty-print functions to [gbnf-validator.cpp](https://github.com/ggerganov/llama.cpp/blob/master/examples/gbnf-validator/gbnf-validator.cpp) so that people...

grammars: x{min,max} repetition operator

> cc/ @HanClinto (thanks for casting doubts on the rules rewrite in https://github.com/ggerganov/llama.cpp/issues/4218#issuecomment-2042985661 !)

haha -- for sure! If nothing else, hopefully I'm good at casting doubt. :) Daydreaming about...
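For readers unfamiliar with the `x{min,max}` operator: it bounds how many times an item may repeat. The sketch below is an illustration only, not the PR's actual implementation. It shows one way a bounded repetition could be desugared into plain GBNF: `min` mandatory copies followed by nested optionals. The helper name `expand_repetition` is hypothetical. The nested form `(x (x)?)?` is used instead of a flat `x? x?` because the flat form lets a single `x` match in two different ways, the kind of grammar ambiguity that, as the grammar-sampling comments note, causes parser stacks to be duplicated.

```python
from typing import Optional


def expand_repetition(item: str, min_times: int, max_times: Optional[int]) -> str:
    """Hypothetical helper: rewrite item{min,max} as GBNF without the {} operator.

    max_times=None means the repetition is unbounded (item{min,}).
    """
    # The first min_times occurrences are mandatory.
    parts = [item] * min_times
    if max_times is None:
        # Unbounded tail: zero or more further occurrences.
        parts.append(f"{item}*")
    else:
        # Build nested optionals, innermost first: (x)? -> (x (x)?)? -> ...
        # Nesting avoids the ambiguity of a flat "x? x?" sequence.
        optional = ""
        for _ in range(max_times - min_times):
            optional = f"({item} {optional})?" if optional else f"({item})?"
        if optional:
            parts.append(optional)
    return " ".join(parts)


if __name__ == "__main__":
    print(expand_repetition("x", 2, 4))     # x x (x (x)?)?
    print(expand_repetition("x", 1, None))  # x x*
```

Each optional level can only be entered if the previous one matched, so an input of `k` extra items has exactly one parse.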