guidance
guidance copied to clipboard
[WIP] Replace EarleyParser with lexeme-based rust implementation
To preface this all, I want to note that the overall user experience of guidance should change minimally if at all.
This is primarily a behind-the-scenes change that offers
- Speedups (sometimes 6x, at least on my machine!) over the current implementation of
guidance - Some new pieces of plumbing (not yet a part of the public API) that aim to simplify development of programming language (PL) grammars
- Simplification of the core LLM-parser loop that makes our codebase a little less scary
llguidance
- llguidance is a library designed to handle the "low level" components of
guidance(mainly the implementation of the parser and its interactions with the tokenizer) - These components have been pulled out of
guidanceso that they can be reused in the server-side implementation of theAzureGuidanceendpoints (ensuring that behavior between these remote models and localguidance-controlled models are as consistent as possible)
guidance
Parser
- Removed
EarleyParserclass that operated at the byte level- Byte-by-byte consumption (usage: search token trie, using parser to validate acceptable tokens)
- Parse-tree built over individual bytes
- Introduced
TokenParserclass that operates at the token level- Token-by-token consumption (usage: directly compute token mask)
- Parse-tree built over lexemes (more on these later, but larger chunks of work than bytes)
- Still an Earley parser under the hood
- Light wrapper around rust-based
LLInterpreterclass fromllguidance
- Introduced
ByteParserclass which wraps aTokenParserand aByteTokenizer(a tokenizer with tokens directly corresponding to individual bytes plus aBOS/EOStoken)- Only used to replicate old byte-by-byte consumption and to give grammars a
matchmethod - These features only used in test suite (attempted to make changes to tests as minimal as possible)
- Hope to deprecate this class in the future (important bits could maybe be subsumed into
Mockmodel class?)
- Only used to replicate old byte-by-byte consumption and to give grammars a
Engine
- Removed token-trie search from
Engineclass in favor of directly using the mask from theTokenParser - A lot of code got deleted, hopefully making this implementation a lot less scary to newcomers (even if the scary bits just got pushed down into rust...)
- Current guidance allows grammars that are in an "accepting state" to sample a token without constraints. If the sampled token is not accepted by the grammar, then it will be treated as an EOS token and terminate the grammar (@Harsha-Nori @slundberg maybe one of you can confirm that I have this right?). This PR keeps this behavior, but note THIS LIKELY DOES NOT ALIGN WITH AZURE SERVER-SIDE IMPLEMENTATION. To ensure that this behavior is maintained (maybe worth a discussion), it should be moved into llguidance (@mmoskal)
Serialization
- The
LLInterpreterunderlying theTokenParserexpects JSON-serialized grammars- This serialization format is consistent with the format expected by remote
AzureGuidanceendpoints - The
RemoteEnginenow expects this serialization format (no more protobuf)
- This serialization format is consistent with the format expected by remote
LLInterpreterreturns JSON-serialized data (not exactly the same as what's returned byAzureGuidanceendpoints, but there is a lot of shared structure)- New file
_schema.pycontainspydanticschemas used to validate/parse these response structures
- New file
EngineCallResponsenow JSON serialized/validated withpydantic(no more protobuf)
New primitives:
GenLexemeSubgrammarRegularGrammar
To understand the new primitives, we need to understand how the new parser is different from the old one. While both are Earley parsers that support general context-free grammars, the smallest "atoms" that the new parser works with are more coarse-grained than the old one. The new parser works with lexemes, while the old one works with bytes.
Roughly, lexemes correspond to regular expressions (and string literals). These are larger chunks of text, making the parser more efficient in a lot of cases. Because lexemes are regular, the lexer (the lexeme sub-parser) can run much more quickly than the outer Earley parser.
While the Earley parser is able to handle ambiguities (e.g one_or_more("a") + one_or_more(select("a", "b")) -- which expression is responsible for the second "a" in "aab"?), the lexer can't. We need a deterministic set of rules that tells us how any given string should be lexed (what lexemes are responsible for what parts of the text).
Lexemes can be lazy or greedy.
- Lazy:
- Completes as early as possible
- A lazy lexeme will complete as soon as it matches, e.g. a lazy lexeme
r"a+"will only ever generate a single "a" andr"a*"will only ever generate the empty string.
- Greedy:
- Completes as late as possible
- A greedy lexeme will not complete until it fails to match, e.g. a greedy lexeme
r"a+"can produce as many "a"s as it wants before moving on to the next lexeme.
Gen
gen is composed of two sub-expressions - the "body" regex and the (optional) "stop" regex. If no body regex is passed, it defaults to r"(?s:.*)", i.e. .* that additionally matches the newline character.
When the stop regex is provided, gen behaves as a lazy lexeme. As soon as the full body+stop regex matches the generated text, we exit the gen (discarding the "stop" text) and move on to the next lexeme. This ensures that gen actually stops when the stop expression is produced.
When no stop regex is provided, it behaves as a greedy lexeme (with one caveat -- it can be terminated by an EOS token, which stops with lazy semantics). Note that regex is now just an alias for gen with no stop expression.
Examples:
gen(regex=r"[0-9]+") + "xyz"- No stop provided, so the gen is greedy.
- Must produce at least one digit, after which it is allowed to either produce more digits, the EOS token, or the letter "x".
- If it produces "x", then the string "yz" will be forced.
- If it produces the EOS token (or hits max_tokens), then the string "xyz" will be forced (resulting string will NOT have the EOS -- it will be dropped).
gen() + "xyz"- No stop provided, so the gen is greedy.
- Implicitly allowed to generate anything.
- Generating "x" will NOT force "yz", since generating "x" does not terminate the gen (
r"(?s:.*)"has not yet failed to match). Note that current guidance ALSO won't force "yz", as the parser will be in a "superposition" that doesn't know whether or not thegenhas completed. - Only an EOS (or hitting max_tokens) can terminate the gen
- GOTCHA (difference from current guidance):
- Current guidance: any string ending in "xyz" is allowed to terminate the grammar (i.e. an EOS is allowed but not forced). In practice, this probably won't happen often.
- This PR: the gen must first be terminated, at which point an "xyz" will be forced and the overall grammar will terminate
gen(regex=r"[0-9]+") + "123"- Greedy
- Generating "1" will not force "23"...
- (Same "gotcha" as above): only way to complete this expression is for the model to generate at least one number then the EOS (or it hits the token limit), at which poing "123" will be forced
gen(regex=r"[0-9]+", stop="123")- Lazy
- Will produce any number of digits, terminating as soon as "123" is generated (e.g. "2346452341412134123")
The subtle changes around EOS should be a fairly small detail .
Lexeme
Not (yet) part of "public" api (available via guidance._grammar.Lexeme or guidance.library._subgrammar.lexeme). Should only really be used when writing Subgrammars / translating EBNF.
- Consist of a single regular expression
- They are always greedy.
- Model not allowed EOS as an "out" (unless we are at the end of a grammar, of course)
- TODO:
lexemeto support acontextualflag (more on that later)
Subgrammar
Not (yet) part of "public" api (available via guidance._grammar.Subgrammar or guidance.library._subgrammar.subgrammar). Mostly exists to better support generating programming languages.
- Wraps a guidance grammar, which will then be treated as "atomic"/"terminal" from the perspective of the outer grammar's Earley parser (i.e. treated as a greedy lexeme).
- Can be terminated by an EOS or by generating non-matching string (e.g. if
jsonis a subgrammar,json() + "```"will terminate if a backtick is generated after some valid JSON) "ignore_regex"kwarg specifies a regular expression that will be "ignored" between lexemes. This can be used to allow flexible whitespace when generating JSON or code, for example.- Non-contextual lexemes given priority whenever any lexeme is being generated (e.g. to support keywords in PLs)
- Note:
jsonhas been reimplemented as aSubgrammar.
RegularGrammar
Not (yet) part of "public" api (available via guidance._grammar.RegularGrammar or guidance.library._grammar.as_regular_grammar)
NOTE: "manually" building regex-esque grammars should now be discouraged
- e.g.
select(["0", char_range("1", "9") + zero_or_more(char_range("0", "9")])should be rewritten asregex(r"0|(?:[1-9][0-9]*)")
This is because the lexemes here are individual characters, requiring the expensive Earley parser to run. Rewriting as a regex makes the entire grammar into a lexeme, allowing the cheap lexer to do all the work.
If directly writing the regex is not possible, as_regular_grammar (name subject to change) can wrap a grammar like select(["0", char_range("1", "9") + zero_or_more(char_range("0", "9")]) and (try to) convert it into a regex lexeme. Grammars that are not regular will fail this construction.
In the future, it would be nice to automatically wrap grammars when we can, preventing users from having to think about this.
Deprecations:
commit_point(raisesNotImplementedError, may reimplement in the future?)- was only used in
gento support the current "stop" mechanics and in tool calling
- was only used in
gendoes not support tool calling (working on this, hopefully will have something working before this PR goes through)
Biggest gotchas / changes from current guidance
regex(r"\d*") + "7"- Current guidance is allowed to emit an EOS after any sequence of digits ending in "7"
- Under this PR, guidance is allowed to emit an EOS after any sequence of digits, at which point a "7" will be forced
r"\d"now matches unicode digits
TODOs
- server-side engine variables
- EOS can't be explicitly referenced in the grammar, only implicitly at the end of
Gens orSubgrammars
- EOS can't be explicitly referenced in the grammar, only implicitly at the end of
- stopping gen on active role end
- Should be more trivial with server-side engine variables
:warning: Please install the to ensure uploads and comments are reliably processed by Codecov.
Codecov Report
Attention: Patch coverage is 80.97754% with 144 lines in your changes missing coverage. Please review.
Project coverage is 60.86%. Comparing base (
c4837b2) to head (ef0e182).
:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@ Coverage Diff @@
## main #951 +/- ##
===========================================
+ Coverage 49.38% 60.86% +11.47%
===========================================
Files 63 62 -1
Lines 4898 4392 -506
===========================================
+ Hits 2419 2673 +254
+ Misses 2479 1719 -760
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
The following code is resulting in a constraint exception that I can't repro on the main branch:
lm2 = models.OpenAI(model="gpt-4o")
with system():
lm2 += "You are a helpful assistant"
with user():
lm2 += "What is this shape?"
with assistant():
lm2 += gen(max_tokens=200, temperature=0.1)
It seems to trigger if gpt-4o tries to generate the text <|invalid_special_token|>. I notice that a generation finishes from the OpenAI API, then _reset_shared_data() is called, then another generation is retried, then the exception is thrown.
It seems to trigger if gpt-4o tries to generate the text <|invalid_special_token|>. I notice that a generation finishes from the OpenAI API, then _reset_shared_data() is called, then another generation is retried, then the exception is thrown.
I can reproduce this in the situation where <|invalid_special_token|> (which is bizarrely the EOS token?) is produced as a sequence of sub-tokens. @mmoskal we're not backtracking to re-code tokens unless we just fast-forwarded, right? I believe OG guidance is more liberal about when it re-codes.
Anyway, this means that that string is not interpreted as an EOS, and the grammar isn't stopped despite the stream from OpenAI stopping, causing this error. I can "hack" a solution by interpreting the stream terminating as an EOS (maybe looking at the last bit of generated text to check if it corresponded to an EOS?) but that would be maybe hiding a bigger problem. @mmoskal what are your thoughts on retokenization here?
Can you expand a bit on preferring regular expressions to things like zero_or_more() etc.? I would have thought the latter would be much more user-friendly to non-programmers (and to programmers for a bunch of cases).
Can you expand a bit on preferring regular expressions to things like
zero_or_more()etc.? I would have thought the latter would be much more user-friendly to non-programmers (and to programmers for a bunch of cases).
Yeah, of course. Just to be clear, I'm not saying we discourage usage of zero_or_more, etc. in general.
Rather, I'm suggesting that the objects they are being applied to should be "as big as possible". If they operate on bytes, then the EarleyParser will have as many state-sets as there are bytes. If they operate on larger string literals or regular expressions, the number of state-sets instead corresponds to "chunks" of bytes, i.e. we get to divide by some (hopefully big) number.
Worst case, the Earley algorithm has cubic time complexity in this number of state-sets (probably usually more like quadratic), so any chunking we can do can improve performance substantially. Mind you, I think we can automatically do a lot of this work (inferring regular subpatterns in grammars) for our users so they don't have to think about this.
It's totally worth doing some benchmarking to back up the theoretical claims I just made.
The following code is resulting in a constraint exception that I can't repro on the
mainbranch:lm2 = models.OpenAI(model="gpt-4o") with system(): lm2 += "You are a helpful assistant" with user(): lm2 += "What is this shape?" with assistant(): lm2 += gen(max_tokens=200, temperature=0.1)It seems to trigger if gpt-4o tries to generate the text
<|invalid_special_token|>. I notice that a generation finishes from the OpenAI API, then_reset_shared_data()is called, then another generation is retried, then the exception is thrown.
@nking-1 just pushed a fix to this. I realized that gpt-4o should never actually try to generate that text... we were putting words in its mouth and then creating downstream problems for ourselves. Give this a shot!
If we're switching to JSON serialisation, do we have a schema defined for that?
If we're switching to JSON serialisation, do we have a schema defined for that?
It would be good to have some pydantic schemas defined somewhere. Currently the source of truth is in the rust code here: https://github.com/microsoft/llguidance/blob/0201ad6be76bf15302c3d86bc649ab812840afe8/parser/src/api.rs#L4 (tag @mmoskal)
Looks good, great work!! :)
Only one main question:
regex(r"\d*") + "7" Current guidance is allowed to emit an EOS after any sequence of digits ending in "7" Under this PR, guidance is allowed to emit an EOS after any sequence of digits, at which point a "7" will be forced
This seems like it would be very surprising since the intention of the first line is clearly to match numbers that end in 7. Seems like an issue arising from the lexeme boundaries. I wonder if this could go away by greedily consuming things after a lexeme as long as the expression remains regular?
...not a blocking issue though, excited to see a merge :)
Another note:
Current guidance allows grammars that are in an "accepting state" to sample a token without constraints. If the sampled token is not accepted by the grammar, then it will be treated as an EOS token and terminate the grammar (@Harsha-Nori @slundberg maybe one of you can confirm that I have this right?).
Yes, that is needed for variable length patterns since you don't want to force the model to stop early.
Merged -- thanks @hudson-ai, @mmoskal and everyone who took time to review this (@nking-1 @nopdive @riedgar-ms @slundberg) :)