llama.cpp
llama.cpp copied to clipboard
grammar: handle misplaced special regex chars [*+?]
Don't try to access the previous element of a regex seq if the vector is empty. Prevents segment violation failures for regexes like ^?[0-9]$ and ^(?:a|b)$.
Fixes: #13390
Don't try to access the previous element of a regex seq if the vector is empty. Prevents segment violation failures for regexes like
^?[0-9]$and^(?:a|b)$.
The latter should have been handled (non-capturing groups are unsupported) here, but seems to just pass it on as if it was a normal pattern: https://github.com/ggml-org/llama.cpp/blob/19494ff82f40bfde882782b0dff414e4f806d235/common/json-schema-to-grammar.cpp#L450-L457
@rick-github I think both examples should report errors instead of quietly eating the expression characters.
The first example is a clearly invalid expression and should be reported as such, and the second is unsupported, preferably it should include the whole unsupported group in its error message.
This PR is just for removing a DoS vector for projects using the llama.cpp codebase. Improving the parser is out of scope.
I'm not talking about improving the parser, just reporting the error instead of quietly mangling the expression.
Just adding an error to the _errors stack and returning an empty pair might do the trick?