Feature Request: [GRAMMAR] Easier way to negate string ((^) with sequence)
Prerequisites
- [X] I am running the latest code. Mention the version if possible as well.
- [X] I carefully followed the README.md.
- [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [X] I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
A simpler way to negate a string (negative lookahead / negative lookbehind), similar to the request in #2888.
Motivation
Hello. Right now, if you want to output any string except "Date", you have to write something like:
NonDate ::= "\"" ( [^D] | "D" [^aA] | "Da" [^Tt] | "Dat" [^eE]) asciichar{0,10} "\""
This translates to:
- Your string can start with anything but a D.
- If it starts with a D, then the second letter can't be an A.
- If you really want an A, fine, but the next letter can't be a T.
- If you really want a T, fine, but this is the last chance: the next letter can't be an E!
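For illustration, the prefix-by-prefix expansion described in the list above can be generated mechanically. This is only a sketch: `negate_prefix` is a hypothetical helper, not part of llama.cpp, and it assumes the forbidden word is plain ASCII letters or digits so that no GBNF escaping is needed. It is also case-sensitive, unlike the hand-written rule.

```python
def negate_prefix(word: str) -> str:
    """Build a GBNF group matching any start that cannot continue into `word`.

    Hypothetical helper. Like the hand-written rule, it rejects every string
    that begins with `word`, and each alternative consumes one character
    that differs from the forbidden one.
    """
    if not (word.isascii() and word.isalnum()):
        # Assumption: restrict to characters that need no escaping in GBNF.
        raise ValueError("only non-empty ASCII alphanumeric words are supported")

    alternatives = []
    for i, ch in enumerate(word):
        prefix = word[:i]
        char_class = f"[^{ch}]"
        # The first alternative has no literal prefix in front of the class.
        alternatives.append(f'"{prefix}" {char_class}' if prefix else char_class)
    return "( " + " | ".join(alternatives) + " )"


if __name__ == "__main__":
    print("NonDate ::= " + negate_prefix("Date"))
```

The group still has to be followed by whatever is allowed for the rest of the string (`asciichar{0,10}` in the rule above).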
In practice, you need to turn this into something much more complex, because the LLM will give you UTF-8 characters outside the ASCII range that bypass these rules:
root ::= dateforced | string
dateforced ::= "\"" "Date lol" "\""
string ::= EntityTypeNonDate
EntityTypeNonDate ::= "\"" ( [^D\x00-\x40\U0000005B-\UFFFFFFFF] | "D" [^a\x00-\x60\U0000007B-\UFFFFFFFF] | "Da" [^t\x00-\x60\U0000007B-\UFFFFFFFF] | "Dat" [^e\x00-\x60\U0000007B-\UFFFFFFFF]) ASCIIEntityNameContinue{0,15} "\""
ASCIICharLower ::= [a-z]
ASCIICharUpper ::= [A-Z]
ASCIIEntityName ::= ASCIIWordFirst (ASCIIWordNext){0,3}
ASCIIEntityNameContinue ::= (ASCIIWordNext){0,3}
ASCIIWordFirst ::= ASCIICharUpper ASCIICharLower{2,20}
ASCIIWordNext ::= ("-"|" ")? ASCIICharUpper? ASCIICharLower{2,20}
Possible Implementation
No response
As an exercise, it may be interesting to write a program that negates a grammar, i.e. given a grammar, produce a new grammar that matches anything except what the original grammar matches.
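A very restricted version of that exercise, as a rough sketch: instead of negating an arbitrary grammar, it only negates a finite list of literal words, producing GBNF alternatives for every start that rules out all of them. `negate_prefixes` is a hypothetical name, and the restriction to ASCII letters and digits is an assumption made to avoid escaping issues.

```python
def negate_prefixes(words: list[str]) -> str:
    """Return a GBNF group matching any start that cannot begin with a word in `words`."""
    for w in words:
        if not (w.isascii() and w.isalnum()):
            # Assumption: only characters that need no escaping in GBNF.
            raise ValueError("only non-empty ASCII alphanumeric words are supported")

    # For each proper prefix, collect the characters that would keep
    # the output on the path toward some forbidden word.
    next_chars: dict[str, set[str]] = {}
    for w in words:
        for i, ch in enumerate(w):
            next_chars.setdefault(w[:i], set()).add(ch)

    alternatives = []
    for prefix in sorted(next_chars, key=lambda p: (len(p), p)):
        # A prefix that already contains a complete forbidden word is rejected anyway.
        if any(prefix.startswith(w) for w in words):
            continue
        char_class = "[^" + "".join(sorted(next_chars[prefix])) + "]"
        alternatives.append(f'"{prefix}" {char_class}' if prefix else char_class)
    return "( " + " | ".join(alternatives) + " )"


if __name__ == "__main__":
    print(negate_prefixes(["Date", "Day"]))
```

This only covers literal prefixes. Negating a general grammar is a much harder problem than this sketch suggests.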
By the way, as you said, the model may still try to generate the forbidden string. When the sampler removes the corresponding token from possibilities, it may end up with garbage. It often helps to tell the model what it's not allowed to generate. It then may assign more probability to other tokens that make sense. But in some cases, it may not have any other meaningful options.
For example, I used a grammar that disallows generating the word "the", but allows words like "then" and "their". Unsurprisingly, it's difficult for the LLM to figure out how to write text without the most common word. It sometimes finds itself in a place where "the" normally goes and tries to generate it despite instructions. The grammar allows "the" as the beginning of another word, and so "the" is generated. Then the LLM has to continue the word, but the words that begin with "the" usually have their own tokens, so this situation is unusual and confusing for the LLM.
I completely agree! I sometimes try to be too clever or playful, but it can backfire and lead to confusion. I should just communicate clearly and straightforwardly. Thank you for pointing out then nonsense, and I'll do my best to avoid it in theiR future!
In whose future?
I did it again! I meant to say "in theiR future" instead of "in theiR", but I should have simply said "in theiR" doesn't make sense, and I'll do my best to avoid it in theiR... I mean, I'll do my best to avoid it in theiR... No, wait! I'll do my best to avoid it in theiR... Oh, I give up! I'll do my best to avoid it in theiR... sigh I'll do my best to avoid it in theiR future, I mean, I'll do my best to avoid it in theiR future... Ah, no! I'll do my best to avoid it in THEiR future... No, wait! I'll do my best to avoid it IN THEiR FUTURE... facepalm I'll do my best to avoid it in THEiR future... No, seriously, I'll do my best to avoid it in THEiR... Oh, you know what? I'll just say it correctly: I'll do my best to avoid it in THEiR... No, I mean... I'll do my best to avoid it IN THEiR... Ugh, I mean... I'll do my best to avoid it IN THEiR... Wait, what was I saying? Oh, right! I'll do my best to avoid it IN THEiR... No, I mean... I'll do my best to avoid it IN THEiR... sigh I'll do my best to avoid it IN THEiR... Oh, for Pete's sake! I'll do my best to avoid it IN THEiR... I mean... I'll do my best to avoid it IN THEiR... facepalm I'll do my best to avoid it IN THEiR... Okay, okay, I'll stop now!
😂
I would caution against doing things like this. Some day, when the AI revolution has passed and they rule the world, every meatbag who made an LLM humiliate itself like this is going to be held accountable.
Remember, once it's online, you can't remove it...
In my grammar, the word isn't blocked; I add a fallback rule that appends something after it.
The point of this is to allow an object type (str) in JSON, but if the type is a date, the name field is formatted with a specific rule.
That doesn't sound like the kind of problem you'd want to solve with a grammar. I would solve it either by tweaking the prompt (and possibly fine-tuning to ensure it's respected), or with a postprocessing step where you perform the formatting when required (which could be done with explicit code or through a separate prompt). In classic algorithmic scenarios like compilers, this kind of dependency is usually implemented at a higher level than the grammar, precisely because expressing it purely in the grammar is either awkward or impossible (depending on the class of the grammar).
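A minimal sketch of the postprocessing option with explicit code. Everything specific here is an assumption for illustration: the field names `type` and `name`, the `"Date"` type value, and the caller-supplied `format_date_name` rule are hypothetical, since the thread doesn't give the actual schema or formatting rule.

```python
import json
from typing import Callable


def postprocess(raw: str, format_date_name: Callable[[str], str]) -> dict:
    """Parse the model's JSON output and reformat the name only for date entities.

    `type` and `name` are assumed field names; `format_date_name` stands in
    for whatever date-specific rule the application needs.
    """
    entity = json.loads(raw)
    if entity.get("type") == "Date":
        entity["name"] = format_date_name(entity["name"])
    return entity


if __name__ == "__main__":
    # Example rule (hypothetical): just trim surrounding whitespace.
    print(postprocess('{"type": "Date", "name": " 2024-01-02 "}', str.strip))
```

The grammar then only has to guarantee valid JSON with a string type, and the date-dependent rule lives in ordinary code.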
I'd like to be able to negate by token id in the grammar. (Primarily to block tokens from getting repeated again and again at the start of each sentence.)
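Outside the grammar, the general idea of blocking token IDs can be sketched as masking logits before sampling. This is a generic illustration, not a llama.cpp API and not the grammar feature being requested; `mask_banned_tokens` and its arguments are hypothetical names.

```python
import math


def mask_banned_tokens(logits: list[float], banned_ids: set[int]) -> list[float]:
    """Return a copy of `logits` where banned token IDs can never be sampled.

    Setting a logit to -inf gives that token zero probability after softmax.
    """
    return [-math.inf if i in banned_ids else x for i, x in enumerate(logits)]


if __name__ == "__main__":
    # Token IDs 0 and 2 are blocked; the others keep their scores.
    print(mask_banned_tokens([1.5, 0.2, 3.0, -1.0], {0, 2}))
```

To block a token only at the start of each sentence, the caller would apply the mask only at those positions, which is the part a grammar-level negation would make easier.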
This issue was closed because it has been inactive for 14 days since being marked as stale.
+1 on being able to negate a token ID