Cut (~) operator in grammar unclear
Python's Full Grammar specification and PEP 617 states:
~(“cut”): commit to the current alternative and fail the rule ~~even~~ [NB: typo?] if this [alternative] fails to parse
It is not clear to me whether the cut applies to all alternatives in the current parse tree or only the latest alternative. Apparently, there are both types of cuts, as has been discussed in this issue, and I think (?) the conclusion by Guido was
Your first quote convinces me that cut is meant to act locally (which is how I have implemented it).
But the current grammar contains rules where the cut appears (to me) to be useless, for example:
assignment_expression:
| NAME ':=' ~ expression
If cut only acts locally on this single rule, wouldn't the rule be the same without the cut, since there are no alternatives to skip over in case expression fails to parse?
Either way, it would be nice if the documentation could be clarified with regard to which kind of cut is used in Python's grammar.
CC @encukou
Hmm... this isn't clear to me either.
Looking at the implementation: if an alternative matches up to a cut, then other alternatives in the current Rhs are skipped. But, we don't quite define what Rhs is; and there's a flatten operation that inlines parenthesized groups into a “parent” Rhs.
@lysnikolaou, do you know what exactly a cut should do? (I'm interested in the intent more than the current behaviour.)
This might be an academic question as this point -- it does looks like in the documented grammar, all cuts are “useless”.
If cut only acts locally on this single rule, wouldn't the rule be the same without the cut, since there are no alternatives to skip over in case
expressionfails to parse?
Cuts are taken from the actual grammar as used by CPython. Some other details are omitted in the docs, though. So, while a cut might look “useless”, it might be there to support better error messages (invalid rules) or as an optimization.
I guess people expect formal grammars in docs to be simplified (in the mathematical sense), and we should note that this one isn't. (Generating simplified (but equivalent) grammar is on my TODO list, but currently blocked by other tasks that will take a long time.)
do you know what exactly a
cutshould do? (I'm interested in the intent more than the current behaviour.)
I can only confirm that when cut was initially added to the grammar, the intention was to locally commit to the current alternative, i.e. fail the whole rule if the current alternative fails.
So, while a cut might look “useless”, it might be there to support better error messages (invalid rules) or as an optimization.
Yup, cuts most of the time are optimizations. If I'm not mistaken, the current status is that we mostly cut before invalid alternatives when we know that no customized error messages will be relevant. If I remember correctly, there was a time when there were some significant cuts in the grammar but that's not true anymore.
assignment_expression: | NAME ':=' ~ expression
Skimming quickly through the grammar, it seems like this is indeed unnecessary. It might be a remnant from refactoring the rule but not removing the cut.
Sadly, even though I wrote the initial PEG parser, I don't recall the precise semantics of cut either. We're sure a bumbling set of experts here! :-)
I have a feeling though that it might mean "If we got this far, this is definitely the right alternative, and if we don't see what's expected after this point, that's an error". IIRC if we were to omit the cut, on failure to match the entire alternative would make the parser attempt other alternatives (also at a higher level than the current list of alternatives) causing less precise error reporting? Under this interpretation cut is not always just an optimization -- it may prevent a later non-error alternative from being taken in parts of the grammar where ordering matters (in general, the order of alternatives does matter with PEG, unless for classic context-free grammars).
CC'ing @apalala (creator of TatSu), maybe he recalls the exact rules for the cut operator.
Hi Guido,
Your interpretation is correct.
A parsed cut will prevent other alternatives from being tried.
When we discussed this before you had concerns about a cut escaping the current rule. I think TatSu does let it escape, and I think either way is good, though a cut in a one-alternative rule is hard to grok.
About the error reporting, most agree that the parser should keep a
furthest_pos, and report any errors against it.
You should consider what rules like this mean:
letter = ('a' ~ ) *. # should be equivalent to 'a'+?
FWIW, TatSu clears the memo cache by evicting any memos with a position lower than the one of a parsed cut. Today I'm changing that to a bounded cache of size 1024, and I see no performance differences with the parsers for Java and COBOL.
Happy Holidays!
Juancarlo Añez @.***
A parsed cut will prevent other alternatives from being tried.
I think we're all in agreement on that. What I'm not sure about (nor do I know what Python's PEG parser does nor whether it ever occurs in our grammar) is this:
start: xx | yy | zz
xx: ...
yy: NAME `=` ~ NUMBER | NAME '+=' NUMBER
zz: NAME = NAME
Here I wonder if on input foo = bar the zz alternative is considered or not -- IOW, is the cut global or local. Is there a standard answer in the world of PEG theory?
I didn't find much written about cut in PEG. There's a paper by Kota Mizushima that I think is about automatically introducing cut during grammar parsing.
I knew cut would be useful in Grako/Tatsu because of my experience in grammar parsing with PROLOG.
In your example the answer is yes, zz will be considered. What will never
be considered is the NAME += NUMBER alternative, because the cut is in
front of NUMBER, which means that the whole rule will fail if what's next
doesn't match NUMBER.
With:
yy: NAME = NUMBER ~ | NAME '+=' NUMBER
Every alternative will be considered.
See:
https://gemini.google.com/share/f241ecbade42
-- Juancarlo Añez @.***
On Thu, Dec 25, 2025 at 1:56 PM Guido van Rossum @.***> wrote:
gvanrossum left a comment (python/cpython#143054) https://github.com/python/cpython/issues/143054#issuecomment-3691638706
A parsed cut will prevent other alternatives from being tried.
I think we're all in agreement on that. What I'm not sure about (nor do I know what Python's PEG parser does nor whether it ever occurs in our grammar) is this:
start: xx | yy | zz
xx: ... yy: NAME
=~ NUMBER | NAME '+=' NUMBER zz: NAME = NAMEHere I wonder if on input foo = bar the zz alternative is considered or not -- IOW, is the cut global or local. Is there a standard answer in the world of PEG theory?
— Reply to this email directly, view it on GitHub https://github.com/python/cpython/issues/143054#issuecomment-3691638706, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEI5FMND4NLSFITU7IYL5L4DQQMHAVCNFSM6AAAAACPWKNUP6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTMOJRGYZTQNZQGY . You are receiving this because you were mentioned.Message ID: @.***>
As to the error reporting, this is what TatSu does:
try:
rule = self._find_rule(start)
return rule()
except FailedCut as e:
self._set_furthest_exception(e.nested)
raise self._furthest_exception from e
except FailedParse as e:
self._set_furthest_exception(e)
raise self._furthest_exception from e
finally:
self._clear_memoization_caches()
def _set_furthest_exception(self, e):
if (
not self._furthest_exception
or e.pos > self._furthest_exception.pos
):
self._furthest_exception = e
In your parser, you'd have to save furthest_pos and furthest_errror, or
something like that.
-- Juancarlo Añez @.***
On Thu, Dec 25, 2025 at 2:23 PM Juancarlo Añez @.***> wrote:
I didn't find much written about cut in PEG. There's a paper by Kota Mizushima that I think is about automatically introducing cut during grammar parsing.
I knew cut would be useful in Grako/Tatsu because of my experience in grammar parsing with PROLOG.
In your example the answer is yes, zz will be considered. What will never be considered is the
NAME += NUMBERalternative, because the cut is in front of NUMBER, which means that the whole rule will fail if what's next doesn't match NUMBER.With:
yy: NAME
=NUMBER ~ | NAME '+=' NUMBEREvery alternative will be considered.
See:
https://gemini.google.com/share/f241ecbade42
-- Juancarlo Añez @.***
On Thu, Dec 25, 2025 at 1:56 PM Guido van Rossum @.***> wrote:
gvanrossum left a comment (python/cpython#143054) https://github.com/python/cpython/issues/143054#issuecomment-3691638706
A parsed cut will prevent other alternatives from being tried.
I think we're all in agreement on that. What I'm not sure about (nor do I know what Python's PEG parser does nor whether it ever occurs in our grammar) is this:
start: xx | yy | zz
xx: ... yy: NAME
=~ NUMBER | NAME '+=' NUMBER zz: NAME = NAMEHere I wonder if on input foo = bar the zz alternative is considered or not -- IOW, is the cut global or local. Is there a standard answer in the world of PEG theory?
— Reply to this email directly, view it on GitHub https://github.com/python/cpython/issues/143054#issuecomment-3691638706, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEI5FMND4NLSFITU7IYL5L4DQQMHAVCNFSM6AAAAACPWKNUP6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTMOJRGYZTQNZQGY . You are receiving this because you were mentioned.Message ID: @.***>
I forgot to mention the interaction between cut and lookaheads.
For example:
statement = | &'if' ~ if_statement | &'while' ~ while_statement | ...
There's no point in trying a rule when there's a cheap way to narrow the alternatives.
-- Juancarlo Añez @.***
Okay. So my main question is now cleared up: cut only affects the current list of alternatives, skipping the remaining alternatives in the list if a cut was reached but a failure occurred after that in the same alternative. Ergo, a cut in the last or only alternative for a given rule has no effect.
Thinking about using cut in a parenthesized list of alternatives, I think the most sensitive interpretation is to let its scope be just that parenthesized list. So from that, ('a' ~) makes little sense, and I don't understand why ('a' ~)* would have a different meaning than 'a'*. So I am stull confused about @apalala's example. But maybe I don't follow what *. and *? mean? Those don't exist in Python's PEG parser generator, AFAIK.
In TatSu * means repeat zero or more times, and + means repeat one or more times.
There is an implicit choice in 'a'*, which is ('a'|()).
I think you're correct in that ('a' ~)* does nothing new. But consider ('a' ~ 'b')*.
Let me ask for permission to share the TatSu grammar for Java, which has plenty of uses of cut.
Note that cut may interfere with alternatives for error reporting or parser recovery, because those are usually the last alternative in a choice.
I changed TatSu so it doesn't let cut exceptions travel outside the nearest choice. Previously this grammar would always fail, and now it produces 'something', which a more sensible outcome.
GRAMMAR = ''' start = | one | two ;
one =
| ~ !() # cut fail
| 'abc';
two = `something` ; # result is the quoted text
'''
Juancarlo Añez @.***
On Fri, Dec 26, 2025 at 1:29 PM Guido van Rossum @.***> wrote:
gvanrossum left a comment (python/cpython#143054) https://github.com/python/cpython/issues/143054#issuecomment-3693143860
Okay. So my main question is now cleared up: cut only affects the current list of alternatives, skipping the remaining alternatives in the list if a cut was reached but a failure occurred after that in the same alternative. Ergo, a cut in the last or only alternative for a given rule has no effect.
Thinking about using cut in a parenthesized list of alternatives, I think the most sensitive interpretation is to let its scope be just that parenthesized list. So from that, ('a' ~) makes little sense, and I don't understand why ('a' ~)* would have a different meaning than 'a'*. So I am stull confused about @apalala https://github.com/apalala's example. But maybe I don't follow what *. and *? mean? Those don't exist in Python's PEG parser generator, AFAIK.
— Reply to this email directly, view it on GitHub https://github.com/python/cpython/issues/143054#issuecomment-3693143860, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEI5FLCEMMJE2R2DQTBVND4DVV75AVCNFSM6AAAAACPWKNUP6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTMOJTGE2DGOBWGA . You are receiving this because you were mentioned.Message ID: @.***>
Hi all, it looks like the main question is mostly settled already, but I want to add a brief note for completeness.
I’m Kota Mizushima, the first author of “Packrat Parsers Can Handle Practical Grammars in Mostly Constant Space” (PASTE 2010), which introduced the cut operator in the context of packrat parsing.
In our work, the intended semantics of cut is local: it commits only within the current ordered choice (i.e., the current list of alternatives), and it does not “escape” to prune alternatives at higher levels of the parse. Operationally, once an alternative has matched up to a cut, any subsequent failure within that same alternative causes the whole choice to fail without trying the remaining alternatives of that choice.
This locality can also be seen via a standard PEG rewriting. For example, an expression of the form:
α ~ β / γ
can be represented (as a plain PEG without cut) as:
(α β) / (!α γ)
under the usual PEG semantics, which illustrates that the commit is confined to the current choice.
Regarding repetition: * typically desugars to an implicit choice (continue vs stop), e.g. E* can be viewed as a nonterminal A with:
A <- E A / ε
Therefore, ('a' ~)* can be seen as:
A <- ('a' ~) A / ε
In this particular case, the cut does not change the recognized language, because after consuming 'a' the continuation A can still succeed via ε, so the “failure-after-cut” situation does not arise. In contrast, patterns such as ('a' ~ 'b')* can change behavior: once 'a' is consumed, a failure to match 'b' cannot fall back to the ε branch to stop repeating, so the cut can matter.
In our implementation in the paper, the practical benefit of (local) cut is that it can enable safe memo-table pruning when the backtracking stack is empty; global cuts do not generally have the same “safe locality” property.
So, if CPython’s “~” is intended to be local to the current list of alternatives, that aligns closely with the original intent. A cut in the last/only alternative of a choice is indeed semantically redundant, though it may still be kept for error-reporting behavior, as an optimization artifact, or as a remnant after refactoring.
Kota Mizushima
@kmizu Thanks for your help! I hope you don't mind that I fixed some markup for you.
I have one final question for you: the effect of cut inside parentheses is (or should be) local to the parenthesized alternative(s), right? I think this follows from your examples but I'd rather hear an explicit answer.
The cut will be local to the parenthesized expression only if there is a choice within the parenthesis. Yet remember there's an implicit choice in closures and positive closures.
In Kota's notation, the expression ('a' ~ 'b')* translates to:
A <- ('a' ~ 'b') A / ε
which will fail on "ac".
There's also an implicit choice in optionals, like ('a' ~ 'b')?:
A <- ('a' ~ 'b') / ε
Note that the cut will be contained to the optional, but the whole optional will fail if input doesn't match, and that may affect its context.
-- Juancarlo Añez @.***
On Thu, Jan 1, 2026 at 10:54 AM Guido van Rossum @.***> wrote:
gvanrossum left a comment (python/cpython#143054) https://github.com/python/cpython/issues/143054#issuecomment-3703794441
@kmizu https://github.com/kmizu Thanks for your help! I hope you don't mind that I fixed some markup for you.
I have one final question for you: the effect of cut inside parentheses is (or should be) local to the parenthesized alternative(s), right? I think this follows from your examples but I'd rather hear an explicit answer.
— Reply to this email directly, view it on GitHub https://github.com/python/cpython/issues/143054#issuecomment-3703794441, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEI5FNPLBBLECYJ52VFPFL4EUYKTAVCNFSM6AAAAACPWKNUP6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTOMBTG44TINBUGE . You are receiving this because you were mentioned.Message ID: @.***>
@gvanrossum Yes — in the “local cut” semantics intended in our work, a cut is scoped to the innermost ordered-choice construct.
Also, as @apalala noted, some operators introduce an implicit choice when desugared (e.g. repetition/optional), so even when you don’t see a / syntactically, there can still be a local choice boundary:
E* ≡ R <- E R / ε
E? ≡ E / ε
Concretely, cut only commits within the current list of alternatives. So in a pattern like:
(A ~ B / C) / D
a cut reached while parsing (A ~ B) can prevent trying C, but it should not prevent trying D. In other words, the cut does not “escape” beyond the nearest enclosing (explicit or implicit) ordered choice.
One small syntactic note (since it matters for scoping): in the PEGs we assume (following the common Ford-style convention), the ordered choice operator / is right-associative and has lower precedence than sequencing. So
A ~ B / C / D
parses as
A ~ B / (C / D)
whereas
(A ~ B / C) / D
is a different structure. This is exactly why parentheses can change which “list of alternatives” a cut belongs to.
As the proposer of cut in PEGs, I should also say: the original paper did not spell out every scoping corner case as crisply as it could have. I’m sorry if it contributed to today’s ambiguity.
Thanks, and no worries -- everything actually makes intuitive sense.
Also, @kmizu and co-authors -- thanks so much for inventing Packrat parsing. Without it Python wouldn't have been using PEG today, and we'd still be struggling with trying to fit the intended language in an LR(1) parser framework. Consequently we wouldn't have come up with soft keywords and we wouldn't have had the match/case statement.
While talking about cut, Kota mentions somewhere that the memo cache may be pruned when a cut is reached.
In my case, my COBOL parser underperforms if I remove the pruning, even with a bounded dict for the memo cache.
This is the current implementation of cut in TatSu:
def _cut(self):
self._trace_cut()
self._cut_stack[-1] = True
# Kota Mizushima et al say that we can throw away
# memos for previous positions in the tokenizer under
# certain circumstances, without affecting the linearity
# of PEG parsing.
# https://kmizu.github.io/papers/paste513-mizushima.pdf
#
# We adopt the heuristic of always dropping the cache for
# positions less than the current cut position. It remains to
# be proven if doing it this way affects linearity. Empirically,
# it hasn't.
def prune(cache, cut_pos):
prune_dict(
cache,
lambda k, v: k[0] < cut_pos and not isinstance(v,
FailedLeftRecursion), )
prune(self._memos, self._pos)
-- Juancarlo Añez @.***
On Thu, Jan 1, 2026 at 10:18 PM Guido van Rossum @.***> wrote:
gvanrossum left a comment (python/cpython#143054) https://github.com/python/cpython/issues/143054#issuecomment-3704339917
Also, @kmizu https://github.com/kmizu and co-authors -- thanks so much for inventing Packrat parsing. Without it Python wouldn't have been using PEG today, and we'd still be struggling with trying to fit the intended language in an LR(1) parser framework. Consequently we wouldn't have come up with soft keywords and we wouldn't have had the match/case statement.
— Reply to this email directly, view it on GitHub https://github.com/python/cpython/issues/143054#issuecomment-3704339917, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEI5FLUSAGX6WMLI5EAUND4EXINTAVCNFSM6AAAAACPWKNUP6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTOMBUGMZTSOJRG4 . You are receiving this because you were mentioned.Message ID: @.***>
Thank you very much for the discussion!
It seems that the desired semantics aren't clear.
And since Python only uses ~ directly in the “top level” of a rule's alternative, we don't need to clarify them. Instead, we can deliberately leave these cases unspecified. If we need them in the future, we can do what we need.
I sent #143622 to "clarify" things and make sure we make a deliberate decision when we get to it.
I also extend my thanks! Feel free to close this issue at your convenience (or keep it open if you want to discuss more). My question is thoroughly answered.
Top-level cut can be good enough.
What TatSu does to handle all cases with consistent semantics is to use a
cut_stack. When an option is entered (|, ()*, ()+, []), parsing
pushes False onto the stack, and sets cut_stack[-1] = True if a cut is parsed. When the option finishes, success or failure, the cut_stack is popped.
@contextmanager
def _option(self) -> Generator[None, None, None]:
self.last_node = None
self._cut_stack += [False]
try:
with self._try():
yield
raise OptionSucceeded()
except FailedParse:
if self._cut_stack[-1]:
raise
else:
pass # ignore, so next option is tried
finally:
self._cut_stack.pop()