syn
syn copied to clipboard
impl PartialOrd for Cursor
This is primarily intended for use with discouraged::Speculative
parsing in order to implement one of the suggested error strategies, namely to show the error from the speculative parse which consumed the most tokens.
Example use case
I am implementing a macro which allows declaring a syntax tree enum and automatically provides a PEG-style parse implementation which tries each variant in order. This is, in fact, the express reason I added advance_to
back in 2019; I guess it took me a while to actually go back and try to finish the experiment.
Yes, I know that syn is deliberately biased to LL(3) parsing. In fact, my original macro was a much more complicated grammar specification to allow it to describe LL(3) lookahead, but that complexity came at a cost of being painful to work both on and with. The intent of the new vastly simplified macro just using PEG is to use declarative PEG where it is enough, but still allow (and encourage) writing and using manual syntax elements mixed in with the declarative elements. Additionally, PEG exhibits much the same performance characteristics as LL(3) when the described grammar is in fact LL(3), as each speculative parse terminates within the three leading tokens, though it does still suffer from worse errors.
The currently emitted parse implementation is
impl $crate::syn::parse::Parse for $Name {
fn parse(input: $crate::syn::parse::ParseStream) -> $crate::syn::parse::Result<Self> {
use $crate::syn::parse::discouraged::Speculative;
let mut best_guess = ::std::option::Option::None;
let mut best_guess_variant = "";
let mut best_guess_remaining = ::std::primitive::usize::MAX;
$({
let fork = input.fork();
match fork.parse::<$crate::__switch! {
if { $($NameVariant)? }
do { $($NameVariant)? }
else { [<$Name $Variant>] }
}>() {
::std::result::Result::Ok(v) => {
input.advance_to(&fork);
return ::std::result::Result::Ok($Name::$Variant(v));
}
::std::result::Result::Err(e) => {
// TODO: ask syn for a Cursor::cmp to optimize this
let this_guess_remaining =
fork.cursor().token_stream().into_iter().count();
if this_guess_remaining < best_guess_remaining {
best_guess = ::std::option::Option::Some(e);
best_guess_variant = ::std::stringify!($Variant);
best_guess_remaining = this_guess_remaining;
}
},
}
})*
::std::result::Result::Err(input.error(format_args!(
"expected {} but failed to parse any variant; best attempt was {} with {}",
::std::stringify!($Name),
best_guess_variant,
best_guess.unwrap(),
)))
}
}
Of note is the failure case:
::std::result::Result::Err(e) => {
// TODO: ask syn for a Cursor::len to optimize this
let this_guess_remaining =
fork.cursor().token_stream().into_iter().count();
if this_guess_remaining < best_guess_remaining {
best_guess = ::std::option::Option::Some(e);
best_guess_variant = ::std::stringify!($Variant);
best_guess_remaining = this_guess_remaining;
}
},
In order to determine how far the forked ParseBuffer
got, the first instinct would be to compare the .span()
s. However, comparing spans doesn't work in proc macros, and does the wrong thing anyway, since the spans could have arbitrary relations due to macro expansion. So I necessarily fall back to the only other option: counting the remaining unparsed tokens. With the Cursor
API, this is most easily done by creating a new .token_stream()
, but could also potentially be done by counting the length of an iterator constructed from successively using .token_tree()
to get an advanced cursor.
However, the Cursor
implementation has the needed information to say which Cursor
is more advanced already, via the buffer entry pointers. Exposing Cursor
ordering in this manner allows determining which speculative parse got the furthest just by comparing the cursors:
::std::result::Result::Err(e) => {
let this_guess_cursor = Some(fork.cursor());
if this_guess_cursor > best_guess_cursor {
best_guess = ::std::option::Option::Some(e);
best_guess_variant = ::std::stringify!($Variant);
best_guess_cursor = this_guess_cursor;
}
},
An alternative is to add an API directly to ParseStream
which allows comparing them, but it can't be part of Speculative
, because we didn't seal the trait.
(I'm up to date with master, so I don't think those CI failures are me? I cannot reproduce locally.)
I am on board with a PartialOrd impl, but I don't think this implementation works. With a None
delimited group, comparing the following two cursors would give a random answer depending on how the memory allocator arranged the token buffers.
1 «2 3» 4
^ ^
Oh, I misunderstood how the cursor implementation works when I skimmed lightly. (Mainly because since bump
only does a ptr increment I thought the buffer must be contiguous.) I'll take another look sometime next week to see if I can figure out a working implementation. And write some tests.
Given the tree nature of the TokenBuffer, there's no clever way to compare cursors directly. As such, the simplest and perhaps only way works out: just keep track of how many entries we've walked past, and compare that count.
If you give me a pointer on where/how to add a test, I'll add a couple.
I am not pleased that this implementation adds bookkeeping overhead to every cursor operation (which underlies all parsing) even in programs that never use Speculative. That seems like the wrong tradeoff. Would you be able to come up with any other ways to implement this using only the existing data structures?
I'm not super happy with it either. At the same time, though, this is a fairly small cost, so it might not be problematic?
I think the only solution for comparing Cursor
s that's zero overhead when not comparing cursors is to count the remaining tokens, i.e. call bump until we reach Cursor::End
(iter::successors(Some(c), Cursor::skip)
). This is at least better than what can be done externally (iter::successors(Some(c), |c| c.token_tree().map(|(_, c)| c)).count()
), so perhaps that's good enough for including it (or perhaps just exposing skip
) and letting speculative parsing eat the cost.
The reason I don't think anything better is possible is that cursors can be arbitrarily many None
-delimited groups deep.
The best middle ground I can think of is to carry along a pointer to the base stream, compare that, and only walk if they have the same base pointer.
(This had a different idea previously but that was broken.)
cursors can be arbitrarily many
None
-delimited groups deep.
In practice they are not.
You should be able to efficiently compute the depth of a cursor without using skip
, by traversing the scope
pointer to jump directly to the end of the current group, and then traversing the Entry::End
pointer to go up one group.
Well, I've found a sufficiently clever solution, I think. Clever enough that I'm not 100% confident betting on it without some good tests. But I think this handles all cases without falling back to counting skip
s, and works fast in the common case that one or both pointers are not in a None
-group.
cases to test
# both rooted
« ∘ ● ∘ ● ∘ »
# rooted, group before
« ∘ « ∘ ● ∘ » ∘ ● ∘ »
# rooted, group after
« ∘ ● ∘ « ∘ ● ∘ » ∘ »
# rooted, group at eof
« ∘ ● ∘ « ∘ ● ∘ » »
# different group
« ∘ « ∘ ● ∘ » ∘ « ∘ ● ∘ » ∘ »
# increased nesting
« ∘ « ∘ « ∘ ● ∘ » ∘ » ∘ « ∘ « ∘ ● ∘ » ∘ » ∘ »
# all of the rooted cases, but in a group
...but this also lets me realize that this doesn't actually fully solve the problem that I want to solve, because if two speculative ParseStream
s fail in separate positions in the same {}
block, the Cursor
held by both forks will still be to the same block.
Well, just call that a limitation. Even with my PEG-generating-macro, I still intend to also encourage enums to be hand-parsed in LL(3), rather than relying on PEG. (Also, Error::combine
is new from the last time I was browsing the docs, I think; using that is perhaps the better approach, and offers an escape hatch if two forks do stop in the same token tree.)
If this implementation looks okay, I'll replace range.contains
and find a good way to put some tests in.
As for tests: someday we should organize those but for now you can create a new top-level integration test file in the tests dir. Thanks!
Sorry for stumbling in with an unexpectedly difficult problem 🙃 I honestly originally thought that self.ptr <=> other.ptr
was sufficient[^3]...
I think I've determined my original use case isn't served by Cursor
comparison[^1], but I'm still happy to see this PR to completion because it could be useful in other cases[^2] and just generally seems like something the API should provide.
I've got some time this weekend to pursue whichever option you think is most worth trying. (Walk-to-End(nullptr)
and compare, that but cached for sort
, or refactoring TokenBuffer
to a completely flat buffer.)
After writing out this and the inline comment, I'm actually quite partial to trying for a flat tree repr...
[^1]: There's no way to get the Cursor
of the failing location from a failed ParseStream::parse
, so the forked failed Cursor
the ordered choice parser actually has is fairly uselessly pointing to after the block the failure occurred in (but this is unspecified behavior).
[^2]: Consider e.g. instead of PEG's ordered choice, taking the successful arm from multiple which consumed the greater number of tokens.
[^3]: Giant[^5] refactor to make it sufficient: "just[^4]" serialize each nested group into the same contiguous buffer as they're encountered, and instead of Entry::Group(TokenBuffer)
and Entry::End(*const Entry)
, have Entry::Group(usize)
and Entry::End(usize)
, where the usize
is the offset to the matching group start/end. The observation that makes this possible for TokenBuffer
but not TokenStream
is not doing consuming iteration handing out TokenStream
to the group insides. This could be positive or negative on perf, though I'd give a slight odds for it being positive to store the whole thing in one buffer.
[^4]: With the scariest scare quotes possible.
[^5]: Though in theory, actually not that complicated. I've done this refactoring before with other immutable trees, and it's often surprisingly simple if nothing goes wrong. (And if something goes wrong, it's usually just not possible to do the transform.) If it works, it might even be simpler than the scheme for comparing Cursor
s into the treed tree implementation.
I'd be on board with pursuing a flat representation. I think your assessment is correct that it has odds in favor of performance positive, and obviously it makes PartialOrd easy and fast.
I am also open to exploring possible ways to get a more meaningful cursor from a failed parse on a fork.
Oh for the flat representation work: make sure to fetch from master since I was poking around in TokenBuffer construction recently, since it had to be paged in for reviewing this PR anyway.
Closing in favor of the much simpler #1223