syn icon indicating copy to clipboard operation
syn copied to clipboard

impl PartialOrd for Cursor

Open CAD97 opened this issue 2 years ago • 14 comments

This is primarily intended for use with discouraged::Speculative parsing in order to implement one of the suggested error strategies, namely to show the error from the speculative parse which consumed the most tokens.

Example use case

I am implementing a macro which allows declaring a syntax tree enum and automatically provides a PEG-style parse implementation which tries each variant in order. This is, in fact, the express reason I added advance_to back in 2019; I guess it took me a while to actually go back and try to finish the experiment.

Yes, I know that syn is deliberately biased to LL(3) parsing. In fact, my original macro was a much more complicated grammar specification to allow it to describe LL(3) lookahead, but that complexity came at a cost of being painful to work both on and with. The intent of the new vastly simplified macro just using PEG is to use declarative PEG where it is enough, but still allow (and encourage) writing and using manual syntax elements mixed in with the declarative elements. Additionally, PEG exhibits much the same performance characteristics as LL(3) when the described grammar is in fact LL(3), as each speculative parse terminates within the three leading tokens, though it does still suffer from worse errors.

The currently emitted parse implementation is

impl $crate::syn::parse::Parse for $Name {
    fn parse(input: $crate::syn::parse::ParseStream) -> $crate::syn::parse::Result<Self> {
        use $crate::syn::parse::discouraged::Speculative;
        let mut best_guess = ::std::option::Option::None;
        let mut best_guess_variant = "";
        let mut best_guess_remaining = ::std::primitive::usize::MAX;
        $({
            let fork = input.fork();
            match fork.parse::<$crate::__switch! {
                if { $($NameVariant)? }
                do { $($NameVariant)? }
                else { [<$Name $Variant>] }
            }>() {
                ::std::result::Result::Ok(v) => {
                    input.advance_to(&fork);
                    return ::std::result::Result::Ok($Name::$Variant(v));
                }
                ::std::result::Result::Err(e) => {
                    // TODO: ask syn for a Cursor::cmp to optimize this
                    let this_guess_remaining =
                        fork.cursor().token_stream().into_iter().count();
                    if this_guess_remaining < best_guess_remaining {
                        best_guess = ::std::option::Option::Some(e);
                        best_guess_variant = ::std::stringify!($Variant);
                        best_guess_remaining = this_guess_remaining;
                    }
                },
            }
        })*
        ::std::result::Result::Err(input.error(format_args!(
            "expected {} but failed to parse any variant; best attempt was {} with {}",
            ::std::stringify!($Name),
            best_guess_variant,
            best_guess.unwrap(),
        )))
    }
}

Of note is the failure case:

::std::result::Result::Err(e) => {
    // TODO: ask syn for a Cursor::len to optimize this
    let this_guess_remaining =
        fork.cursor().token_stream().into_iter().count();
    if this_guess_remaining < best_guess_remaining {
        best_guess = ::std::option::Option::Some(e);
        best_guess_variant = ::std::stringify!($Variant);
        best_guess_remaining = this_guess_remaining;
    }
},

In order to determine how far the forked ParseBuffer got, the first instinct would be to compare the .span()s. However, comparing spans doesn't work in proc macros, and does the wrong thing anyway, since the spans could have arbitrary relations due to macro expansion. So I necessarily fall back to the only other option: counting the remaining unparsed tokens. With the Cursor API, this is most easily done by creating a new .token_stream(), but could also potentially be done by counting the length of an iterator constructed from successively using .token_tree() to get an advanced cursor.

However, the Cursor implementation has the needed information to say which Cursor is more advanced already, via the buffer entry pointers. Exposing Cursor ordering in this manner allows determining which speculative parse got the furthest just by comparing the cursors:

::std::result::Result::Err(e) => {
    let this_guess_cursor = Some(fork.cursor());
    if this_guess_cursor > best_guess_cursor {
        best_guess = ::std::option::Option::Some(e);
        best_guess_variant = ::std::stringify!($Variant);
        best_guess_cursor = this_guess_cursor;
    }
},

An alternative is to add an API directly to ParseStream which allows comparing them, but it can't be part of Speculative, because we didn't seal the trait.

CAD97 avatar Jul 08 '22 03:07 CAD97

(I'm up to date with master, so I don't think those CI failures are me? I cannot reproduce locally.)

CAD97 avatar Jul 08 '22 03:07 CAD97

I am on board with a PartialOrd impl, but I don't think this implementation works. With a None delimited group, comparing the following two cursors would give a random answer depending on how the memory allocator arranged the token buffers.

1 «2 3» 4
     ^  ^

dtolnay avatar Jul 08 '22 04:07 dtolnay

Oh, I misunderstood how the cursor implementation works when I skimmed lightly. (Mainly because since bump only does a ptr increment I thought the buffer must be contiguous.) I'll take another look sometime next week to see if I can figure out a working implementation. And write some tests.

CAD97 avatar Jul 08 '22 06:07 CAD97

Given the tree nature of the TokenBuffer, there's no clever way to compare cursors directly. As such, the simplest and perhaps only way works out: just keep track of how many entries we've walked past, and compare that count.

If you give me a pointer on where/how to add a test, I'll add a couple.

CAD97 avatar Jul 25 '22 00:07 CAD97

I am not pleased that this implementation adds bookkeeping overhead to every cursor operation (which underlies all parsing) even in programs that never use Speculative. That seems like the wrong tradeoff. Would you be able to come up with any other ways to implement this using only the existing data structures?

dtolnay avatar Jul 25 '22 00:07 dtolnay

I'm not super happy with it either. At the same time, though, this is a fairly small cost, so it might not be problematic?

I think the only solution for comparing Cursors that's zero overhead when not comparing cursors is to count the remaining tokens, i.e. call bump until we reach Cursor::End (iter::successors(Some(c), Cursor::skip)). This is at least better than what can be done externally (iter::successors(Some(c), |c| c.token_tree().map(|(_, c)| c)).count()), so perhaps that's good enough for including it (or perhaps just exposing skip) and letting speculative parsing eat the cost.

CAD97 avatar Jul 25 '22 00:07 CAD97

The reason I don't think anything better is possible is that cursors can be arbitrarily many None-delimited groups deep.

The best middle ground I can think of is to carry along a pointer to the base stream, compare that, and only walk if they have the same base pointer.

(This had a different idea previously but that was broken.)

CAD97 avatar Jul 25 '22 01:07 CAD97

cursors can be arbitrarily many None-delimited groups deep.

In practice they are not.

You should be able to efficiently compute the depth of a cursor without using skip, by traversing the scope pointer to jump directly to the end of the current group, and then traversing the Entry::End pointer to go up one group.

dtolnay avatar Jul 25 '22 01:07 dtolnay

Well, I've found a sufficiently clever solution, I think. Clever enough that I'm not 100% confident betting on it without some good tests. But I think this handles all cases without falling back to counting skips, and works fast in the common case that one or both pointers are not in a None-group.

cases to test
# both rooted
« ∘ ● ∘ ● ∘ »

# rooted, group before
« ∘ « ∘ ● ∘ » ∘ ● ∘ »

# rooted, group after
« ∘ ● ∘ « ∘ ● ∘ » ∘ »

# rooted, group at eof
« ∘ ● ∘ « ∘ ● ∘ » »

# different group
« ∘ « ∘ ● ∘ » ∘ « ∘ ● ∘ » ∘ »

# increased nesting
« ∘ « ∘ « ∘ ● ∘ » ∘ » ∘ « ∘ « ∘ ● ∘ » ∘ » ∘ »

# all of the rooted cases, but in a group

...but this also lets me realize that this doesn't actually fully solve the problem that I want to solve, because if two speculative ParseStreams fail in separate positions in the same {} block, the Cursor held by both forks will still be to the same block.

Well, just call that a limitation. Even with my PEG-generating-macro, I still intend to also encourage enums to be hand-parsed in LL(3), rather than relying on PEG. (Also, Error::combine is new from the last time I was browsing the docs, I think; using that is perhaps the better approach, and offers an escape hatch if two forks do stop in the same token tree.)

CAD97 avatar Jul 25 '22 03:07 CAD97

If this implementation looks okay, I'll replace range.contains and find a good way to put some tests in.

CAD97 avatar Jul 25 '22 03:07 CAD97

As for tests: someday we should organize those but for now you can create a new top-level integration test file in the tests dir. Thanks!

dtolnay avatar Jul 27 '22 00:07 dtolnay

Sorry for stumbling in with an unexpectedly difficult problem 🙃 I honestly originally thought that self.ptr <=> other.ptr was sufficient[^3]...

I think I've determined my original use case isn't served by Cursor comparison[^1], but I'm still happy to see this PR to completion because it could be useful in other cases[^2] and just generally seems like something the API should provide.

I've got some time this weekend to pursue whichever option you think is most worth trying. (Walk-to-End(nullptr) and compare, that but cached for sort, or refactoring TokenBuffer to a completely flat buffer.)

After writing out this and the inline comment, I'm actually quite partial to trying for a flat tree repr...

[^1]: There's no way to get the Cursor of the failing location from a failed ParseStream::parse, so the forked failed Cursor the ordered choice parser actually has is fairly uselessly pointing to after the block the failure occurred in (but this is unspecified behavior).

[^2]: Consider e.g. instead of PEG's ordered choice, taking the successful arm from multiple which consumed the greater number of tokens.

[^3]: Giant[^5] refactor to make it sufficient: "just[^4]" serialize each nested group into the same contiguous buffer as they're encountered, and instead of Entry::Group(TokenBuffer) and Entry::End(*const Entry), have Entry::Group(usize) and Entry::End(usize), where the usize is the offset to the matching group start/end. The observation that makes this possible for TokenBuffer but not TokenStream is not doing consuming iteration handing out TokenStream to the group insides. This could be positive or negative on perf, though I'd give a slight odds for it being positive to store the whole thing in one buffer.

[^4]: With the scariest scare quotes possible.

[^5]: Though in theory, actually not that complicated. I've done this refactoring before with other immutable trees, and it's often surprisingly simple if nothing goes wrong. (And if something goes wrong, it's usually just not possible to do the transform.) If it works, it might even be simpler than the scheme for comparing Cursors into the treed tree implementation.

CAD97 avatar Jul 27 '22 04:07 CAD97

I'd be on board with pursuing a flat representation. I think your assessment is correct that it has odds in favor of performance positive, and obviously it makes PartialOrd easy and fast.

I am also open to exploring possible ways to get a more meaningful cursor from a failed parse on a fork.

dtolnay avatar Jul 27 '22 05:07 dtolnay

Oh for the flat representation work: make sure to fetch from master since I was poking around in TokenBuffer construction recently, since it had to be paged in for reviewing this PR anyway.

dtolnay avatar Jul 27 '22 05:07 dtolnay

Closing in favor of the much simpler #1223

CAD97 avatar Sep 28 '22 00:09 CAD97