pest icon indicating copy to clipboard operation
pest copied to clipboard

[RFC] support custom state and hooks

Open skyzh opened this issue 1 year ago • 5 comments

Guided-Level Explanation

This PR adds a "context" for parsing with pest (for one of my course projects where I'm using pest...). Now users can parse with a custom state as a context. One example is the following grammar, where we want ints to parse a sequence of non-decrementing numbers.

WHITESPACE   =  _{ " " | "\t" | NEWLINE }
int          =  @{ (ASCII_NONZERO_DIGIT ~ ASCII_DIGIT+ | ASCII_DIGIT) }
__HOOK_INT   =  _{ int }
ints         =  { SOI ~ __HOOK_INT* ~ EOI }

For all terms beginning with __HOOK, the generator will call user-defined hooks before deciding whether to parse.

    #[derive(Parser)]
    #[grammar = "../examples/hook.pest"]
    #[custom_state(crate::parser::CustomState)]
    pub struct Parser;

    pub struct CustomState {
        pub max_int_visited: usize,
    }

    impl Parser {
        #[allow(non_snake_case)]
        fn hook__HOOK_INT<'a>(state: &mut CustomState, span: Span<'a>) -> bool {
            let val: usize = span.as_str().parse().unwrap();
            if val >= state.max_int_visited {
                state.max_int_visited = val;
                true
            } else {
                false
            }
        }
    }

The state will need to support snapshot / recovery so that the hook can be placed anywhere. But users can also opt-out recovery and modify their grammar to avoid this situation. Examples in hook.rs. and hook.pest.

Implementation-Level Explanation

ParseState now contains a field for storing custom state. It is by-default set to ().

pub struct ParserState<'i, R: RuleType, S = ()>
where
    S: StateCheckpoint,

We now have a new parser trait called StateParser, where we accept a state + input and return rules + state.

/// A trait with a single method that parses strings.
pub trait StateParser<R: RuleType, S: StateCheckpoint> {
    /// Parses a `&str` starting from `rule`.
    #[allow(clippy::perf)]
    fn parse_with_state(rule: R, input: &str, state: S) -> Result<(S, Pairs<'_, R>), Error<R>>;
}

The generator crate is refactored to implement custom state generic type for all traits. Some functions also have new signatures in order to accept custom state.

The API change doesn't seem to affect any existing examples.

This PR is a demo and I'm open to suggestions / changes to this new feature. If it doesn't work I can also maintain it in my own fork. Thanks for review!

skyzh avatar Feb 26 '23 17:02 skyzh

This feels like a job for the compiler, not the parser. I know that some PEG libraries can do this, such as the one Python uses for interpreting Python code, but in generic libraries, makes me think we're overstepping the core premise.

I'm willing to be convinced, just want to get that out first.

NoahTheDuke avatar Feb 26 '23 18:02 NoahTheDuke

I totally agree that this adds too much complexity for the parser. But meanwhile I feel without custom state and hooks, it will be hard to do some context-aware parsing. An example is parsing a C-like language.

typedef int foo;
int main() {
  foo * bar;
}

Whether to parse foo as an identifier (and then parse the statement as expression multiply expression), or to parse it as foo* bar, depends on if foo is type-defined before. If we don't have a context, then we will need to do some extra work after parsing to revert the wrong decision, which can be painful.

I'm also thinking of removing CheckpointState trait in this PR, and we can have something like action in peg.js. This is easier to understand from the semantic aspect. https://pegjs.org/documentation

Hence, an alternative is to support a syntax like:

Typedef = { "typedef" ~ Type ~ Ident { SaveTypeAction } ~ ";" }
Type = { Ident { CheckTypeAction } /* user def type */ | Int | Void | Bool }

so that we can run some custom action to check / save something, while error recovery becomes some kind of undefined behavior.

skyzh avatar Feb 26 '23 23:02 skyzh

It's both a strength and a limitation of pest that it sticks to pure[^1] PEG, without any sort of user actions in the grammar.

[^1]: Technically speaking, PUSH/PEEK/POP break the quality of pest only parsing true context-free languages (of which PEG is known to be a subset and conjectured to be a proper subset), as it adds a manipulable stack, making the matching engine computationally in the class of turing machines rather than finite automaton.

If pest does want to support non-PEG semantics, I believe the way to go about doing so is to add the ability to fully define custom BUILTIN style rules (e.g. PUSH/PEEK/POP, though probably without the parameterization), similar to how logos callbacks function, and the current builtins. If it wants to do more advanced parsing, it can maybe recursively call into the parser its a part of, or use a different parser.

Any newly designed language SHOULD NOT emulate C's lexer hack. The documentation should make it clear that using the functionality is discouraged and should be avoided if that's at all an option.

To make these special rules stick out more (whether they're just binary acceptance functions or more sophisticated custom rules), they should look different from normal rules. Three arbitrary proposals could be HOOK_INT = native (only possible if the entire match is done natively via the ParserState API), native HOOK_INT = { ... } (obviously sticks out, gives a place to put a pre-hook grammar), or HOOK_INT = ?{ ... } (fits with other rule modifiers, but as such very quiet; % is also available as a character). I very much would like to avoid introducing any new special name semantics; I remain sadge at the opaqueness of WHITESPACE = _{ WHITE_SPACE } as an actual meaningful thing to write without those names being mentioned anywhere else.

The save/restore interface also imho should be improved before it's exposed to consumers; it should either just be a Clone of the state (if we just care about it working) or something along the lines of fn save(&self) -> Self::Snapshot; fn restore(&mut self, &Self::Snapshot); (if minimizing cloned state matters more), with Self::Snapshot being stored on a stack managed by the parser (e.g. Stack::Snapshot = usize).

Ultimately, I don't have any strong reason to block the availability of this functionality in pest. It feels very "pest3" in that it's adding entirely new capabilities on top of the existing jank, but it's certainly a need that people feel, so it's valid to serve that need.

CAD97 avatar Feb 27 '23 00:02 CAD97

Thanks a lot for your information!

HOOK_INT = ?{ ... }

I personally would prefer this grammar.

The save/restore interface also imho should be improved before it's exposed to consumers; it should either just be a Clone of the state (if we just care about it working) or something along the lines of fn save(&self) -> Self::Snapshot; fn restore(&mut self, &Self::Snapshot); (if minimizing cloned state matters more), with Self::Snapshot being stored on a stack managed by the parser (e.g. Stack::Snapshot = usize).

Agree.

@skyzh would you be interested in dusting off https://github.com/pest-parser/pest3 and trying to add hooks there? Once it's somewhat useable, we can perhaps add pest3 stuff here under a feature flag and/or v3 modules until it's stabilized and there's tooling that can help people to switch to a new grammar?

I can take a look but I'm not sure if it will work 🤪 The pest3 repo doesn't seem to have development activities since 2 years ago...

won't this be a semver-breaking change?

I tried designing the API to be compatible but it seems that this will be a semver-breaking change. If users have manually implemented the Parser trait, probably they will need to change it... Therefore, I would probably make hook behind a feature gate.


I think people still have diverged opinions on whether to have hooks built-in pest. Probably it would be a good idea to leave this PR open and see if there's any need for this feature in the future?

Meanwhile, to make hooks more useable in pest, I'll probably maintain my version of pest in this branch and see if there're potential users that are interested in using that...

  • Add everything behind a feature gate "custom-state"?
  • Add pest grammar support for int = ?{ ... } that will eventually call hook_int in the parser implementation, instead of the current __HOOK hacks.
  • Make the StateCheckpoint API more intuitive, or clone the full user snapshot instead of giving maximum control to user.

Thanks for your reviews and comments!

skyzh avatar Mar 03 '23 00:03 skyzh

I can take a look but I'm not sure if it will work 🤪 The pest3 repo doesn't seem to have development activities since 2 years ago...

Yeah, the pest3 repo is somewhat abandoned -- so it'd rather mean restarting the work of that reworked / simplified pest codebase there.

In any case, you can probably proceed as you outlined by continuing the updates in this PR. I think potentially this PR could be merged if the hooks functionality is exposed under a non-default feature flag.

tomtau avatar Mar 03 '23 14:03 tomtau