
Serde support for `ParseGenericTree`

Open ratmice opened this issue 6 months ago • 10 comments

Currently the parse_generictree function can be used to return a Node; from there, Node::pp can convert the parse tree into a textual representation, as used by nimbleparse.

One thing I'm slightly interested in is extending the output format to be more machine readable. As such, one thought is that we could include a function similar to Node::pp that also takes a serde::Serializer and, instead of returning a String, returns something from the serde data model.

Here are a couple of questions:

  1. This can likely be done outside of the repo, in the same way that we can copy/paste the pp function implementation and build it outside the crate as part of a binary to customize the output. So: in repo or out?
  2. It isn't clear to me what exactly the best translation into the serde data model would be.
  3. What should the type signature / type of the return value be?
  4. Reuse Node or not?
  5. We would likely need to choose some specific Serializer impl for e.g. nimbleparse/test_files.
  6. Does it then make sense to use the serde data model at all, or to just use a chosen format directly?
In repo or out?

One reason to include this in the repo is that it could then be used by the test_files, which would make it possible to introduce a known file extension containing both the parser input and a serialized form of the parser output.

That seems to me like the most direct argument for its inclusion, but usage in tools like nimbleparse would likely also mean deciding on a concrete Serializer type. So perhaps it is worth experimenting with this outside the repo before making those kinds of decisions.

AST translation into serde data model

One way would be to just use serialize_map with String keys, recursively using serialize_map for the values. This seems possible because serialize_map allows duplicate keys. However, I fear that only a subset of serde Serializer implementations will actually accept duplicate keys.

So another option would be to use something more S-expression-like, using serialize_seq combined with serialize_tuple.
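As a rough illustration of the seq/tuple approach (using a hypothetical, simplified node type rather than lrpar's real, generic Node), a manual Serialize impl could look something like the following: a terminal becomes a ("NAME", "matched text") pair and a nonterminal a ("NAME", [children...]) pair, which bottoms out in serialize_tuple and serialize_seq respectively.

use serde::ser::{Serialize, Serializer};

// Hypothetical, simplified node type for illustration only.
enum SimpleNode<'a> {
    Term { name: &'a str, text: &'a str },
    Nonterm { name: &'a str, children: Vec<SimpleNode<'a>> },
}

impl<'a> Serialize for SimpleNode<'a> {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        match self {
            // ("NAME", "matched text")
            SimpleNode::Term { name, text } => (name, text).serialize(serializer),
            // ("NAME", [child, child, ...])
            SimpleNode::Nonterm { name, children } => (name, children).serialize(serializer),
        }
    }
}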

In theory, regarding the first question, we could still choose the map approach even if it only works for a subset of Serializer implementations, if it is the most natural fit for e.g. nimbleparse's chosen output format, with the idea that the tuple/seq approach remains possible outside the repo.

Type signature

It's been a while since I've worked with the Serializer trait directly. Presumably we need to, because the signature of pp doesn't stand alone in a way that would let us simply add Serialize bounds: it requires the source string to turn indices into names at various points. This would seem to leave us with something like the following signature.

pub fn serde_serialize<S: serde::Serializer>(&self, grm: &YaccGrammar<StorageT>, input: &str, serializer: S) -> ???

But it isn't entirely clear to me what it should return, as S::Ok doesn't have any trait bounds like Write, etc. Looking specifically at serde_json::Serializer::into_inner leads me to believe that this should return S.
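For comparison, serde's own Serialize trait consumes the serializer by value and hands back S::Ok rather than the serializer itself:

pub trait Serialize {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: serde::Serializer;
}

So one option would be to mirror that and return Result<S::Ok, S::Error>, while returning S itself would be closer to the serde_json into_inner pattern.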

Reusing Node

Due to the complexity of the third question (involving the Serializer, and not having a good Self type for a Serialize impl), perhaps it'd be better to just try to implement Serialize on a newtype around Node:

#[derive(Serialize)]
struct SerializableNode<'a, LexemeT, StorageT>(Node<LexemeT, StorageT>, &'a str);

Or have a parse_serde which returns a SerializableNode?

Choice of format for nimbleparse/test_files

I personally think the ron format is likely to be one of the nicest formats for test_files, because it has Rust's r#"raw string"# syntax, which makes embedding arbitrary input text alongside a serialized AST straightforward.

Just use format directly and skip serde?

Given all the issues, and the need to bless a serialization format within binaries like nimbleparse, perhaps it is worth skipping the serde data model and just using the chosen file format directly, with something like pp_ron(&self, &str) -> String.

In many ways it seems like we're introducing the complexity of the generic serde data model but, because usage would be via binaries, we might not actually benefit from that genericity.

Final thoughts

This is much longer, and will probably require more thought and effort, than I initially imagined. But I really don't feel like I've tinkered with it enough (or at all, really) to have formed any real opinions, informed or strong.

ratmice, Jun 05 '25 09:06

So after thinking it over, I think my inclination is not to try to reuse Node, but to try something like the SerializableNode mentioned above instead. Besides simplifying things, another benefit is that it can potentially also include the source string in the serialized output. That would also make it useful for generating the input strings.

I don't see a whole lot in the way of downsides. We likely want an alternative to parse_generictree, but in theory one could be constructed from just the output of parse_generictree and a source string.

ratmice, Jun 07 '25 08:06

In retrospect, parse_generictree is probably overspecific. I wonder if we can find a more generic (no pun intended) API that lets the user specify whatever functions / types they want? Then parse_generictree (or serializedtree or whatever) could be expressed in terms of that.

An obvious question is "can this be expressed already?" Theoretically "yes", but in practice "pretty much no": parse_actions requires the user to create a very scary array of functions that matches the number of productions in the grammar.

So I wonder if we can find something that's a sort of super simplified parse_actions where we only require the user to give us 1 or 2 functions? I haven't thought deeply about this -- and it might run into all sorts of type / borrow checker problems -- but perhaps something like this?

fn parse_map<FTerm, FNonterm, Node>(fterm: FTerm, fnonterm: FNonterm)
where
    FTerm: Fn(LexemeT) -> Node,
    FNonterm: Fn(RIdx<StorageT>, Vec<Node>) -> Node;

Where the aim is that the user can construct their own Node type rather than being forced to use our enum. Perhaps we even need to allow them to specify TermNode and NonTermNode types to be sufficiently flexible?
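To make that concrete, here is a rough sketch of the caller's side with a user-defined node type; everything in it (the node type, the closures, the hand-built tree) is illustrative of the shape being discussed, not an actual grmtools API.

// A node type owned by the user rather than by grmtools.
#[derive(Debug)]
enum MyNode {
    Term(String),
    Nonterm(String, Vec<MyNode>),
}

fn main() {
    let input = "2*3";

    // The two closures a parse_map-style API would ask for: one builds a
    // node from a terminal's matched text, the other from a rule name plus
    // the already-built child nodes.
    let fterm = |text: &str| MyNode::Term(text.to_string());
    let fnonterm = |name: &str, children: Vec<MyNode>| MyNode::Nonterm(name.to_string(), children);

    // Built by hand here purely to show the shape the closures produce; in
    // the proposed API the parser would drive these calls itself.
    let tree = fnonterm(
        "Expr",
        vec![fterm(&input[0..1]), fterm(&input[1..2]), fterm(&input[2..3])],
    );
    println!("{tree:?}");
}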

ltratt, Jun 09 '25 07:06

I think I follow, e.g. if you add a self and a lexer to mirror e.g. https://docs.rs/lrpar/latest/lrpar/struct.RTParserBuilder.html#method.parse_generictree. There is still (at least) one thing that is used by the Serialize impl that isn't currently passed to parse_*, which is the input src &str; we'd likely need to use the param argument of parse_actions to pass that in. I don't (off-hand) think there are borrow checker issues, because it's a shared pointer, even though the lexer is also going to borrow it?

But I will note one thing: in the prototype PR that I made, the entirety of the SerializableNode implementation (including the serde trait impl) can actually just be copy/pasted and built in a downstream crate against the current releases. But once you have that Node wrapper (in or outside of grmtools), it's currently as simple as

serde_json::to_string(&SerializableNode::new(grm, src, rt_parser.parse_generictree(...).0.unwrap()))

I think a parse_serializabletree could be a simple variation on that, but I don't think we can actually produce a serializedtree, because functions like serde_json::to_string are not part of the serde interface, and each serializer (json, ron, and so on) has a different interface for that.

So I don't think it's as bad as being "pretty much no" in practice, but it's not necessarily the most obvious impl. Or at least it is a slightly jumbled variation of what we currently have regarding ownership.

ratmice, Jun 09 '25 08:06

There is still (at least) one thing that is used by the Serialize impl that isn't currently passed to parse_* which is the input src &str

I'm not sure but I think that the user can just grab that in the closure? So I reckon they can do something like:

let s = "...";
parse_map(|lexeme| &s[lexeme.start..lexeme.end], ...)

?

ltratt, Jun 09 '25 09:06

Yeah, that definitely seems true, in the same way we can use it as a parse_param because &str: Copy, meaning we can also just move it into the closure without difficulty.

ratmice, Jun 09 '25 09:06

FWIW, the more I think about it, the more I think they need to specify separate TermNode and NontermNode types. But it does feel like this generic interface would be much better than parse_generictree overall.

ltratt, Jun 09 '25 09:06

Well, I'll do my best. From initial poking the lifetimes do seem quite intricate, because the actual closures seem like they need to be routed through the %parse_param extra parameter, so it includes all the intricacy of that.

But beyond that I'm also uncertain what to do for the AStackType in this case. Anyhow, it seems like it's going to take me at least a few days to understand the details: I see where we invoke fterm, but fnonterm hasn't really sunk in.

ratmice, Jun 09 '25 10:06

But beyond that I'm also uncertain what to do for the AStackType in this case.

I don't think we should expose that in parse_map: they should, I think, get some sort of Vec<T> where T is some type of their own choosing (probably an enum that contains NodeTerm and NodeNonterm, but that's up to them).

ltratt, Jun 09 '25 11:06

But beyond that I'm also uncertain what to do for the AStackType in this case.

I don't think we should expose that in parse_map: they should, I think, get some sort of Vec<T> where T is some type of their own choosing (probably an enum that contains NodeTerm and NodeNonterm, but that's up to them).

I managed to figure it out, and it was just a misunderstanding of my own making: I had put the Vec<T> in the wrong place, where it then crept into other generic types somewhat algebraically, into places where we merely had a T. One of those cases where you fix the problem by just not doing whatever dumb thing it was I was doing :)

ratmice, Jun 10 '25 05:06

So far it seems to me like the output with the least amount of extra type overhead is the following:

#[derive(Serialize, Deserialize)]
#[serde(untagged)]
pub enum ASTMap<'a> {
    /// A terminal: (token name, matched source text).
    Term(&'a str, &'a str),
    /// A nonterminal: (rule name, child nodes).
    NonTerm(&'a str, Vec<ASTMap<'a>>),
}

When serialized by the ron pretty printer it looks like:

("Expr", [
    ("Expr", [
        ("Term", [
            ("Term", [
                ("Factor", [
                    ("INT", "2"),
                ]),
            ]),
            ("*", "*"),
            ("Factor", [
                ("INT", "3"),
            ]),
        ]),
    ]),
    ("+", "+"),
    ("Term", [
        ("Factor", [
            ("INT", "4"),
        ]),
    ]),
])

Ron's readme says it doesn't have full support for serde's untagged attribute, but afaict it does appear to work with this structure, in the sense that it round-trips correctly.
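A round-trip check along those lines (assuming the ASTMap above with its serde derives, and using ron's to_string_pretty / from_str) could look roughly like:

use ron::ser::PrettyConfig;

// Sketch of a round-trip check for the ASTMap above. Comparing the
// re-serialized text sidesteps needing a PartialEq derive on ASTMap.
fn check_roundtrip(ast: &ASTMap<'_>) -> bool {
    let text = ron::ser::to_string_pretty(ast, PrettyConfig::default()).unwrap();
    let back: ASTMap<'_> = ron::from_str(&text).unwrap();
    ron::ser::to_string_pretty(&back, PrettyConfig::default()).unwrap() == text
}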

ratmice, Jun 11 '25 15:06

I think this one can be closed now that #592 is in; it should be possible for users to implement this themselves using the parse_map API.

ratmice, Oct 15 '25 21:10