grmtools Feature Request: All Rules are "Top-Level" Functions

This is a great project and I've used it in a handful of Rust applications. Thanks!

There is one feature of the OCaml tools (ocamllex and ocamlyacc) that I found convenient. In the code that ocamlyacc generates, all the grammar rules create functions you can call (useful if you need to parse a subset of the grammar).

For instance, I'm working on a project that uses a string to describe data acquisition. There's a "name" field, an optional range specification, optional field names, and a field for the event on which to sample. All of that is straightforward to implement. However, we also have an API that wants just the event portion and another API that wants just the device name.

We could have a struct with a bunch of Option fields, but I'd rather be able to have Event as a parameter to a function.

If this feature would be too disruptive, I wonder how others would solve this with the current tools.

Feb 03 '25 18:02 rneswold

Right now, grmtools doesn't generate functions in this way. I guess it's possible to do so and to call "into the middle of" the LR statetable. I haven't thought about that before and it might need a bit of thought about exactly what it means.

In the interim, I think there is a (horrible) hack one can do: you can duplicate the grammar (including in a build.rs file), change the %start line and output to a different file(s).

Feb 03 '25 18:02 ltratt

Comes to mind that the horrible hack seems very likely to produce 'unused rule' and 'unused token' warnings/errors. You'll likely need to set at least warnings_are_errors to false, and more than likely you'll want to set show_warnings to false entirely.

Feb 03 '25 18:02 ratmice

@ratmice Definitely! It would be good to do something nicer here, though an interesting question is what "unused" means if you have multiple start rules.

Feb 03 '25 18:02 ltratt

What if I created a grammar for each portion and then a grammar that called the subsets? I'd have to know where the previous parse ended in order to feed the next one. It'd be nicer to have it all in one module, but this might be doable...
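
The chaining idea can be sketched without grmtools at all: each sub-parser returns both its result and the unconsumed remainder, and the full parser feeds each remainder to the next one. This is only an illustration under assumptions: the parsers are hand-written, the names `take_device`, `take_event` and `parse_full` are hypothetical, and the `device@event` syntax is invented for the example rather than taken from the real specification string.

```rust
/// Take a leading device name (ASCII letters, digits, '_' or ':').
/// Returns the name and the unconsumed remainder of the input.
fn take_device(input: &str) -> Option<(&str, &str)> {
    let end = input
        .find(|c: char| !(c.is_ascii_alphanumeric() || c == '_' || c == ':'))
        .unwrap_or(input.len());
    if end == 0 {
        None
    } else {
        Some(input.split_at(end))
    }
}

/// Take an event written as '@' followed by one or more ASCII letters or digits.
/// Returns the event text (without the '@') and the unconsumed remainder.
fn take_event(input: &str) -> Option<(&str, &str)> {
    let body = input.strip_prefix('@')?;
    let end = body
        .find(|c: char| !c.is_ascii_alphanumeric())
        .unwrap_or(body.len());
    if end == 0 {
        None
    } else {
        Some(body.split_at(end))
    }
}

/// Parse a full spec by handing each remainder to the next sub-parser.
/// Succeeds only if the whole input is consumed.
fn parse_full(input: &str) -> Option<(&str, &str)> {
    let (device, rest) = take_device(input)?;
    let (event, rest) = take_event(rest)?;
    if rest.is_empty() {
        Some((device, event))
    } else {
        None
    }
}
```

With this shape, an API that only needs the event can call `take_event` directly, while `parse_full` reuses the same pieces. With generated parsers, the equivalent of the remainder would have to come from wherever the previous parse stopped in the input.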

Feb 03 '25 18:02 rneswold

> Comes to mind that the horrible hack seems very likely to produce 'unused rule' and 'unused token' warnings/errors.

I would delete the unused rules in the "horrible hack". However, I was hoping to use the same lex file, so the "unused token" warnings would be a problem.

Feb 03 '25 18:02 rneswold

> @ratmice Definitely! It would be good to do something nicer here, though an interesting question is what "unused" means if you have multiple start rules.

Interesting question, my inclination would be to define unused as unreachable from any start rule.

I'm assuming this is considering some sort of feature that lifts the (current) restriction that there is a single start rule, and making some sort of parser entry point for each start rule?

So for the purposes of checking unused rules, each start rule would be treated as though it were a production of an implicit rule, as in the following.

%start start1 start2
^: start1 | start2

I don't think it would be hard to modify the unused_symbols function that does these checks to work in that way at least.
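
As a rough sketch of that definition of "unused" (unreachable from any start rule), here is a reachability walk over a toy grammar. It does not reflect how grmtools' unused_symbols is actually implemented: the `Grammar` type, `unused_rules`, `sample_grammar`, and the rule and token names are all hypothetical and exist only for this illustration.

```rust
use std::collections::{BTreeSet, HashMap};

/// A toy grammar: each rule name maps to its productions, and each
/// production is a list of symbol names. Any symbol that is not a key
/// in the map is treated as a token.
type Grammar = HashMap<&'static str, Vec<Vec<&'static str>>>;

/// Return the rules that cannot be reached from any of `starts`.
fn unused_rules(grammar: &Grammar, starts: &[&'static str]) -> BTreeSet<&'static str> {
    // Seeding the worklist with every start rule is equivalent to adding
    // the implicit rule `^: start1 | start2` and walking from `^`.
    let mut reached: BTreeSet<&'static str> = BTreeSet::new();
    let mut todo: Vec<&'static str> = starts.to_vec();
    while let Some(rule) = todo.pop() {
        if !reached.insert(rule) {
            continue; // already visited
        }
        if let Some(prods) = grammar.get(rule) {
            for sym in prods.iter().flatten() {
                // Only rules are followed; tokens have no productions.
                if grammar.contains_key(sym) {
                    todo.push(*sym);
                }
            }
        }
    }
    grammar
        .keys()
        .copied()
        .filter(|r| !reached.contains(r))
        .collect()
}

/// A small made-up grammar used to exercise `unused_rules`.
fn sample_grammar() -> Grammar {
    HashMap::from([
        ("spec", vec![vec!["device", "event"]]),
        ("device", vec![vec!["NAME"]]),
        ("event", vec![vec!["AT", "NAME"]]),
        ("range", vec![vec!["LBRACK", "INT", "RBRACK"]]),
    ])
}
```

In this toy grammar, starting from `device` and `event` leaves `spec` and `range` unused, while starting from `spec` alone leaves only `range` unused. The same walk could be extended to collect the tokens that were seen, which would give an "unused token" check under the same definition.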

Feb 03 '25 19:02 ratmice

> So for the purposes of checking unused rules, each start rule would be treated as though it were a production of an implicit rule, as in the following.
>
> %start start1 start2
> ^: start1 | start2
>
> I don't think it would be hard to modify the unused_symbols function that does these checks to work in that way at least.

This is nice because I really don't need every rule to be top-level. Out of the half-dozen sections of my string, I really only need 3 field parsers. But it might be too complicated to have some rules be top-level callable and others purely internal.

Also, each of these start targets is probably a different type (in my use, that's definitely the case.)

Feb 03 '25 19:02 rneswold

> Also, each of these start targets is probably a different type (in my use, that's definitely the case.)

Ahh, yeah there are definitely complexities with this multiple start rules idea, that is one I hadn't considered. Another is that IIRC the start rule is given index 0 by default, rather than something like rules.len() which might be a more expandable location. But at least I think it seems like a reasonable interpretation of unused.

IIRC anyways, though I couldn't remember exactly where to look off-hand to verify this 0 index rule in the moment.

Feb 03 '25 19:02 ratmice

Is there a way to import .y files in other .y files? By 'inverting' the dependencies, it is probably prettier than copying the file many times.

Nvm I saw #110

Feb 06 '25 13:02 ajuvercr

Composition in a general sense is a very hard problem. This issue (relative to my memory of #110) is more limited: it's asking to subset an existing grammar. My intuition is that subsets are always OK -- we just don't happen to support taking advantage of that right now.

For example, if we generated Rust code directly, specifically one Rust function per rule, I suspect this would fall out of the hat. Doing so isn't rocket science (I think it's what e.g. LALRPOP does), but someone has to put in the hard yards.

Feb 06 '25 13:02 ltratt

I think what I'll do is use an enumeration. Something like:

enum DAQSpec {
    FullSpec { device_name: String,
               field: String,
               range: Option<Range<usize>>,
               event: Event },
    DeviceSpec(String),
    EventSpec(Event)
}

The grammar can recognize when a subset is specified and return the smaller-scoped enum values. Then I can make some simple wrapper functions:

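fn parse_event(spec: String) -> Option<Event> {
    // `parse_spec` is a placeholder for running the generated parser on the
    // input; it is assumed here to return an Option<DAQSpec>.
    if let Some(DAQSpec::EventSpec(ev)) = parse_spec(&spec) {
        Some(ev)
    } else {
        None
    }
}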
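
The enum-plus-wrapper approach can be sketched end to end as follows. Everything beyond the enum shape and the wrapper idea is an assumption made for the sake of a compilable example: the `Event` newtype, the hand-written `parse_spec` standing in for the generated parser, the `parse_device` wrapper, and the invented `device.field@event` syntax.

```rust
use std::ops::Range;

/// Hypothetical event type; the real one would come from the application.
#[derive(Debug, PartialEq)]
struct Event(String);

#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum DAQSpec {
    FullSpec {
        device_name: String,
        field: String,
        range: Option<Range<usize>>,
        event: Event,
    },
    DeviceSpec(String),
    EventSpec(Event),
}

/// Hypothetical stand-in for the generated parser. It accepts "@event",
/// "device", or "device.field@event" (the range is omitted in this sketch).
fn parse_spec(spec: &str) -> Option<DAQSpec> {
    match spec.split_once('@') {
        // Only the event portion was given.
        Some(("", ev)) if !ev.is_empty() => Some(DAQSpec::EventSpec(Event(ev.to_string()))),
        // The full form was given.
        Some((lhs, ev)) if !ev.is_empty() => {
            let (device, field) = lhs.split_once('.')?;
            Some(DAQSpec::FullSpec {
                device_name: device.to_string(),
                field: field.to_string(),
                range: None,
                event: Event(ev.to_string()),
            })
        }
        // Only the device name was given.
        None if !spec.is_empty() => Some(DAQSpec::DeviceSpec(spec.to_string())),
        _ => None,
    }
}

/// Accept the input only if it is a bare event.
fn parse_event(spec: &str) -> Option<Event> {
    match parse_spec(spec)? {
        DAQSpec::EventSpec(ev) => Some(ev),
        _ => None,
    }
}

/// Accept the input only if it is a bare device name.
fn parse_device(spec: &str) -> Option<String> {
    match parse_spec(spec)? {
        DAQSpec::DeviceSpec(name) => Some(name),
        _ => None,
    }
}
```

The wrappers keep the narrower types (`Event`, `String`) in the API signatures, at the cost of running the full grammar and discarding any result of the wrong shape.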

Thanks for the discussion!

Feb 06 '25 16:02 rneswold