instaparse icon indicating copy to clipboard operation
instaparse copied to clipboard

Feature request: namespaced non-terminals

Open or opened this issue 2 years ago • 10 comments

Hey, thanks for all the work, instaparse is amazing and fun to use. Apologies if I missed a similar previous discussions, I couldn't find anything on it.

I think it'd be great to have namespaced non-terminals. Ideally with name.space/foo = ..., which would naturally parse to [:name.space/foo ...]. This would allow grouping non-terminals to simplify some transformations. Of course it's possible to do that with prefixes or in some cases by grouping them with another non-terminal, but both make the transformation more complex and the grammar uglier (in my opinion). It also would allow some non-terminals to have the same keyword part, but in different namespaces/contexts, again that's possible with a prefix, but it's not quite the same and might require more cumbersome mapping in the transformation.

I saw that . was disallowed in 5322e65bd3adcba60125fb2fad3245a020cd58b2 as an alternative for ;, and / was disallowed in ef38fdc06771895d6a73184b6fce1fbeba1e1526 to use it for ordered choices.

I thought one could get around that to allow them only in explicit keyword-like non-terminals, e.g. :name.space/foo = ..., but of course : is parsed as a rule separator already.

Do you think it's possible to find an elegant way around this?

The best idea I had so far without breaking existing code/grammars is a "strict" build-parser option that enforces that all operators are surrounded by whitespace, and in that mode . and / are allowed inside non-terminals.

In case there's a partial solution: I think the / is more important than the ., because namespaces without dots would already be a win, but both would be first prize.

I'd happily try to draft a PR, if you think it's worth doing and can tell me which approach you'd prefer.

or avatar Apr 07 '22 09:04 or

I got a patch with a parser option specifically to allow namespaces, which adjusts the existing behaviour for . and / slightly: https://github.com/or/instaparse/commits/issue-211-namespaced-keywords (it looks bigger than it is, due to 2 unrelated trivial cleanup commits).

or avatar Apr 09 '22 19:04 or

Thanks for your pull request contribution and your interest in this change.

When I originally wrote instaparse, namespaced keywords weren't used very commonly; they became much more prevalent after datomic and spec came on the scene. So I used / and . as part of the grammar, and chose to be flexible on whitespace so formal specs drawn from other sources would be more likely to be usable as-is in instaparse. The consequence is that / and . in non-terminals are now problematic and support for namespaced keywords is awkward without changing some of the rules about how those characters are interpreted.

I care deeply about backwards compatibility, so I think you are right that if this is to be included, it can only be in the context of a new mode that is opt-in, so no existing instaparse grammars are broken. If the change is carefully isolated to the new mode and well-tested, and those characters retain their EBNF meanings when appropriate whitespace is added, then the main downsides are simply the added complexity of the code and updating the docs. Also, some thought needs to be given to whether the ABNF mode requires comparable changes for consistency.

I don't make changes to instaparse lightly at this point, so to justify that extra complexity, it would be helpful if you could elaborate a bit more on the benefits of direct support for namespaced keywords (as opposed to, say, prefixing your keywords with a prefix that is easy to recognize during the transform process to later convert them to namespaced keywords if needed).

Certainly it's hard for me to easily verify that the new regexes you've provided are bug-free, so also if you could give me some sense for your confidence in its correctness on both Clojure and Clojurescript, that would be helpful.

Engelberg avatar Apr 11 '22 08:04 Engelberg

Sure, my particular use case is a parser for "blazonry", a sort of medieval DSL to describe coats of arms. I'm working on a grammar for it here. I used a convention of uppercasing "terminals" (which really still are nts with some regex alternatives) and lowercasing other productions.

Some of those terminals can show up in several contexts but still have the same name, for instance ordinary/FESS and point/FESS, potentially with slightly different regex alternatives. And since I want to also parse blazonry in other natural languages while keeping the resulting AST somewhat similar, point/FESS and ordinary/FESS might not always actually have similar terminals, but that just as a sidenote on why I want these things separated even if the strings are the same at times.

All this could be done with prefixes, but I think namespaces are easier to read than ORDINARY-FESS or mix case with ordinary-FESS, my editor also can highlight the namespaces differently. In the transformation it is a little more annoying to get the :FESS out of the prefixed keyword to work with it, and dissecting keywords, which should be opaque, feels dirty.

Where prefixes stop working is when the namespaces are not necessarily known it advance. I don't have an actual use case for that, but it doesn't seem too farfetched.

Another situation I encountered before, which is kind of the opposite of the first case, is something like this:

foo/cmd = 'foo' | 'foo-alias'
foo/args = number number

bar/cmd = 'bar' | 'bar-alias'
bar/args = file number

other/cmd = 'other'
other/args = string*

S = foo/cmd foo/args
  | bar/cmd bar/args
  | other/cmd other/args

This could be done with cmd as the namespace instead: cmd/foo, etc., but in this example the transformation might want to throw away all namespaces as a first step, and then the transformation would find all these different commands in a :cmd node, their args in the :args node, without having to know about the namespaces in the grammar or requiring any code changes if more commands are added. I realize the example is a bit constructed, in this case the transformation could just always take the first node as the command node, but this isn't always possible. My grammar has rules where the order of children can vary, so being able to find it by :cmd, no matter which command it is, is great. Doing this with prefixes requires code changes if a new command shows up in the grammar. Or it needs some preprocessing that in advance knows all prefixed keywords like :cmd and :args and removes the prefix from any keyword that looks like *-cmd or *-args, but then an unrelated rule like default-args would be trouble. Namespaces sail around all these issues.

So the two benefits in my eyes are pleasant names in EBNF and being able to directly leverage all of Clojure's tooling around keywords during transformation, like destructuring namespaced keys or working with other libraries where namespaces are useful, for instance spec, rather than potentially complex manipulation of plain keywords.

As a concrete example, one thing I want to try in my code is to do transformations with multimethods like (defmulti transform-nt (comp namespace first)), which hands over transformation to some place that knows how to work with nts in that namespace, without having to disect the key or to know and mention all namespaces in the dispatch function.

The regex is not pretty. Perhaps this would be better as part of the grammar, actually! :) But I'm pretty confident that it works in Clojure and ClojureScript, it doesn't use anything fancy, and allows at most one /, because :a/b/c is valid in CLJ, but not in CLJS. I'll add some positive and negative tests for that. So far the branch is just a draft to sketch out the idea, I'll happily have a look at ABNF as well, and maybe you have a better idea of how to use the flag in cfg, that feels a little untidy.

Final thought: in any situation where several parsers could be combined, namespaces seem like a good idea. And if I'm not mistaken it's already possible to use instaparse with namespaced keywords by using combinators. That won't help me right now, because I want the more readable EBNF, but it means the string representation of those parsers currently could result in broken grammars, it certainly wouldn't pass the round-trip test. I doubt this has been a problem in reality, but it's worth mentioning.

or avatar Apr 11 '22 14:04 or

I believe you are correct that namespaced keywords are supported in the underlying grammar map, which suggests a simple-ish workaround if it is awkward to deal with hyphenated keywords at the transform stage -- deal with it before parsing at the grammar map stage!

So, you can build your parser using whatever hyphenated keywords you want, for example:

foo-cmd = 'foo' | 'foo-alias'
foo-args = number number

Then, you call (:grammar my-parser) to get at the grammar map.

Now, you can easily remap those keys to the namespaced keys of your choosing. You can do this simply enough with clojure's native constructs, but for convenience, you could use weavejester's medley library, which has a map-keys function:

(medley.core/map-keys {:foo-cmd :foo/cmd, :foo-args :foo/args} my-grammar-map)

Then, use instaparse's capability to turn the grammar map back into a Parser (you may need to restate the start rule and the output format).

I see in your grammar you have a rather large number of keywords and it would clearly be a hassle and error-prone to enumerate them all in a transform map. But you could have a scheme, for example -- or ~ represents / and then write a short function to do the conversion, and pass that function to map-keys like above.

So, let's say you have written a conversion function that converts keywords that fit some scheme into a namespaced keyword, passing through other things unchanged. Then, you can adjust your parser as follows:

(defn convert-parser-to-namespaced-keywords [parser]
  (assoc parser :grammar (medley.core/map-keys conversion-function (:grammar parser))
                        :start-production (conversion-function (:start-production parser))))

Short and sweet. You can do this at the point of building the parser, and then don't have to mess around with it in the context of your final transform pass of the output.

I think if I were doing instaparse from scratch today, given the current popularity of namespaced keywords, I would prioritize direct support for them. But given the way it currently works, I think the sort of workaround I describe here would be much less coding effort than building a new mode into instaparse. If you're in a hurry and just want results, that's what I would recommend.

However, if you are really motivated to work out the details of a new mode, I think the pull request would be useful.

Your point about round-tripping is good, so that's something I would want investigated as well, to make sure that the way these namespaced keyword parsers print can be read back in with the new mode.

Engelberg avatar Apr 11 '22 20:04 Engelberg

Post-processing the grammar is a good idea! Definitely better than doing it during transformation. And your suggestion to use some other characters instead of - would certainly deal with some of the potential problems I outlined. But it wouldn't make the EBNF much prettier, just seems like a disconnect to use ordinary~FESS in the grammar and then use ordinary/FESS in the code.

I'll try to refine the branch and create a proper PR, then you can decide whether it works for you. If not, then I can also just keep a fork, no worries. :)

The round-trip is already tested: https://github.com/or/instaparse/blob/issue-211-namespaced-keywords/test/instaparse/namespaced_nts_test.cljc#L24-L29, so that should work as long as the flag is set on re-parsing.

or avatar Apr 11 '22 20:04 or

A silly-but-maybe-useful idea to make the grammar prettier with minimal effort: Instead of / in your namespaced non-terminals, use Unicode 0338 ̸ which looks nearly identical. Then process the grammar.

Engelberg avatar Apr 12 '22 11:04 Engelberg

Heh, that's a bold idea, not something I'd find easy to work with in the editor.

But you gave me another idea: the grammar could use the normal / and then a preprocessing step could replace all those with something like --- (or whatever) in the string before the parser reads it, then afterwards a postprocessing step like the one you suggested could split at ---. It would be a solution with the current instaparse, as long as all / are used for namespaces or the replacement code is smart enough to only do it inside symbols and any order operator is surrounded by some whitespace.

Errors during parser creation would result in mismatches in the names and potentially the position, tho. It'd be a hack, and I think direct support is better, but it could work.

or avatar Apr 12 '22 11:04 or

I've had another go at this. The main change is a much simpler adjustment to simply allow . and / in non-terminals, without trying to enforce mandatory spaces anywhere else. The regex for nts itself also is much simpler now, only requiring the first character to not be / or ., as that wouldn't make sense for keywords and avoids parsing / and . on their own as nts. If a user wants to use a/b/c/d, then that's fine as well in Clojure, in ClojureScript the behaviour differs, but the same is true for such keywords in general in ClojureScript.

I think this is sufficient, because a user accidentally using a/b instead of a / b while :allow-namespaced-nts? is on would get an error message that makes it clear that a/b was parsed as an nt.

Finally, I looked into ABNF, and I don't think the change is needed or appropriate there. https://datatracker.ietf.org/doc/html/rfc5234#section-2.1 is rather strict when it comes to the names allowed. The feature could be limited to EBNF, just like the larger charset in general already is limited to EBNF.

or avatar Jun 18 '22 16:06 or

@Engelberg: did you get a chance to look at the patch?

or avatar Jul 25 '22 15:07 or

Is there a pull request?

Engelberg avatar Jul 25 '22 20:07 Engelberg