An alternative syntax for terminals

Open dsuch opened this issue 5 years ago • 9 comments

Hello,

this is a small feature request.

The current syntax is to require all terminals be UPPER_CASE and, if I understand it correctly, load_grammar.py:TERMINALS is the place where it is implemented as a regexp.

However, I work with a grammar that has dozens and dozens of terminals and, particularly if there are terminals that refer to other terminals, it becomes a bit difficult to find one's way around it all because of the caps.

I am just finding that a grammar becomes difficult to read over time. If this was a terminal here and there once in a while, like in simpler grammars, that would be fine, but with a larger number of them, this is just difficult to read.

My suggestion is to add an alternative syntax, such as $terminal_name, in addition to TERMINAL_NAME.

I tried to do it in load_grammar.py myself but I ran into exceptions that I could not resolve, hence I am unfortunately not able to submit a PR - assuming that this feature request is accepted at all.

As a kind of dry run, I changed TERMINAL_NAMES to $terminal_names in the grammar and everything became much quieter, without the all-caps distracting so much.

I am very happy with how Lark works, and I thank you for its creation, but this particular part is something that I would truly suggest giving a thought.

Thank you.

dsuch avatar Aug 02 '20 22:08 dsuch

Well, I got it to work by using an escaped $. It is probably not that much work (only the places where we check with isupper have to be changed), but I am not convinced this is a good idea. But @erezsh has the final say.

Both styles should be supported, both for backwards compatibility and for the common.lark file. Using an option like alternative_terminals=True would probably be a bad idea because of the requirement for a duplicated LoadGrammar instance and for imports. I am also not sure if it should be _$discarded_terminal or $_discarded_terminal.

MegaIng avatar Aug 03 '20 14:08 MegaIng

Hi @dsuch,

It's a bit of a subjective experience, although I admit I never had to work with so many terminals. Do you mind if I ask what you are parsing, and why you need so many?

I'm concerned about two things re your request:

  1. Creating many ways to do the same things can cause confusion. There's a reason it's part of the Python Zen.

  2. I think rules and terminals should be easy to distinguish

But if we're already breaking these rules somewhat, why not just allow CamelCase terminals?

DiscardedTerminal: "foo"

erezsh avatar Aug 03 '20 14:08 erezsh

Thank you both for your input!

What I am parsing is source code of a programming language with a complexity comparable to that of Python or Java, perhaps somewhat simpler.

I tend to introduce terminals acting as named variables so as not to hardcode too much in the grammar, e.g.:

// Quote characters
QUOTS: "'"
QUOTD: "\""

// Any number of non-whitespace Unicode characters
UNI_NON_WS: /\w*/

// As above, but including a single quote character
UNI_NON_WS_QUOTS: /['\w]*/

// As above, but including a double quote character
UNI_NON_WS_QUOTD: /["\w]*/

// Single-line string characters - always enclosed in matching quotes
STR_QUOTS:  QUOTS UNI_NON_WS_QUOTD QUOTS
STR_QUOTD:  QUOTD UNI_NON_WS_QUOTS QUOTD
STR_SINGLE: STR_QUOTS | STR_QUOTD

The more complex the rules and terminals become, the more distracting it is that they are all in uppercase - please note that the above is already shortened for clarity. For instance, STR_QUOTS and STR_QUOTD were previously STRING_QUOTE_SINGLE and STRING_QUOTE_DOUBLE, which led to something like STRING_SINGLE: STRING_QUOTE_SINGLE | STRING_QUOTE_DOUBLE. And that is just a short example of a simple string.

I really like the idea of CamelCase terminals - I tried it on the example above and it gave:

// Quote characters
QuoteSingle: "'"
QuoteDouble: "\""

// Any number of non-whitespace Unicode characters
UniNonWS: /\w*/

// As above, but including a single quote character
UniNonWSQuoteSingle: /['\w]*/

// As above, but including a double quote character
UniNonWSQuoteDouble: /["\w]*/

// Single-line string characters - always enclosed in matching quotes
StringQuoteSingle:  QuoteSingle UniNonWSQuoteDouble QuoteSingle
StringQuoteDouble:  QuoteDouble UniNonWSQuoteSingle QuoteDouble
StringSingle: StringQuoteSingle | StringQuoteDouble

To me, this is more readable because the names do not have to be short and, at the same time, the terminals are not overly prominent.

As to whether a new keyword would be required - please correct me if I am wrong but would it not suffice to extend the regexp pattern in load_grammar.py to include an "|" for an "or"? In this way, both the previous and CamelCase syntax would be supported?
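A minimal sketch of what such an extended pattern could look like, using only Python's re module. The all-caps pattern below is an assumption about what load_grammar.py uses, not a copy of it, and the CamelCase pattern, the rule pattern and the classify helper are hypothetical names made up for illustration:

```python
import re

# Assumed shape of the existing all-caps terminal pattern (not copied from Lark).
UPPER_TERMINAL = r"_?[A-Z][_A-Z0-9]*"
# Hypothetical CamelCase alternative: optional leading underscore,
# an uppercase first letter, then letters and digits in any case.
CAMEL_TERMINAL = r"_?[A-Z][A-Za-z0-9]*"

# Both styles joined with "|" so that either one is accepted.
TERMINAL_RE = re.compile(rf"(?:{UPPER_TERMINAL}|{CAMEL_TERMINAL})")
# Assumed rule pattern: lowercase first letter.
RULE_RE = re.compile(r"_?[a-z][_a-z0-9]*")


def classify(name: str) -> str:
    """Return 'terminal', 'rule' or 'invalid' for a grammar name."""
    if TERMINAL_RE.fullmatch(name):
        return "terminal"
    if RULE_RE.fullmatch(name):
        return "rule"
    return "invalid"


for name in ("STR_QUOTS", "StringQuoteSingle", "str_single", "$terminal"):
    print(name, "->", classify(name))
```

Since the first letter is uppercase in both styles, rules and terminals remain distinguishable by that letter alone.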

dsuch avatar Aug 03 '20 15:08 dsuch

An extreme implementation of this would allow nim style-insensitive identifiers everywhere.

  • Only case of first letter is relevant
  • Underscores are ignored (for lark, the first one wouldn't be ignored)

This means all of these are equivalent: HelloWorld, Hello_world, Hel_lowor_ld, but these would be different names: hello_world, helloWorld, _HelloWorld, _helloworld.

Obviously we can keep the distinction between rules and terminals based on the case of the first letter, so it would be fully backward compatible. (Well, except for the underscore insignificance: that might be a problem if both helloworld and hello_world are defined in a grammar. But we can just add an option that now defaults to true and will later default to false.)
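The two bullet points above could be sketched roughly as follows. This is only one reading of the idea, not Nim's or Lark's actual code; normalize and is_terminal are hypothetical helper names. Note that under this reading hello_world and helloWorld collapse to the same name, while both stay distinct from HelloWorld:

```python
def normalize(name: str) -> str:
    """Style-insensitive key for a grammar name.

    - A single leading underscore is kept (it is significant in Lark).
    - All other underscores are dropped.
    - Only the case of the first letter is kept; the rest is lowercased.
    """
    prefix = "_" if name.startswith("_") else ""
    body = name[len(prefix):].replace("_", "")
    if not body:
        return prefix
    return prefix + body[0] + body[1:].lower()


def is_terminal(name: str) -> bool:
    """Terminals are told apart from rules by an uppercase first letter."""
    key = normalize(name).lstrip("_")
    return bool(key) and key[0].isupper()


for name in ("HelloWorld", "Hello_world", "Hel_lowor_ld",
             "hello_world", "helloWorld", "_HelloWorld", "_helloworld"):
    print(f"{name:14} -> {normalize(name):12} terminal={is_terminal(name)}")
```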

MegaIng avatar Aug 03 '20 16:08 MegaIng

@MegaIng Have you read/written a lot of Nim code?

erezsh avatar Aug 03 '20 16:08 erezsh

@erezsh Not a lot, but I am currently working on a few small projects, including porting lark. Why?

MegaIng avatar Aug 03 '20 16:08 MegaIng

I have not used Nim but I spent a lot of time with case-insensitive languages and I can say without any doubt that it inevitably becomes a big maintenance burden if people are allowed to use such an approach in a project.

One person will use one convention, another group will use something else and after two years it is simply impossible to maintain such code.

The only way to make it usable has always been to enforce project-wide conventions where everyone uses the same kind of syntax.

I realise that, to us programmers, there are few limits, but I would very much encourage you not to go down that path.

dsuch avatar Aug 03 '20 16:08 dsuch

CamelCase terminals would be very nice. I personally hate SCREAMING_CAPS

charles-esterbrook avatar Nov 11 '20 15:11 charles-esterbrook

In case someone wanted to implement it, I have an uncamelify function here in Zato.

This was not added in relation to Lark per se but the regular expression will work here too, feel free to make use of it.
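For readers without the Zato source at hand, a conversion in that spirit might look like the sketch below. It is not the Zato function; camel_to_upper_snake and its regular expression are hypothetical and written only for this illustration, mapping CamelCase names back to the existing UPPER_CASE convention:

```python
import re

# Zero-width word boundaries inside a CamelCase name:
#  - between a lowercase letter or digit and an uppercase letter (e|S in QuoteSingle)
#  - between an acronym and the next capitalised word (S|Q in WSQuote)
_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def camel_to_upper_snake(name: str) -> str:
    """Convert a CamelCase terminal name to UPPER_SNAKE_CASE."""
    return _BOUNDARY.sub("_", name).upper()


for name in ("QuoteSingle", "UniNonWSQuoteDouble", "StringSingle", "STR_QUOTS"):
    print(name, "->", camel_to_upper_snake(name))
```

Names that are already in upper case pass through unchanged, so both styles could be reduced to one internal form.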

dsuch avatar Nov 13 '20 14:11 dsuch