sbnf
sbnf copied to clipboard
Some patterns generate unexpected redundant code
For some recursive or mutually recursive rules, the resulting code has redundancies that a human-written version wouldn't have. Unfortunately I'm too inexperienced to tell whether it's SBNF or my rules.
Consider this over-simplified example:
main = `[` (~main)* ~`]`;
SBNF generates the following contexts:
contexts:
# Rule: main
main:
- match: '\['
set: main|0
- match: '\S'
scope: invalid.illegal.mock
# Rule: main
main|0:
- match: '\['
push: main|0
- match: '\]'
pop: true
The human version would simply re-include main
to avoid duplication:
contexts:
main:
- match: '\['
push: pop-block
pop-block:
- match: '\]'
pop: true
- include: main
This becomes a big deal in syntaxes with many mutually recursive patterns. I just translated a work-in-progress LLVM syntax, whose .sublime-syntax
version has 169 total lines, and the SBNF output has 1211 lines because many rules got duplicated inline instead of using include
. Can provide the code if needed.
To better demonstrate duplication, consider this hypothetical over-simplified syntax for S-expressions (untested):
main = (~symbol|dot|list)*;
symbol = '[^\s().]+';
dot = `.`;
list = `(` (~main)* ~`)`;
The SBNF output clearly has duplication we don't want:
contexts:
# Rule: list
list|0:
- match: '[^\s().]+'
push: main
- match: '\.'
push: main
- match: '\('
push: [main, list|0]
- match: '\)'
pop: true
# Rule: main
main:
- match: '[^\s().]+'
- match: '\.'
- match: '\('
push: list|0
I hope I'm just doing this wrong. 🙂 Suggestions are appreciated. That said, even if the compiler already supports the "right" ways of doing this, it's too easy to write it the "wrong" way that still seems to work but generates inferior output. We might want to figure out how to safeguard the users from this.
Just found that this can be reproduced with a simpler pattern. Logging just in case.
main = `[` main* `]`;
This isn't equivalent to the example in the original post, since this emits an additional pattern for '\S'
→ illegal.
Realized that including (~main)*
defined as (~a|b|c)*
is equivalent to (~(~a|b|c)*)*
. The writer can alleviate this with separate definitions:
main = (~any)*;
any = parens|braces|ident;
parens = `(` (~any)* ~`)`;
braces = `{` (~any)* ~`}`;
ident = '\b[[:alpha:]_][[:alnum:]_]*\b';
Currently this changes the generated output, making it slightly simpler compared to this:
main = (~parens|braces|ident)*;
parens = `(` (~main)* ~`)`;
braces = `{` (~main)* ~`}`;
ident = '\b[[:alpha:]_][[:alnum:]_]*\b';
Intuitively, I just want to write the second one, and expect the compiler to treat (~(~(~(~X)*)*)*)*
at any depth as equivalent to (~X)*
, but I might also be misunderstanding what they mean...
Also, trying to write the following gives a NYI panic:
main = (~parens|braces|ident)*;
parens = `(` main ~`)`;
braces = `{` main ~`}`;
ident = '\b[[:alpha:]_][[:alnum:]_]*\b';
main = (~parens|braces|ident)*;
parens = `(` main `)`;
braces = `{` main `}`;
ident = '\b[[:alpha:]_][[:alnum:]_]*\b';
expect the compiler to treat
(~(~(~(~X)*)*)*)*
at any depth as equivalent to(~X)*
I believe that this should be done, yes.
The human version would simply re-include main to avoid duplication:
include
isn't used because it would significantly complicate the code-generation logic. Using include
doesn't speed up the syntax in any way, so doing so would only have a small benefit from parsing the yaml. If the number of contexts get large it can also hurt readability to have includes all over the place - especially for the "inner" contexts like main|0
.