sbnf Some patterns generate unexpected redundant code

For some recursive or mutually recursive rules, the resulting code has redundancies that a human-written version wouldn't have. Unfortunately I'm too inexperienced to tell whether it's SBNF or my rules.

Consider this over-simplified example:

main = `[` (~main)* ~`]`;

SBNF generates the following contexts:

contexts:
  # Rule: main
  main:
    - match: '\['
      set: main|0
    - match: '\S'
      scope: invalid.illegal.mock
  # Rule: main
  main|0:
    - match: '\['
      push: main|0
    - match: '\]'
      pop: true

The human version would simply re-include main to avoid duplication:

contexts:
  main:
    - match: '\['
      push: pop-block
  pop-block:
    - match: '\]'
      pop: true
    - include: main

This becomes a big deal in syntaxes with many mutually recursive patterns. I just translated a work-in-progress LLVM syntax, whose .sublime-syntax version has 169 total lines, and the SBNF output has 1211 lines because many rules got duplicated inline instead of using include. Can provide the code if needed.

To better demonstrate duplication, consider this hypothetical over-simplified syntax for S-expressions (untested):

main = (~symbol|dot|list)*;
symbol = '[^\s().]+';
dot = `.`;
list = `(` (~main)* ~`)`;

The SBNF output clearly has duplication we don't want:

contexts:
  # Rule: list
  list|0:
    - match: '[^\s().]+'
      push: main
    - match: '\.'
      push: main
    - match: '\('
      push: [main, list|0]
    - match: '\)'
      pop: true
  # Rule: main
  main:
    - match: '[^\s().]+'
    - match: '\.'
    - match: '\('
      push: list|0

I hope I'm just doing this wrong. 🙂 Suggestions are appreciated. That said, even if the compiler already supports the "right" ways of doing this, it's too easy to write it the "wrong" way that still seems to work but generates inferior output. We might want to figure out how to safeguard the users from this.

Oct 27 '20 06:10 mitranim

Just found that this can be reproduced with a simpler pattern. Logging just in case.

main = `[` main* `]`;

This isn't equivalent to the example in the original post, since this emits an additional pattern for '\S' → illegal.

Oct 27 '20 06:10 mitranim

Realized that including (~main)* defined as (~a|b|c)* is equivalent to (~(~a|b|c)*)*. The writer can alleviate this with separate definitions:

main = (~any)*;
any = parens|braces|ident;
parens = `(` (~any)* ~`)`;
braces = `{` (~any)* ~`}`;
ident = '\b[[:alpha:]_][[:alnum:]_]*\b';

Currently this changes the generated output, making it slightly simpler compared to this:

main = (~parens|braces|ident)*;
parens = `(` (~main)* ~`)`;
braces = `{` (~main)* ~`}`;
ident = '\b[[:alpha:]_][[:alnum:]_]*\b';

Intuitively, I just want to write the second one, and expect the compiler to treat (~(~(~(~X)*)*)*)* at any depth as equivalent to (~X)*, but I might also be misunderstanding what they mean...

Oct 29 '20 06:10 mitranim

Also, trying to write the following gives a NYI panic:

main = (~parens|braces|ident)*;
parens = `(` main ~`)`;
braces = `{` main ~`}`;
ident = '\b[[:alpha:]_][[:alnum:]_]*\b';

main = (~parens|braces|ident)*;
parens = `(` main `)`;
braces = `{` main `}`;
ident = '\b[[:alpha:]_][[:alnum:]_]*\b';

Oct 29 '20 06:10 mitranim

expect the compiler to treat (~(~(~(~X)*)*)*)* at any depth as equivalent to (~X)*

I believe that this should be done, yes.

Feb 02 '21 22:02 FichteFoll

The human version would simply re-include main to avoid duplication:

include isn't used because it would significantly complicate the code-generation logic. Using include doesn't speed up the syntax in any way, so doing so would only have a small benefit from parsing the yaml. If the number of contexts get large it can also hurt readability to have includes all over the place - especially for the "inner" contexts like main|0.

Jun 25 '21 12:06 BenjaminSchaaf

sbnf sbnf copied to clipboard

Some patterns generate unexpected redundant code

sbnf
sbnf copied to clipboard