
Stable v2 release (API changes)

alecthomas opened this issue 5 years ago • 9 comments

Now that Participle has proven its initial concept, I think it's time to clean up the API. This will be a backwards incompatible change.

Work has started in the v1 branch.

  • [x] Consolidate on Stateful lexer (1444519b50b541162cd4c2f34b36f037a06f0fed)
  • [x] Optimise performance of the lexer (#111)
  • [x] Make specifying filename explicit. This removes confusion and ambiguity. (cf6162a6162b85732482ecec972cf1b793c1c80c)
  • [x] Get rid of unquoting hacks in text/scanner lexer. (4f53af941491a38b02c4ddd89c478c60529f1a8c)
  • [x] Clean up error functions. (895f942ebdd09acb65769727864b36160eed35b0)
  • [x] Eliminate internal unquoting and single quote munging from text/scanner based lexer. (4f53af941491a38b02c4ddd89c478c60529f1a8c)
  • [x] Extend the concept of Pos/EndPos to support capturing the full range of tokens the node matched, including Elide()ed tokens (see the sketch after this list). (2ace05e38cbb7b034da7dfb53cf22f893dcd3d89)
  • [x] Refactor Mapper to eliminate the need for DropToken. (f82f61571f509811ffa0b970ab6f1ff49e5179d5)
  • [x] Capture directly into fields of type lexer.Token and []lexer.Token. (3b1f1514b50f2defd71f4197d8b44b654348274f)
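
To make the Pos/EndPos item above concrete, here is a minimal sketch, assuming participle's convention that untagged fields named Pos and EndPos of type lexer.Position are populated automatically (the grammar itself is illustrative):

```go
package grammar

import "github.com/alecthomas/participle/lexer"

// Block is an illustrative node. Pos and EndPos carry no capture tags;
// participle fills them with the start and end of the source range the
// node matched which, per the item above, now covers the full token
// range including Elide()ed tokens.
type Block struct {
	Pos    lexer.Position
	Idents []string `"{" @Ident* "}"`
	EndPos lexer.Position
}
```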

Maybe:

  • [ ] Extend participle.Elide() support so that elided tokens can be captured explicitly by name (but also see next point).
  • [ ] Support streaming tokens from an io.Reader - currently the full input text is read.
    • [ ] Refactor PeekingLexer so it doesn't consume all tokens up front.

Once the API is stable, some additional changes would be welcome:

  • [ ] Optimise the parser.
  • [x] Code generation for lexing (e2b420f4e9a6e6dd07ffac14bc02c4692aaff423).
  • [ ] Code generation for parsing.
  • [ ] Improve error reporting.
  • [ ] Error tolerant parsing.
  • [ ] LSP support? Can this be generalised?
  • [ ] Generate syntax definition files for TextMate etc.?!

Regarding streaming, I'm not convinced this is worth the considerable extra complexity it will add to the implementation. For comparison, pigeon also does not support streaming.

Additionally, to support capturing raw tokens into the AST, participle would potentially need to buffer all tokens anyway, effectively eliminating the usefulness of streaming. It would also vastly increase the complexity of the lexers, requiring three input paths (io.Reader, string and []byte), a rework of PeekingLexer, etc.

Most of this increased complexity comes from lookahead branching: the lexer would need an implementation similar to the rewinder RuneReader code (https://play.golang.org/p/uZQySClYrxR). For each branch, the state of the lexer has to be saved, and as the branch progresses it must also preserve any newly buffered tokens so that, if the branch is not accepted, the parent remains consistent.
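
To illustrate the idea, here is a minimal sketch of a rewinding RuneReader in the spirit of the linked playground code (the type and method names are illustrative, not participle's API): runes are buffered so a caller can take a checkpoint, read ahead speculatively, and rewind if the branch is rejected.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// rewinder buffers every rune it reads so that earlier positions can be
// revisited after a Rewind.
type rewinder struct {
	r      *bufio.Reader
	buf    []rune // runes read so far that a live checkpoint may still need
	cursor int    // next read position within buf
}

// ReadRune replays buffered runes first, then falls through to the
// underlying reader, buffering anything new.
func (rw *rewinder) ReadRune() (rune, int, error) {
	if rw.cursor < len(rw.buf) {
		r := rw.buf[rw.cursor]
		rw.cursor++
		return r, len(string(r)), nil
	}
	r, size, err := rw.r.ReadRune()
	if err != nil {
		return 0, 0, err
	}
	rw.buf = append(rw.buf, r)
	rw.cursor++
	return r, size, nil
}

// Checkpoint captures the current position; Rewind restores it, making
// everything read since the checkpoint available again.
func (rw *rewinder) Checkpoint() int { return rw.cursor }
func (rw *rewinder) Rewind(cp int)   { rw.cursor = cp }

func main() {
	rw := &rewinder{r: bufio.NewReader(strings.NewReader("hello"))}
	cp := rw.Checkpoint()
	a, _, _ := rw.ReadRune() // speculative read: 'h'
	rw.Rewind(cp)            // branch rejected, rewind
	b, _, _ := rw.ReadRune() // 'h' again
	fmt.Println(string(a), string(b))
}
```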

There's also a non-trivial amount of overhead introduced for reading each token, as opposed to the current PeekingLexer which is just an array index.
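
For contrast, a minimal sketch (not participle's actual type) of why the current approach is cheap: all tokens are lexed up front into a slice, so reading and peeking are just an index into that slice, and saving or restoring lexer state for a branch is a copy of a single integer.

```go
package sketch

import "github.com/alecthomas/participle/lexer"

// peekingLexer is an illustrative stand-in for a pre-tokenised lexer.
// Bounds checks are omitted for brevity.
type peekingLexer struct {
	tokens []lexer.Token // all tokens, lexed up front
	cursor int
}

// Next returns the token at the cursor and advances: an index increment.
func (p *peekingLexer) Next() lexer.Token {
	t := p.tokens[p.cursor]
	p.cursor++
	return t
}

// Peek returns the next token without consuming it.
func (p *peekingLexer) Peek() lexer.Token { return p.tokens[p.cursor] }

// Branching is equally cheap: checkpoint and rewind are one integer each.
func (p *peekingLexer) checkpoint() int { return p.cursor }
func (p *peekingLexer) rewind(cp int)   { p.cursor = cp }
```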

alecthomas avatar Sep 07 '20 11:09 alecthomas

Alright, feature request time

  • Find a way to get the original text of a match
  • Allow Tokens to be requested even though they're marked as elided
  • Have Lexer work over a Reader (with buffering) to allow for parsing huge files

ceymard avatar Sep 07 '20 11:09 ceymard

@ceymard Perhaps this could be done better inside participle, but currently we use an io.TeeReader before passing input into the participle parser to keep the original text (see the sketch after these links). We use it to construct error reports and source mappings:

  • https://github.com/openllb/hlb/blob/master/parser/filebuffer.go
  • https://github.com/openllb/hlb/blob/master/parse.go#L66-L67
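
A hedged sketch of that workaround: everything the parser reads from src is also copied into buf, so the original source text is available afterwards for error reporting and source mapping. parseFile here is a hypothetical stand-in for whatever consumes the reader (e.g. a participle Parse call), not a real API.

```go
package main

import (
	"bytes"
	"io"
	"os"
)

// parseWithSource tees the input so the raw text survives parsing.
func parseWithSource(src io.Reader) (string, error) {
	var buf bytes.Buffer
	tee := io.TeeReader(src, &buf)
	err := parseFile(tee)
	// buf now holds exactly the bytes the parser consumed, which can be
	// used to render errors with surrounding source context.
	return buf.String(), err
}

// parseFile is a placeholder for the real parse.
func parseFile(r io.Reader) error {
	_, err := io.Copy(io.Discard, r)
	return err
}

func main() {
	f, err := os.Open("example.hlb")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if _, err := parseWithSource(f); err != nil {
		panic(err)
	}
}
```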

hinshun avatar Sep 08 '20 03:09 hinshun

@hinshun I'm doing something similar at the moment; I just wish for a way to get a match easily, without having to resort to that kind of trick.

ceymard avatar Sep 08 '20 07:09 ceymard

This functionality is now included natively. Any node with a field Tokens []lexer.Token will now be populated with the full set of tokens used to parse that node. There's an example in the tests here.
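
A minimal sketch of what that looks like (the grammar here is illustrative, not the one from the tests):

```go
package grammar

import "github.com/alecthomas/participle/lexer"

// Assignment is an illustrative node. The Tokens field carries no capture
// tag; participle populates it with the full set of tokens used to parse
// this node.
type Assignment struct {
	Key   string `@Ident "="`
	Value string `@Ident`

	Tokens []lexer.Token
}
```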

alecthomas avatar Sep 20 '20 05:09 alecthomas

You can also now capture directly into a field of type lexer.Token rather than string (for example).
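
For example, a sketch (field names and rules here are illustrative): a lexer.Token field keeps the matched token's type and position as well as its value, and []lexer.Token accumulates repeated captures.

```go
package grammar

import "github.com/alecthomas/participle/lexer"

// Value captures tokens directly rather than as strings.
type Value struct {
	First lexer.Token   `@Ident`  // single token, with Pos/Type/Value intact
	Rest  []lexer.Token `@Ident*` // repeated captures accumulate tokens
}
```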

alecthomas avatar Sep 20 '20 22:09 alecthomas

Do the Tokens include the ones from nested structs if the nested structs also have Tokens []lexer.Token?

hinshun avatar Sep 21 '20 04:09 hinshun

Yes they do.

alecthomas avatar Sep 21 '20 06:09 alecthomas

Do they include the elided ones as well ?

ceymard avatar Sep 21 '20 06:09 ceymard

Yep!

alecthomas avatar Sep 21 '20 07:09 alecthomas