JuliaSyntax.jl
Where should the lexer live?
In https://github.com/JuliaLang/JuliaSyntax.jl/issues/31#issuecomment-1164175383, @pfitzseb said
> Btw, we really should think about upstreaming the Tokenize changes in this repo... Pretty sure the op suffix changes for `&&`/`||` are implemented there.
We've chatted about this in various places and I mention it in the README. I'd like to resolve the double maintenance problem in some way, for sure :-)
But having modified Tokenize fairly extensively, I'm unsure whether the lexer should be versioned separately from the parser. Currently I see the lexer as serving the needs of parsing rather than something which is independent. Particularly because
- Lexing Julia correctly is impossible without keeping state. Worse, that state needs to be recursive for nested string interpolations. Other cases which need state or lookahead are prime (#25) and various contextual keywords like `outer`. It's possible to add state to the lexer itself, but that's annoyingly redundant. And the redundancy of state becomes much worse when you consider recovery from malformed string interpolations.
- JuliaSyntax can give you the disambiguated token stream in a flat format out of `ParseStream`. It's fairly lightweight; no need to opt into `Expr` (or other) tree building!
- Parsing+lexing is currently only about half as fast as pure lexing.
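To make the first point concrete, here's a minimal sketch (in Python, purely illustrative; this is not JuliaSyntax's or Tokenize's actual lexer) of the recursive state a standalone lexer needs just to get nested string interpolation right. The token kinds and the stack representation are invented for the example:

```python
def lex_string_nesting(src: str):
    """Illustrative sketch of lexing nested string interpolation like
    "a $(f("b $(c)")) d".  A stack records whether we are currently
    inside a string literal or inside an interpolation's parentheses;
    without this recursive state the lexer cannot tell a closing '"'
    or ')' apart from ordinary characters."""
    tokens = []
    stack = []   # entries: ("string",) or ("interp", open_paren_depth)
    i = 0
    start = 0

    def emit(kind, j):
        nonlocal start
        if j > start:
            tokens.append((kind, src[start:j]))
        start = j

    while i < len(src):
        ch = src[i]
        in_string = bool(stack) and stack[-1][0] == "string"
        if in_string:
            if ch == '"':                      # terminates this string level
                emit("string-chunk", i)
                stack.pop()
                i += 1
                emit("quote", i)
            elif src[i:i + 2] == '$(':         # enter a code context
                emit("string-chunk", i)
                stack.append(("interp", 0))
                i += 2
                emit("interp-open", i)
            else:
                i += 1
        else:
            if ch == '"':                      # enter a (possibly nested) string
                emit("code-chunk", i)
                stack.append(("string",))
                i += 1
                emit("quote", i)
            elif ch == '(' and stack:
                stack[-1] = ("interp", stack[-1][1] + 1)
                i += 1
            elif ch == ')' and stack and stack[-1][0] == "interp":
                if stack[-1][1] == 0:          # closes the interpolation itself
                    emit("code-chunk", i)
                    stack.pop()
                    i += 1
                    emit("interp-close", i)
                else:
                    stack[-1] = ("interp", stack[-1][1] - 1)
                    i += 1
            else:
                i += 1
    emit("string-chunk" if stack else "code-chunk", len(src))
    return tokens
```

The parser already maintains exactly this kind of nesting context for free, which is part of why duplicating it inside a standalone lexer feels redundant, and why recovery from malformed interpolations makes the duplication even worse.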
So with those in mind, I feel like we could just recommend people use the full parser for purposes we previously used Tokenize.jl for? And that more tightly integrating the tokenizer source into JuliaSyntax might be best.
(Somewhat of a side note: I've also wondered whether we could do an Automa.jl-based lexer if we wanted to delve more deeply into performance optimization. I suspect a generated lexer would be a lot faster if Unicode decoding were folded into the state machine.)
For now, I'm content to port fixes back and forth as required.
What do people think? @pfitzseb @kristofferc ?
I'm OK with soft deprecating Tokenize for full JuliaSyntax.
Same. The current situation (CSTParser being used, JuliaSyntax not so much) is a bit annoying, but the solution to that is to kill off CSTParser too :)
I've started a big refactor to integrate Tokenize better into JuliaSyntax and delete a lot of the shim code I had to put in.
> I suspect a generated lexer would be a lot faster if unicode decoding were folded into the state machine.
https://gershnik.github.io/2021/03/24/reverse-utf8-decoding.html And code: [Unlicense] https://github.com/skeeto/branchless-utf8
That reverse UTF-8 decoding is interesting :+1:
I'd try Automa.jl first if I thought we needed really fast lexing as it's very Julia-native and the devs are receptive to the JuliaSyntax.jl use case. But I think we've got bigger fish to fry right now :)
When I checked last, state machines like the one in https://bjoern.hoehrmann.de/utf-8/decoder/dfa/ were slower than what we already have in Base.
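For anyone unfamiliar with the approach being compared: here's a minimal sketch of decoding UTF-8 with an explicit per-byte state machine. It's a simplified cousin of the Hoehrmann DFA linked above, not its actual packed-table implementation, and it omits full overlong/surrogate validation for brevity:

```python
def utf8_decode_statemachine(data: bytes):
    """Decode UTF-8 byte-by-byte with explicit state: `need` is the
    number of continuation bytes still expected, `cp` the codepoint
    being accumulated.  A table-driven DFA (Hoehrmann style) replaces
    the branches below with state-transition lookups, which is what
    "folding decoding into the lexer's state machine" would mean.
    Raises ValueError on malformed input; overlong 3/4-byte forms and
    surrogates are NOT rejected here (a real decoder must do so)."""
    out = []
    need = 0
    cp = 0
    for b in data:
        if need == 0:
            if b < 0x80:                  # ASCII fast path
                out.append(b)
            elif 0xC2 <= b <= 0xDF:       # 2-byte lead
                need, cp = 1, b & 0x1F
            elif 0xE0 <= b <= 0xEF:       # 3-byte lead
                need, cp = 2, b & 0x0F
            elif 0xF0 <= b <= 0xF4:       # 4-byte lead
                need, cp = 3, b & 0x07
            else:
                raise ValueError("invalid leading byte")
        else:
            if b & 0xC0 != 0x80:
                raise ValueError("invalid continuation byte")
            cp = (cp << 6) | (b & 0x3F)
            need -= 1
            if need == 0:
                out.append(cp)
    if need:
        raise ValueError("truncated sequence")
    return out
```

The appeal for a generated lexer is that these transitions could be fused with the lexer's own token-boundary transitions, so each input byte is examined once rather than decoded and then classified.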
I think we've resolved the way forward with this and a big chunk of the rearrangement work was done in #40. Let's close this issue and work slowly toward more integration as necessary.