JuliaSyntax.jl Where should the lexer live?

In https://github.com/JuliaLang/JuliaSyntax.jl/issues/31#issuecomment-1164175383, @pfitzseb said:

"Btw, we really should think about upstreaming the Tokenize changes in this repo... Pretty sure the opsuffix changes for &&/|| are implemented there."

We've chatted about this in various places and I mention it in the README. I'd like to resolve the double maintenance problem in some way, for sure :-)

But having modified Tokenize fairly extensively, I'm unsure whether the lexer should be versioned separately from the parser. Currently I see the lexer as serving the needs of parsing rather than as something independent. Particularly because:

- Lexing Julia correctly is impossible without keeping state. Worse, that state needs to be recursive for nested string interpolations. Other cases which need state or lookahead are prime (#25) and various contextual keywords like outer. It's possible to add state to the lexer itself, but that's annoyingly redundant. And the redundancy of state becomes much worse when you consider recovery from malformed string interpolations.
- JuliaSyntax can give you the disambiguated token stream in a flat format out of ParseStream. It's fairly lightweight, no need to opt into Expr (or other) tree building!
- Parsing+lexing is currently only about half as fast as pure lexing.
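
To illustrate the first point, here is a minimal sketch in Python of why nested string interpolation forces a lexer to carry a stack of modes. It is a toy written for this thread, not the Tokenize.jl or JuliaSyntax implementation: the `lex` function and its token names are hypothetical, only `"..."` strings and `$(...)` interpolation are handled, and escapes such as `\"` and `\$` are ignored.

```python
def lex(src):
    """Toy lexer (hypothetical): returns a flat list of (kind, text) tokens."""
    tokens = []
    # Each stack entry is [mode, depth]. mode is "code", "string" or "interp";
    # depth counts unclosed "(" inside a code or interpolation context.
    stack = [["code", 0]]
    i = 0
    while i < len(src):
        mode = stack[-1]
        c = src[i]
        if mode[0] == "string":
            if c == '"':
                tokens.append(("STRING_END", c))
                stack.pop()
                i += 1
            elif src.startswith("$(", i):
                tokens.append(("INTERP_BEGIN", "$("))
                stack.append(["interp", 0])
                i += 2
            else:
                # Consume literal text up to the next quote or "$(".
                j = i
                while j < len(src) and src[j] != '"' and not src.startswith("$(", j):
                    j += 1
                tokens.append(("STRING_TEXT", src[i:j]))
                i = j
        else:  # "code" or "interp"
            if c == '"':
                tokens.append(("STRING_BEGIN", c))
                stack.append(["string", 0])
                i += 1
            elif c == "(":
                mode[1] += 1
                tokens.append(("LPAREN", c))
                i += 1
            elif c == ")":
                if mode[0] == "interp" and mode[1] == 0:
                    # This ")" closes the interpolation, not a call or group.
                    tokens.append(("INTERP_END", c))
                    stack.pop()
                else:
                    mode[1] -= 1
                    tokens.append(("RPAREN", c))
                i += 1
            else:
                j = i
                while j < len(src) and src[j] not in '"()':
                    j += 1
                tokens.append(("CODE", src[i:j]))
                i = j
    return tokens


for kind, text in lex('"a $(f("b")) c"'):
    print(kind, repr(text))
```

Whether a given `)` or `"` ends a string, ends an interpolation, or is ordinary code depends entirely on the stack. That is the state a parser already tracks, which is why keeping a second copy of it in a standalone lexer is redundant.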

So with those in mind, I feel like we could just recommend that people use the full parser wherever we previously used Tokenize.jl? And that integrating the tokenizer source more tightly into JuliaSyntax might be best.

(Somewhat of a side note: I've also wondered whether we could do an Automa.jl-based lexer if we wanted to delve more deeply into performance optimization. I suspect a generated lexer would be a lot faster if Unicode decoding were folded into the state machine.)
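
As a rough illustration of what "folding decoding into the state machine" could mean, the Python sketch below splits raw UTF-8 bytes into runs without ever building code points: the number of pending continuation bytes is just another piece of lexer state. It is hand-written, not Automa.jl output, and it makes no claim about speed. The function name `lex_bytes` is hypothetical, treating every non-ASCII code point as an identifier character is a simplifying assumption, and overlong encodings and surrogates are not rejected.

```python
def lex_bytes(data: bytes):
    """Split UTF-8 bytes into ("IDENT" | "OTHER", bytes) runs without decoding."""
    tokens = []
    start = 0      # byte offset where the current run began
    kind = None    # kind of the current run
    pending = 0    # continuation bytes still expected for the current code point
    for i, b in enumerate(data):
        if pending:
            if b & 0xC0 != 0x80:
                raise ValueError("expected a continuation byte")
            pending -= 1
            continue  # still inside the same code point, hence the same run
        if b < 0x80:
            is_ident = chr(b).isalnum() or b == 0x5F  # ASCII letter, digit or "_"
            new_kind = "IDENT" if is_ident else "OTHER"
        elif b & 0xE0 == 0xC0:
            pending, new_kind = 1, "IDENT"
        elif b & 0xF0 == 0xE0:
            pending, new_kind = 2, "IDENT"
        elif b & 0xF8 == 0xF0:
            pending, new_kind = 3, "IDENT"
        else:
            raise ValueError("invalid lead byte")
        if new_kind != kind:
            if kind is not None:
                tokens.append((kind, data[start:i]))
            start, kind = i, new_kind
    if pending:
        raise ValueError("truncated sequence")
    if kind is not None:
        tokens.append((kind, data[start:]))
    return tokens


print(lex_bytes("αβ = x1".encode("utf-8")))
```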

For now, I'm content to port fixes back and forth as required.

What do people think? @pfitzseb @kristofferc ?

Jun 24 '22 04:06 c42f

I'm OK with soft deprecating Tokenize for full JuliaSyntax.

Jun 24 '22 06:06 KristofferC

Same. The current situation (CSTParser being used, JuliaSyntax not so much) is a bit annoying, but the solution to that is to kill off CSTParser too :)

Jun 24 '22 07:06 pfitzseb

I've started a big refactor to integrate Tokenize better into JuliaSyntax and delete a lot of the shim code I had to put in.

Aug 07 '22 20:08 c42f

"I suspect a generated lexer would be a lot faster if unicode decoding were folded into the state machine."

https://gershnik.github.io/2021/03/24/reverse-utf8-decoding.html And code: [Unlicense] https://github.com/skeeto/branchless-utf8

Aug 18 '22 07:08 inkydragon

That reverse UTF-8 decoding is interesting :+1:

I'd try Automa.jl first if I thought we needed really fast lexing as it's very Julia-native and the devs are receptive to the JuliaSyntax.jl use case. But I think we've got bigger fish to fry right now :)

Aug 19 '22 07:08 c42f

When I last checked, state machines like the one at https://bjoern.hoehrmann.de/utf-8/decoder/dfa/ were slower than what we already have in Base.

Aug 19 '22 08:08 KristofferC

I think we've resolved the way forward with this and a big chunk of the rearrangement work was done in #40. Let's close this issue and work slowly toward more integration as necessary.

Aug 23 '22 11:08 c42f
