parser lexer
Very light on details here; I will flesh this out soon. In short, this is an attempt to replace our current parser with a two-stage lexer/parser (see this screencast, from which I have taken inspiration).
The first commit adds a lexer that can (maybe, probably not) lex any valid elm program. Moreover, it should never fail on any input (see the Invalid LexItem, into which I will stick anything that I cannot lex properly). I aim for the text => [ LexItems ] => text conversion to never be lossy.
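As a rough illustration of the catch-all idea (the variant names below are hypothetical stand-ins, not the actual LexItem definition, which is linked further down):

```elm
module LexSketch exposing (LexItem(..), toString)

{-| Sketch of a "never fail" lexer output type: anything the lexer
cannot classify is captured verbatim in `Invalid`.
-}


type LexItem
    = Token String
    | Whitespace String
    | Invalid String -- raw text the lexer could not classify


{-| Render lexed items back to source text. Because `Invalid` keeps the
raw text verbatim, the text => [ LexItems ] => text round trip is lossless.
-}
toString : List LexItem -> String
toString items =
    String.concat
        (List.map
            (\item ->
                case item of
                    Token s ->
                        s

                    Whitespace s ->
                        s

                    Invalid s ->
                        s
            )
            items
        )
```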
The only really interesting thing here at this point in time (and the reason I am putting this PR out) is the LexItem data structure:
https://github.com/elm-in-elm/compiler/blob/a2164213a49b51bbc72e5729db97ca692e583083/src/Stage/Parse/Lexer.elm#L8-L73
Thoughts on these custom types would be very much appreciated!
In my experience with elm/parser, the most painful thing is understanding what the state of the parser is when it fails. The parser

1. kindly spits out the location of the parse error,
2. gives some information about what it was trying to do when it failed,
3. but gives no information about its state when it fails.
If the lexer/parser approach is going to be superior to elm/parser then it must do as well on (1) and (2) (which are the error information that users will see) but also help with (3) (this information is key to debugging or hacking around with the compiler).
I think the reason elm/parser parsers are poor at (3) is that they are a wrapper around an opaque function. Functions are by their nature impossible to compare and hard to visualise. Therefore, when I design the parser part of the lexer/parser, I need to avoid the temptation to use function composition and lots of recursion. Instead I should try to craft some custom types which describe the state of the parser (what fraction of the AST it has produced so far) and compose those types. Then, if there is a problem, I can dump this fraction of the AST, which should help with debugging.
I think this is how I want to structure the parser part of the lexer/parser. ("Entity" is a word I assigned today to anything in an elm file consisting of a line with no indentation followed by zero or more indented lines --- for example a custom type declaration, a type annotation, a `module XX exposing (..)` line, etc.)
```elm
type State
    = EntityStart
    | EntityFirstItem EntityFirstItemType


type EntityFirstItemType
    = EntityFirstItemType
    | EntityFirstItemModule
    | EntityFirstItemName String
    | ...


type Error
    = InvalidStartToEntity LexItem
    | MisplacedKeyword Keyword


parser : List LexItem -> Result Error Ast
parser items =
    parserHelp items EntityStart


parserHelp : List LexItem -> State -> Result Error Ast
parserHelp items state =
    case items of
        item :: rest ->
            let
                newState =
                    case state of
                        EntityStart ->
                            parseEntityStart item

                        EntityFirstItem EntityFirstItemType ->
                            parseTypeEntity item

                        EntityFirstItem EntityFirstItemModule ->
                            parseModuleEntity item

                        EntityFirstItem (EntityFirstItemName name) ->
                            parseValueDeclarationEntity item
            in
            case newState of
                Ok newState_ ->
                    parserHelp rest newState_

                Err error ->
                    Err error

        [] ->
            Ok (astFromState state)


parseEntityStart : LexItem -> Result Error State
parseEntityStart item =
    case item of
        Lexer.Token str ->
            case toKeyword str of
                Just TypeKeyword ->
                    Ok (EntityFirstItem EntityFirstItemType)

                Just ModuleKeyword ->
                    Ok (EntityFirstItem EntityFirstItemModule)

                Just other ->
                    Err (MisplacedKeyword other)

                Nothing ->
                    Ok (EntityFirstItem (EntityFirstItemName str))

        _ ->
            Err (InvalidStartToEntity item)
```
Very interesting, thanks for prototyping that @harrysarson ! How do you feel about it so far? (Let's take this to the Discord if you want.)
I feel like we still have too little information to decide whether to use this approach or not, since the other part that will take these LexItems and produce an AST still needs to be written (if I'm correct), but so far I see no problems/blockers that would make me want to scratch this idea :slightly_smiling_face: :+1:
I feel good! The progress I have made so far definitely counts more as exploration than implementation. It may be that we decide to scrap this entirely, and I do not think that would be a bad outcome, because I already understand the problem scope a lot more than I did before. (For example, I am a big fan of these 3 aims for whatever solution we go for.)
No rush either, I definitely want to spend some time sitting on these ideas before committing myself to them.
I have run out of steam here for now. Closing in the hope that I one day soon cycle back to this and make more progress.
Thanks @harrysarson for the experiment! Yeah we can always circle back to it :)
I have circled back. I am very excited about this approach. Still very rough but I can parse type aliases that do not include records or tuples!
@janiczek as part of this PR I have experimented with auto-generating tests. I have a directory of elm syntax snippets and a node/elm program that generates test cases containing the source, the elm list of lexed items, and the parsed AST. Then I use each test case for a couple of tests. Pretty cool! Currently it relies on Debug.toString producing valid elm code given the correct glob imports. Even if the parser/lexer turns out to be a dead end, I think this testing method is valuable.
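To sketch what one generated test module might look like (this is a hypothetical shape: `Lexer.lex` and `Lexer.toString` are stand-in names for whatever the real lexer API ends up being, and the generator would additionally emit the expected lexed items and AST via `Debug.toString`):

```elm
module Generated.TypeAliasSnippetTest exposing (tests)

{-| One auto-generated test case: the snippet source plus a lossless
round-trip check against the lexer.
-}

import Expect
import Stage.Parse.Lexer as Lexer
import Test exposing (Test, describe, test)


-- The snippet source is copied in verbatim by the generator.
source : String
source =
    "type alias Id =\n    Int\n"


tests : Test
tests =
    describe "type alias snippet"
        [ test "lexing round-trips losslessly" <|
            \() ->
                Lexer.lex source
                    |> Lexer.toString
                    |> Expect.equal source
        ]
```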