parser lexer
Very light on details here; I will flesh this out soon. In short, this is an attempt to replace our current parser with a two-stage lexer/parser (see this screencast, from which I have taken inspiration).
The first commit adds a lexer that can (maybe, probably not) lex any valid elm program. Moreover, it should never fail on any input (see the Invalid LexItem, into which I will stick anything that I cannot lex properly). I aim for the text => [ LexItems ] => text conversion to never be lossy.
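As a rough illustration of the catch-all idea (the variant names below are hypothetical stand-ins, not the actual LexItem definition, which is linked further down):

```elm
module LexSketch exposing (LexItem(..), toString)

{-| Sketch of a "never fail" lexer output type: anything the lexer
cannot classify is captured verbatim in `Invalid`.
-}


type LexItem
    = Token String
    | Whitespace String
    | Invalid String -- raw text the lexer could not classify


{-| Render lexed items back to source text. Because `Invalid` keeps the
raw text verbatim, the text => [ LexItems ] => text round trip is lossless.
-}
toString : List LexItem -> String
toString items =
    String.concat
        (List.map
            (\item ->
                case item of
                    Token s ->
                        s

                    Whitespace s ->
                        s

                    Invalid s ->
                        s
            )
            items
        )
```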
The only really interesting thing here at this point in time (and the reason I am putting this PR out) is the LexItem data structure:
https://github.com/elm-in-elm/compiler/blob/a2164213a49b51bbc72e5729db97ca692e583083/src/Stage/Parse/Lexer.elm#L8-L73
Thoughts on these custom types would be very much appreciated!
In my experience with elm/parser, the most painful thing is understanding what the state of the parser is when it fails. The parser

1. kindly spits out the location of the parse error,
2. gives some information about what it was trying to do when it failed,
3. but gives no information about its state when it fails.
If the lexer/parser approach is going to be superior to elm/parser then it must do as well on (1) and (2) (which are the error information that users will see) but also help with (3) (this information is key to debugging or hacking around with the compiler).
I think the reason elm/parser parsers are poor at (3) is that they are a wrapper around an opaque function. Functions are by their nature impossible to compare and hard to visualise. Therefore, when I design the parser part of the lexer/parser, I need to avoid the temptation to use function composition and lots of recursion. Instead I should try to craft some custom types which describe the state of the parser (what fraction of the AST it has produced so far) and compose those types. Then, if there is a problem, I can dump this fraction of the AST, which should help with debugging.
I think this is how I want to structure the parser part of the lexer/parser. ("Entity" is a word I assigned today to anything in an elm file consisting of a line with no indentation followed by zero or more indented lines --- for example a custom type declaration, a type annotation, a `module XX exposing (..)` line, etc.)
```elm
type State
    = EntityStart
    | EntityFirstItem EntityFirstItemType


type EntityFirstItemType
    = EntityFirstItemType
    | EntityFirstItemModule
    | EntityFirstItemName String
    | ...


type Error
    = InvalidStartToEntity LexItem
    | MisplacedKeyword Keyword


parser : List LexItem -> Result Error Ast
parser items =
    parserHelp items EntityStart


parserHelp : List LexItem -> State -> Result Error Ast
parserHelp items state =
    case items of
        item :: rest ->
            let
                newState =
                    case state of
                        EntityStart ->
                            parseEntityStart item

                        EntityFirstItem EntityFirstItemType ->
                            parseTypeEntity item

                        EntityFirstItem EntityFirstItemModule ->
                            parseModuleEntity item

                        EntityFirstItem (EntityFirstItemName name) ->
                            parseValueDeclarationEntity item
            in
            case newState of
                Ok newState_ ->
                    parserHelp rest newState_

                Err error ->
                    Err error

        [] ->
            Ok (astFromState state)


parseEntityStart : LexItem -> Result Error State
parseEntityStart item =
    case item of
        Lexer.Token str ->
            case toKeyword str of
                Just TypeKeyword ->
                    Ok (EntityFirstItem EntityFirstItemType)

                Just ModuleKeyword ->
                    Ok (EntityFirstItem EntityFirstItemModule)

                Just other ->
                    Err (MisplacedKeyword other)

                Nothing ->
                    Ok (EntityFirstItem (EntityFirstItemName str))

        _ ->
            Err (InvalidStartToEntity item)
```
Very interesting, thanks for prototyping that @harrysarson ! How do you feel about it so far? (Let's take this to the Discord if you want.)
I feel like we still have too little information to decide whether to use this approach or not, since the other part that will take these LexItems and produce an AST still needs to be written (if I'm correct), but so far I see no problems/blockers that would make me want to scratch this idea :slightly_smiling_face: :+1:
I feel good! The progress I have made so far definitely counts more as exploration than implementation. It may be that we decide to scrap this entirely, and I do not think that would be a bad outcome, because I already understand the problem scope a lot more than I did before. (For example, I am a big fan of these 3 aims for whatever solution we go for.)
No rush either, I definitely want to spend some time sitting on these ideas before committing myself to them.
I have run out of steam here for now. Closing in the hope that I one day soon cycle back to this and make more progress.
Thanks @harrysarson for the experiment! Yeah we can always circle back to it :)
I have circled back. I am very excited about this approach. Still very rough but I can parse type aliases that do not include records or tuples!
@janiczek as part of this PR I have experimented with auto-generating tests. I have a directory of elm syntax snippets and a node/elm program that generates test cases containing the source, the elm list of lexed items, and the parsed AST. Then I use each test case for a couple of tests. Pretty cool! Currently it relies on Debug.toString producing valid elm code given the correct glob imports. Even if the parser/lexer turns out to be a dead end, I think this testing method is valuable.
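To sketch what one generated test module might look like (this is a hypothetical shape: `Lexer.lex` and `Lexer.toString` are stand-in names for whatever the real lexer API ends up being, and the generator would additionally emit the expected lexed items and AST via `Debug.toString`):

```elm
module Generated.TypeAliasSnippetTest exposing (tests)

{-| One auto-generated test case: the snippet source plus a lossless
round-trip check against the lexer.
-}

import Expect
import Stage.Parse.Lexer as Lexer
import Test exposing (Test, describe, test)


-- The snippet source is copied in verbatim by the generator.
source : String
source =
    "type alias Id =\n    Int\n"


tests : Test
tests =
    describe "type alias snippet"
        [ test "lexing round-trips losslessly" <|
            \() ->
                Lexer.lex source
                    |> Lexer.toString
                    |> Expect.equal source
        ]
```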