Haskell Parser
As it currently stands, there are 17 open issues against the Haskell parser, and counting. Here are my thoughts on resolving them -
- [ ] We have two lexers, and both probably deserve a rewrite.
- The syntax highlighting lexer is a rough copy-paste of the parsing lexer, so it can probably be simplified greatly.
- The parsing lexer probably needs a complete rethink, starting with whitespace-sensitive indentation, which seems to be the cause of the trickiest and most longstanding parser bugs.
- [ ] The parser code is written in BNF notation using Grammar Kit, and I'm not sure how sustainable this is. Debugging the parser is extremely difficult since we're dealing with generated code, and regenerating the parser after making changes creates horrible diffs since we can't generate the parser code in CI.
- [ ] It's highly likely that a handwritten parser targeting the IntelliJ parsing API will be the best solution here, making it simpler to account for the special cases and to drop into a debugger when things go wrong. Whitespace-sensitive indentation will be much simpler to encode and handle manually than to hack into Grammar Kit (although the lexer is probably partly to blame for this as well).
- [ ] Once we get the parser in a better state, we will then need to reimplement the features which relied on the old parser (reference resolution, autocompletion, etc.). Note that by rewriting the parser, we can probably leverage the PSI much better and thus make a lot of these features more natural to implement.
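To make the handwritten-parser idea concrete, here's a toy Haskell sketch of plain recursive descent over a token stream producing a tree. All names here (`Tok`, `Node`, `parseDecl`, etc.) are invented for illustration; a real IntelliJ `PsiParser` would drive `PsiBuilder` markers instead of returning a tree directly, but the control flow is similar, and this style is trivially easy to step through in a debugger.

```haskell
-- Toy sketch (invented names, not IntelliJ code) of a hand-written
-- recursive-descent parser producing a tree.
data Tok = TVar String | TEquals | TInt Int
  deriving (Eq, Show)

data Node = Node String [Node]   -- node type and children
          | Leaf Tok
  deriving (Eq, Show)

-- Parse a single "decl" of the form: var '=' int
parseDecl :: [Tok] -> Maybe (Node, [Tok])
parseDecl (TVar v : TEquals : TInt n : rest) =
  Just (Node "DECL" [Leaf (TVar v), Leaf TEquals, Leaf (TInt n)], rest)
parseDecl _ = Nothing

-- Parse as many decls as possible into a "MODULE" node, stopping at
-- the first token sequence that doesn't form a decl.
parseModule :: [Tok] -> Node
parseModule = Node "MODULE" . go
  where
    go ts = case parseDecl ts of
      Just (n, rest) -> n : go rest
      Nothing        -> []
```

Each special case in the grammar then becomes an ordinary function we can unit test in isolation, rather than a BNF rule buried in generated code.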
This will be a big undertaking, but once complete I think we'll see a much better product. Once this is done, I plan on releasing it as 0.4-rc1. I will start work on this after completing #169.
I was thinking about whether haskell-src-exts might be a viable alternative to implementing the parser ourselves. Note that we attempted this before with an external tool, parser-helper, which serialized the haskell-src-exts AST to JSON. However, I see a few problems with this.
- IntelliJ uses the lexer/parser internally, so it may be difficult (and probably a hack) to use an external parser and then somehow teach IntelliJ to connect source text locations with the nodes returned from the external parser.
- We would need to create an RPC server (or the like) which consumes text, parses it using haskell-src-exts, serializes the result to binary, and returns the binary-encoded data. We'd then need to represent all of the data types in the plugin, probably as Java objects (which lack the correctness guarantees of Scala types), so the data could be deserialized. This may or may not be less performant than a proper parser implemented directly in IntelliJ.
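On the first point: haskell-src-exts reports 1-based (line, column) spans, while an IDE works in absolute text offsets, so the plugin would need a mapping between the two. A minimal sketch of that glue (all names invented for illustration, and assuming plain `\n` line endings):

```haskell
-- Sketch: map 1-based (line, column) positions, as reported by
-- haskell-src-exts, to 0-based absolute character offsets in the
-- source text. Assumes "\n" line endings (one character each).

-- Absolute offset of the start of each line.
lineOffsets :: String -> [Int]
lineOffsets src = scanl (\off line -> off + length line + 1) 0 (lines src)

-- Convert a 1-based (line, col) pair to a 0-based absolute offset.
toOffset :: String -> (Int, Int) -> Int
toOffset src (line, col) = lineOffsets src !! (line - 1) + (col - 1)
```

The mapping itself is easy; the hack is convincing IntelliJ's incremental machinery to trust externally produced spans that may lag behind the editor's current text.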
From what I recall, the problem with parser-helper was performance. That might have been improved by implementing it as an RPC server and using a binary serialization protocol instead of JSON (point 2), but the HaskellParser2 class also somehow had to connect our internal lexer with haskell-src-exts' AST (point 1). This seems pretty complicated and error prone.
While I'd prefer to build on existing work (haskell-src-exts), it seems that in this case implementing the parser ourselves may still be the best option.
Here's a decent overview of the IntelliJ PSI parsing API -
http://www.jetbrains.org/intellij/sdk/docs/reference_guide/custom_language_support/implementing_parser_and_psi.html
In talking with @rahulmutt, it sounds like it may soon be possible to compile Alex with Eta. This might make it possible to implement an IntelliJ lexer based on GHC's implementation.
If that proves workable, then writing an IntelliJ PsiParser by hand on top of that lexer would probably be ideal, particularly since most parser bugs appear to be problems with the layout the lexer produces. Here's a recent one https://github.com/carymrobbins/intellij-haskforce/issues/333 and its fix https://github.com/carymrobbins/intellij-haskforce/pull/334.
A big reason to prefer writing the parser by hand is that it must support error recovery. GHC's parser does not recover, which would prevent any sort of source analysis for an invalid source file. That's terrible when developing code, and all modern IDEs deal with it by implementing a recovering parser (just try out Java or Scala support in IntelliJ).
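The recovery idea itself is simple: when a declaration fails to parse, swallow tokens up to the next plausible declaration start, record them under an error node, and keep going, so everything after the mistake still gets analyzed. A minimal sketch (token and node names invented for illustration):

```haskell
-- Sketch of parser error recovery: on a malformed declaration, skip
-- at least one token, then everything up to the next plausible
-- declaration start, recording the skipped tokens as an Error node.
data Tok = TVar String | TEquals | TInt Int | TJunk String
  deriving (Eq, Show)

data Node = Decl String Int       -- a well-formed "var = int" binding
          | Error [Tok]           -- tokens skipped while recovering
  deriving (Eq, Show)

-- A declaration plausibly starts at a variable token.
isDeclStart :: Tok -> Bool
isDeclStart (TVar _) = True
isDeclStart _        = False

parseDecls :: [Tok] -> [Node]
parseDecls [] = []
parseDecls (TVar v : TEquals : TInt n : rest) =
  Decl v n : parseDecls rest
parseDecls ts =
  -- Recovery: always consume at least one token so we make progress,
  -- then resynchronize at the next declaration start.
  let (bad, rest) = break isDeclStart (drop 1 ts)
  in Error (take 1 ts ++ bad) : parseDecls rest
```

This is exactly the shape IntelliJ expects: error nodes become red squiggles, while the well-formed declarations around them still participate in resolve and completion.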
It's probably a good idea to review the layout rules as described in the Haskell 2010 report -
https://www.haskell.org/onlinereport/haskell2010/haskellch10.html#x17-17800010.3
Playing with haskell-src-exts' lexer, it seems it doesn't actually perform layout (i.e. there are no indent/dedent tokens), which leads me to believe layout is handled entirely by the parser using the source positions reported by the lexer. Given the details and rules in the Haskell 2010 report, this seems like a potentially reasonable approach for us as well. However, it means that integrating HSE's lexer, while possibly useful, won't by itself solve our layout problems. Those will have to be solved by hand or by patching an existing parser to support recovery.
λ :m + Language.Haskell.Exts.Lexer
λ lexTokenStream "{-# LANGUAGE TemplateHaskell, QuasiQuotes #-}\nmodule Main where\nmain = do\n putStrLn $(foo)\n where\n foo = \"bam\""
ParseOk
[ Loc { loc = SrcSpan "<unknown>.hs" 1 1 1 13 , unLoc = LANGUAGE }
, Loc
{ loc = SrcSpan "<unknown>.hs" 1 14 1 29
, unLoc = ConId "TemplateHaskell"
}
, Loc { loc = SrcSpan "<unknown>.hs" 1 29 1 30 , unLoc = Comma }
, Loc
{ loc = SrcSpan "<unknown>.hs" 1 31 1 42
, unLoc = ConId "QuasiQuotes"
}
, Loc
{ loc = SrcSpan "<unknown>.hs" 1 43 1 46 , unLoc = PragmaEnd }
, Loc { loc = SrcSpan "<unknown>.hs" 2 1 2 7 , unLoc = KW_Module }
, Loc
{ loc = SrcSpan "<unknown>.hs" 2 8 2 12 , unLoc = ConId "Main" }
, Loc { loc = SrcSpan "<unknown>.hs" 2 13 2 18 , unLoc = KW_Where }
, Loc
{ loc = SrcSpan "<unknown>.hs" 3 1 3 5 , unLoc = VarId "main" }
, Loc { loc = SrcSpan "<unknown>.hs" 3 6 3 7 , unLoc = Equals }
, Loc { loc = SrcSpan "<unknown>.hs" 3 8 3 10 , unLoc = KW_Do }
, Loc
{ loc = SrcSpan "<unknown>.hs" 4 3 4 11
, unLoc = VarId "putStrLn"
}
, Loc
{ loc = SrcSpan "<unknown>.hs" 4 12 4 13 , unLoc = VarSym "$" }
, Loc
{ loc = SrcSpan "<unknown>.hs" 4 13 4 14 , unLoc = LeftParen }
, Loc
{ loc = SrcSpan "<unknown>.hs" 4 14 4 17 , unLoc = VarId "foo" }
, Loc
{ loc = SrcSpan "<unknown>.hs" 4 17 4 18 , unLoc = RightParen }
, Loc { loc = SrcSpan "<unknown>.hs" 5 3 5 8 , unLoc = KW_Where }
, Loc
{ loc = SrcSpan "<unknown>.hs" 6 3 6 6 , unLoc = VarId "foo" }
, Loc { loc = SrcSpan "<unknown>.hs" 6 7 6 8 , unLoc = Equals }
, Loc
{ loc = SrcSpan "<unknown>.hs" 6 9 6 14
, unLoc = StringTok ( "bam" , "bam" )
}
]
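Since every token above carries a SrcSpan, position-annotated lexemes are enough input for the report's layout function. Here's a heavily simplified sketch of that algorithm (report section 10.3) which synthesizes virtual `{`, `;`, and `}` tokens from (line, column) annotations; it ignores explicit braces, the module-header rule, and the parse-error(t) recovery case, and all names are invented for illustration:

```haskell
-- Heavily simplified sketch of the Haskell 2010 layout algorithm:
-- insert virtual "{", ";", "}" tokens based on line/column positions.
-- Omits explicit braces, the module-header rule, and parse-error(t).
type Lexeme = (String, Int, Int)   -- (token text, line, column)

layoutKeywords :: [String]
layoutKeywords = ["where", "do", "let", "of"]

layout :: [Lexeme] -> [String]
layout = go [] 0 False
  where
    -- ctxs: stack of open layout context columns
    -- prevLine: line of the previously emitted lexeme
    -- open: did the previous lexeme begin a layout block?
    go ctxs _ True  [] = "{" : "}" : map (const "}") ctxs
    go ctxs _ False [] = map (const "}") ctxs
    go ctxs _ True ((t, l, c) : rest) =
      -- First token after a layout keyword opens a new context.
      "{" : t : go (c : ctxs) l (t `elem` layoutKeywords) rest
    go ctxs prevLine False ((t, l, c) : rest)
      | l > prevLine = newline ctxs
      | otherwise    = t : go ctxs l (t `elem` layoutKeywords) rest
      where
        newline (m : ms)
          | c < m  = "}" : go ms prevLine False ((t, l, c) : rest)
          | c == m = ";" : t : go (m : ms) l (t `elem` layoutKeywords) rest
        -- Deeper indentation (or an empty stack) continues the item.
        newline ms = t : go ms l (t `elem` layoutKeywords) rest
```

The report's full algorithm also needs feedback from the parser (the parse-error(t) side condition), which is precisely the part a recovering hand-written parser is well placed to provide.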