feat(developer): next generation KMN compiler - lexer 🤔
A lexer for the next generation KMN compiler, based on a modified port of Fowler's Java regex lexer (see Domain Specific Languages by Martin Fowler and Rebecca Parsons, Ch. 20).
See Next Generation KMN Compiler #13349
@keymanapp-test-bot skip
To do list
- [ ] Add tokens for remaining keywords/symbols, including:
  - [x] ALWAYS
  - [ ] BITMAPS
  - [x] CAPS
  - [x] COPYRIGHT
  - [x] FREES
  - [x] OFF
  - [x] OLDCHARPOSMATCHING
  - [x] ON
  - [x] ONLY
  - [x] SHIFT
  - [x] Named constants
  - [x] Compile targets
  - [x] Hangeul syllables
  - [x] decimal character codes
  - [x] hex character codes
  - [x] octal character codes
  - [x] CLEARCONTEXT
  - [x] FIX
  - [x] BITMAP (header)
  - [x] COPYRIGHT (header)
  - [x] HOTKEY (header)
  - [x] LANGUAGE (header)
  - [x] LAYOUT
  - [x] MESSAGE (header)
  - [x] NAME (header)
  - [x] VERSION (header)
- [ ] Add errors for compile targets, decimal, hex and octal character codes
- [ ] Add error callback for lexing failure
- [ ] Add warnings for deprecated and downlevel tokens
Questions
- Whitespace and Comments: these are currently emitted as tokens to allow for careful control in the syntax analyser. Many language compiler designs discard whitespace and comments in the lexer, but emitting them is believed to be necessary for Keyman language LSP support. Is this right? [made switchable]
- Source Lines: these are currently captured in the `token._line` field of `NEWLINE` and lexer-generated `EOF` Tokens. This has been done to give the parser access to the original source code in the AST via the included Tokens, allowing round-trip recreation of the `.kmn` source file. Is this right/necessary? [yes]
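To make the two questions above concrete, here is a minimal sketch of an ordered regex-rule lexer with switchable whitespace/comment emission and per-line source capture. It is not the code in this PR: the names `TokenType`, `Token`, `RULES`, `lex`, `roundTrip` and the `emitTrivia` flag are hypothetical, the rule table covers only a tiny fragment of KMN, and it assumes the captured line includes its line terminator.

```typescript
// Hypothetical sketch; none of these names are taken from the PR's code.
type TokenType =
  | "STORE" | "LEFT_BR" | "RIGHT_BR" | "STRING" | "IDENT"
  | "WHITESPACE" | "COMMENT" | "NEWLINE" | "EOF";

interface Token {
  type: TokenType;
  text: string;
  line?: string; // full source line; set only on NEWLINE and EOF tokens
}

// Ordered rule table: the first rule that matches at the current position wins.
const RULES: Array<[TokenType, RegExp]> = [
  ["COMMENT", /^c[ \t][^\r\n]*/],            // simplified KMN comment
  ["STORE", /^store(?![A-Za-z0-9_])/i],
  ["LEFT_BR", /^\(/],
  ["RIGHT_BR", /^\)/],
  ["STRING", /^'[^'\r\n]*'|^"[^"\r\n]*"/],
  ["NEWLINE", /^\r?\n/],
  ["WHITESPACE", /^[ \t]+/],
  ["IDENT", /^[^\s()'"]+/],                  // catch-all for this sketch
];

function lex(source: string, emitTrivia: boolean = true): Token[] {
  const tokens: Token[] = [];
  let pos = 0;
  let lineStart = 0;
  while (pos < source.length) {
    const rest = source.slice(pos);
    let matched = false;
    for (const [type, regex] of RULES) {
      const m = regex.exec(rest);
      if (m === null) continue;
      const token: Token = { type, text: m[0] };
      pos += m[0].length;
      if (type === "NEWLINE") {
        // Assumption: the captured line keeps its terminator.
        token.line = source.slice(lineStart, pos);
        lineStart = pos;
      }
      // Whitespace and comments are dropped only when the switch is off.
      if (emitTrivia || (type !== "WHITESPACE" && type !== "COMMENT")) {
        tokens.push(token);
      }
      matched = true;
      break;
    }
    if (!matched) throw new Error(`No rule matches at offset ${pos}`);
  }
  // The lexer-generated EOF token carries any final unterminated line.
  tokens.push({ type: "EOF", text: "", line: source.slice(lineStart) });
  return tokens;
}

// Rebuild the source purely from the lines stored on NEWLINE/EOF tokens.
function roundTrip(tokens: Token[]): string {
  return tokens.map((t) => t.line ?? "").join("");
}

const source = "c sample\nstore(&NAME) 'Demo'\n";
const allTokens = lex(source);
const significantTokens = lex(source, false);
```

In this sketch `roundTrip` recovers the original text from either token list, because the line text travels on the `NEWLINE` and `EOF` tokens rather than on the whitespace and comment tokens. That is one way the two design choices can coexist: a parser that turns trivia off can still recreate the source file.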
User Test Results
Test specification and instructions
User tests are not required
Test Artifacts
- Developer
  - Keyman Developer - build : all tests passed (no artifacts on BuildLevel "build")
  - Compiler Regression Tests - build : all tests passed (no artifacts on BuildLevel "build")
  - Keyman Developer (old PRs) - build : all tests passed (no artifacts on BuildLevel "build")
  - kmcomp.zip - build : all tests passed (no artifacts on BuildLevel "build")
  - kmcomp.zip (old PRs) - build : all tests passed (no artifacts on BuildLevel "build")
- Keyboards
  - Test Keyboards - build : all tests passed (no artifacts on BuildLevel "build")
Hmm ... merging master seems to have been a mistake sigh
> Hmm ... merging master seems to have been a mistake sigh
The general process we use here is to merge master into the epic with a separate maintenance PR, then you can either rebase the child PR or merge the epic PR into this one. I am about to do a maintenance PR for the epic, and that should resolve the history on this PR once it all lands.
Ready for review. Errors/warnings will be added later. I will write up some notes on the lexer in Next Generation KMN Compiler shortly.
Have started review; it's a big PR so a fair bit to work through!
Just a note: the basic keyboard parser in the Keyman Developer IDE was written in Delphi a long time after the compiler. It makes a different set of assumptions about the validity of certain constructs (e.g. #14604). At this point, the compiler's assumptions are our source-of-truth, but it may be helpful to be aware of this. (Looking at the assumptions in KeyboardParser.pas, I think that they may be somewhat 'nicer' in many ways, but we are where we are!)
Added doc hyperlink for most TokenTypes (omitting e.g. COMMA)
Okay, I think that's all the review comments addressed. It was a bit hard to track because of the outdated code, the refactor, and the filename changes.