A lexer for the next-generation KMN compiler, based on a modified port of Fowler's Java regex lexer (see Domain-Specific Languages by Martin Fowler with Rebecca Parsons, Ch. 20).

See Next Generation KMN Compiler #13349
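
For readers unfamiliar with the regex-table approach referred to above, here is a minimal sketch of the general technique in TypeScript. It is not the code in this PR: the token names, the patterns and the `lex` function are hypothetical and cover only a tiny fragment of KMN. The idea is an ordered table of (token type, regex) pairs; at each offset the first pattern that matches wins.

```typescript
// Illustrative only: token types and patterns are assumptions, not the PR's real table.
type TokenType =
  | "STORE" | "IDENT" | "STRING" | "LEFT_BR" | "RIGHT_BR"
  | "WHITESPACE" | "NEWLINE" | "EOF";

interface Token {
  type: TokenType;
  text: string;
}

interface ScanRecognizer {
  type: TokenType;
  regex: RegExp; // sticky ("y") so it can only match at lastIndex
}

// Order matters: keywords must come before the more general IDENT pattern.
const recognizers: ScanRecognizer[] = [
  { type: "STORE",      regex: /store(?![a-z0-9_])/iy },
  { type: "IDENT",      regex: /[A-Za-z_][A-Za-z0-9_]*/y },
  { type: "STRING",     regex: /'[^'\r\n]*'|"[^"\r\n]*"/y },
  { type: "LEFT_BR",    regex: /\(/y },
  { type: "RIGHT_BR",   regex: /\)/y },
  { type: "NEWLINE",    regex: /\r\n|\n|\r/y },
  { type: "WHITESPACE", regex: /[^\S\r\n]+/y },
];

function lex(source: string): Token[] {
  const tokens: Token[] = [];
  let offset = 0;
  while (offset < source.length) {
    let matched = false;
    for (const recognizer of recognizers) {
      recognizer.regex.lastIndex = offset; // anchor the attempt at the current offset
      const match = recognizer.regex.exec(source);
      if (match !== null && match[0].length > 0) {
        tokens.push({ type: recognizer.type, text: match[0] });
        offset += match[0].length;
        matched = true;
        break; // first matching recognizer wins
      }
    }
    if (!matched) {
      throw new Error(`no token matches at offset ${offset}`);
    }
  }
  tokens.push({ type: "EOF", text: "" });
  return tokens;
}
```

In this sketch a lexing failure simply throws; an error callback, as listed in the to-do items below, would be one alternative.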

@keymanapp-test-bot skip

To do list

[ ] Add tokens for remaining keywords/symbols, including:
- [x] ALWAYS
- [ ] BITMAPS
- [x] CAPS
- [x] COPYRIGHT
- [x] FREES
- [x] OFF
- [x] OLDCHARPOSMATCHING
- [x] ON
- [x] ONLY
- [x] SHIFT
- [x] Named constants
- [x] Compile targets
- [x] Hangeul syllables
- [x] decimal character codes
- [x] hex character codes
- [x] octal character codes
- [x] CLEARCONTEXT
- [x] FIX
- [x] BITMAP (header)
- [x] COPYRIGHT (header)
- [x] HOTKEY (header)
- [x] LANGUAGE (header)
- [x] LAYOUT
- [x] MESSAGE (header)
- [x] NAME (header)
- [x] VERSION (header)
[ ] Add errors for compile targets, decimal, hex and octal character codes
[ ] Add error callback for lexing failure
[ ] Add warnings for deprecated and downlevel tokens

Questions

Whitespace and comments are currently emitted as tokens to allow careful control in the syntax analyser. This is at odds with many compiler designs, which discard whitespace and comments in the lexer, but it is believed necessary for Keyman language LSP support. Is this right? [made switchable]
Source lines are currently captured in the token._line field of NEWLINE and lexer-generated EOF tokens. This gives the parser access to the original source code in the AST via the included tokens, allowing round-trip recreation of the .kmn source file. Is this right/necessary? [yes]
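
To make the two points above concrete, here is a rough TypeScript sketch. Everything in it is an assumption made for illustration: the `emitTrivia` option, the `line` field (standing in for `token._line`), the choice to include the line terminator in the stored line, and the very simplified token rules. It shows whitespace and comment tokens being switched on or off, and the source being rebuilt purely from the lines stored on NEWLINE and EOF tokens.

```typescript
// Illustrative only: names and rules are hypothetical, not the PR's implementation.
type TokenType = "WORD" | "WHITESPACE" | "COMMENT" | "NEWLINE" | "EOF";

interface Token {
  type: TokenType;
  text: string;
  line: string | null; // whole source line; set only on NEWLINE and EOF tokens
}

interface LexOptions {
  emitTrivia: boolean; // when false, WHITESPACE and COMMENT tokens are dropped
}

// Simplified rules: a comment is "c" followed by a space or tab, running to end of line.
const rules: { type: TokenType; regex: RegExp }[] = [
  { type: "NEWLINE",    regex: /\r\n|\n|\r/y },
  { type: "WHITESPACE", regex: /[^\S\r\n]+/y },
  { type: "COMMENT",    regex: /c[^\S\r\n][^\r\n]*/y },
  { type: "WORD",       regex: /\S+/y },
];

function lex(source: string, options: LexOptions): Token[] {
  const tokens: Token[] = [];
  let offset = 0;
  let lineStart = 0; // offset at which the current source line began
  while (offset < source.length) {
    for (const rule of rules) {
      rule.regex.lastIndex = offset;
      const match = rule.regex.exec(source);
      if (match === null) continue;
      offset += match[0].length;
      const isTrivia = rule.type === "WHITESPACE" || rule.type === "COMMENT";
      if (rule.type === "NEWLINE") {
        // Assumption: the stored line includes its terminator.
        tokens.push({ type: "NEWLINE", text: match[0], line: source.slice(lineStart, offset) });
        lineStart = offset;
      } else if (!isTrivia || options.emitTrivia) {
        tokens.push({ type: rule.type, text: match[0], line: null });
      }
      break; // every character is covered by one of the rules above
    }
  }
  // The final (possibly empty) unterminated line travels on the EOF token.
  tokens.push({ type: "EOF", text: "", line: source.slice(lineStart) });
  return tokens;
}

// Rebuild the original text from the stored lines alone.
function recreateSource(tokens: Token[]): string {
  return tokens.map((t) => t.line ?? "").join("");
}
```

Because the stored lines are independent of which tokens are emitted, the round trip works in this sketch whether or not whitespace and comment tokens are switched on.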

Feb 25 '25 16:02 markcsinclair

User Test Results

Test specification and instructions

User tests are not required

Test Artifacts

Developer
- Keyman Developer - build : all tests passed (no artifacts on BuildLevel "build")
- Compiler Regression Tests - build : all tests passed (no artifacts on BuildLevel "build")
- Keyman Developer (old PRs) - build : all tests passed (no artifacts on BuildLevel "build")
- kmcomp.zip - build : all tests passed (no artifacts on BuildLevel "build")
- kmcomp.zip (old PRs) - build : all tests passed (no artifacts on BuildLevel "build")
Keyboards
- Test Keyboards - build : all tests passed (no artifacts on BuildLevel "build")

Feb 25 '25 16:02 keymanapp-test-bot[bot]

Hmm ... merging master seems to have been a mistake sigh

Apr 14 '25 19:04 markcsinclair

> Hmm ... merging master seems to have been a mistake sigh

The general process we use here is to merge master into the epic with a separate maintenance PR, then you can either rebase the child PR or merge the epic PR into this one. I am about to do a maintenance PR for the epic, and that should resolve the history on this PR once it all lands.

Apr 22 '25 06:04 mcdurdin

Ready for review. Errors/warnings will be added later. I will write up some notes on the lexer in Next Generation KMN Compiler shortly.

Aug 15 '25 10:08 markcsinclair

Have started review; it's a big PR so a fair bit to work through!

Sep 01 '25 12:09 mcdurdin

Just a note: the basic keyboard parser in the Keyman Developer IDE was written in Delphi a long time after the compiler. It makes a different set of assumptions about the validity of certain constructs (e.g. #14604). At this point, the compiler's assumptions are our source-of-truth, but it may be helpful to be aware of this. (Looking at the assumptions in KeyboardParser.pas, I think that they may be somewhat 'nicer' in many ways, but we are where we are!)

Sep 11 '25 11:09 mcdurdin

Added doc hyperlink for most TokenTypes (omitting e.g. COMMA)

Nov 07 '25 15:11 markcsinclair

Okay, I think that's all the review comments addressed - it was a bit hard to track due to the outdated code/refactor/filename change effects.

Nov 10 '25 11:11 markcsinclair

feat(developer): next generation KMN compiler - lexer 🤔
