
Make token parser customizable

Open • Kha opened this pull request 2 years ago • 10 comments

In theory, users can implement arbitrary lexical grammars, since the built-in token parser is merely another combinator whose use can be avoided; in practice, avoiding it also means forgoing the existing infrastructure around token caching as well as the parsers built on top of tokenFn, such as symbol, which is implicitly used by syntax.

There are other nice-to-haves we have talked about before, such as per-category token tables, but this PR implements the simplest change that addresses the issues mentioned above: it makes the implementation of tokenFn customizable so that people can plug in their own tokenization without losing access to all the infrastructure built on top of it.

Implementation-wise, the change mostly consists of further splitting ParserContextCore/ParserFn to prevent non-wellfounded type recursion, and of moving the Lean language's tokenizer into a new module Lean.Parser.LeanToken to distinguish it from the language-generic parsers in Basic (which is too big anyway). Making it easier to adapt these Lean token functions when implementing a custom token parser is left as future work.
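
To make the non-wellfounded-recursion point concrete, here is a minimal standalone sketch; all of these names are hypothetical and are not the actual core definitions. The idea is that the function type refers only to the "core" part of the context, so the extended context can then carry a replaceable token function:

  -- Hypothetical illustration; none of these names exist in Lean core.
  structure TokenContextCore where
    input : String
    pos   : String.Pos

  -- The token function type mentions only the core context ...
  abbrev TokenFn := TokenContextCore → Option (String × String.Pos)

  -- ... so the full context can bundle a customizable token function
  -- without running into non-wellfounded type recursion.
  structure TokenContext extends TokenContextCore where
    tokenFn : TokenFn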

The change is validated by a simple test that changes the whitespace/comment syntax of the existing Lean language, as well as by an incomplete implementation of the venerable Whitespace language.
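
For a flavor of the whitespace/comment customization exercised by the test, here is a hypothetical, self-contained scanner that skips whitespace and C-style // line comments instead of Lean's -- comments; it does not use the real Lean.Parser API:

  -- Hypothetical sketch only; not the real Lean.Parser whitespace function.
  partial def skipToEol (s : String) (i : String.Pos) : String.Pos :=
    if s.atEnd i || s.get i == '\n' then i
    else skipToEol s (s.next i)

  -- Skip whitespace and C-style "//" line comments instead of Lean's "--".
  partial def skipCStyleWs (s : String) (i : String.Pos) : String.Pos :=
    if s.atEnd i then i
    else if (s.get i).isWhitespace then
      skipCStyleWs s (s.next i)
    else if s.get i == '/' && s.get (s.next i) == '/' then
      skipCStyleWs s (skipToEol s (s.next (s.next i)))
    else i

  #eval (skipCStyleWs "  // a comment\n  x" ⟨0⟩).byteIdx  -- 17, the position of 'x'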

Kha (Jun 27 '23)

!bench

Kha (Jun 28 '23)

Here are the benchmark results for commit 1650f946bcfc4fca991451f8a082b9c913cb042d. There were significant changes against commit bb8cc08de85f3daf67c6fb3feda245681c4b56fd:

  Benchmark          Metric             Change
  =======================================================
+ stdlib             tactic execution    -2.8% (-483.8 σ)
+ workspaceSymbols   branch-misses       -3.7%  (-16.7 σ)
- workspaceSymbols   task-clock           3.7%   (20.3 σ)
- workspaceSymbols   wall-clock           3.7%   (20.3 σ)

leanprover-bot (Jun 28 '23)

I have now benchmarked the PR on https://github.com/Kha/C-parsing-for-Lean4/tree/custom-tokens as well, before applying any changes to the actual parsing. Performance improves slightly, which doesn't really make sense and is probably a linker artifact, but it is at least further evidence that parsing performance is not negatively affected.

$ lake build; nix run nixpkgs#hyperfine -- --warmup 1 ./build/bin/cParser
Benchmark 1: ./build/bin/cParser
  Time (mean ± σ):      1.233 s ±  0.008 s    [User: 1.139 s, System: 0.095 s]
  Range (min … max):    1.224 s …  1.251 s    10 runs
$ git checkout custom-tokens; lake build; nix run nixpkgs#hyperfine -- --warmup 1 ./build/bin/cParser
Benchmark 1: ./build/bin/cParser
  Time (mean ± σ):      1.187 s ±  0.007 s    [User: 1.102 s, System: 0.087 s]
  Range (min … max):    1.178 s …  1.198 s    10 runs

Kha (Jun 29 '23)

Just to note some alternatives: instead of making the token parser function dynamically replaceable, we could also "simply" demand that embedded languages call their own token parser function directly. This would, however, involve duplicating all the little functions that build on top of the token parser, and syntax would also somehow have to be taught to use a different symbol function. Making the switch to a different token parser explicit via the parser context also means we can reliably enforce that the token cache is cleared at that point, as is done in this PR. Furthermore, we could tie token parsers to syntax categories such that they are automatically activated when the category is entered (much like we might want to do with per-category token tables?). That is something that can be done on top of this PR, however.
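
A rough, purely hypothetical sketch of the per-category idea (none of these names exist in Lean core): a table mapping a syntax category to its token function, consulted when the category is entered, with a fallback to the default token function:

  import Lean

  -- Purely hypothetical sketch; none of these names exist in Lean core.
  abbrev CatTokenFn := String → String.Pos → Option (String × String.Pos)

  structure CategoryTokenTable where
    defaultFn   : CatTokenFn
    perCategory : List (Lean.Name × CatTokenFn) := []

  -- Token function to use when entering the syntax category `cat`.
  def CategoryTokenTable.find (t : CategoryTokenTable) (cat : Lean.Name) : CatTokenFn :=
    (t.perCategory.lookup cat).getD t.defaultFn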

Kha (Jul 19 '23)

Let's discuss this PR in the next meeting.

leodemoura (Jul 28 '23)

!bench

Kha (Aug 08 '23)

Here are the benchmark results for commit a1138ef1d0a874439f3fb44d3142c0472eb2ce32. There were significant changes against commit bb738796ae13e1fff6dec921b1a26dae3a040498:

  Benchmark   Metric         Change
  ===========================================
- parser      branches         1.8% (197.6 σ)
- parser      instructions     1.9% (275.7 σ)

leanprover-bot (Aug 08 '23)

@Kha I noticed that you removed the whitespace customization. While it may not really be used in the Lean core, I think it will be very useful for DSLs, particularly for those that are meant to function both in an embedded context and standalone. For example, in Alloy, when parsing C code in a Lean file, using Lean-style comments is ideal. However, when parsing a C file itself, using C-style comments is necessary. Being able to use the same grammar for both would be amazing and would be enabled by the whitespace configuration.

EDIT: Oh, upon further examination of the changes, I noticed that the Lean token functions never made use of the context's whitespaceFn. Well, it would be nice if they did. 😁

tydeu (Oct 13 '23)

Existing token functions can simply be parameterized over the whitespace parser if desired; the parser framework doesn't have to be changed for that.
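
For illustration, a standalone sketch of what that parameterization could look like (toy types, not the real Lean.Parser API): the same token reader is instantiated once per comment style.

  -- Hypothetical toy sketch; not the real Lean.Parser API.
  abbrev WsFn          := String → String.Pos → String.Pos
  abbrev SimpleTokenFn := String → String.Pos → Option (String × String.Pos)

  -- Advance while the character satisfies p.
  partial def takeWhilePos (p : Char → Bool) (s : String) (i : String.Pos) : String.Pos :=
    if s.atEnd i || !p (s.get i) then i else takeWhilePos p s (s.next i)

  -- Read one identifier-like token after skipping whatever `ws` skips.
  def mkIdentTokenFn (ws : WsFn) : SimpleTokenFn := fun s i =>
    let start := ws s i
    let stop  := takeWhilePos Char.isAlphanum s start
    if start.byteIdx == stop.byteIdx then none
    else some (s.extract start stop, stop)

  -- The same grammar could then be instantiated with either comment style, e.g.
  -- def leanStyleIdent := mkIdentTokenFn skipLeanStyleWs
  -- def cStyleIdent    := mkIdentTokenFn skipCStyleWs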

Kha (Oct 13 '23)

@Kha

Existing token functions can simply be parameterized over the whitespace parser if desired,

In my case, that would still require a separate grammar for C w/ Lean comments and C w/ C comments, right? (Since in such a setup the whitespace function has to be determined at the syntax level rather than at parser runtime. The only other solution I can see would be to parameterize the entire grammar, which would prohibit the use of syntax.)

tydeu (Oct 13 '23)