New UCS desugarer and rudimentary `PreTyper`
TL;DR
This PR adds new UCS desugarer and a rudimentary implementation of PreTyper.
Features
Desugarer
The new UCS desugarer replaces the old desugarer. The facilities old desugarer were also removed. The new desugarer has following facilities.
- Two Sets of Syntax. We use two sets of syntax tree nodes: source abstract syntax and core abstract syntax. Both of them are implement according to our paper precisely in package
mlscript.ucs.syntax. - Stage Classes. The new desugarer consists of five stages, which correspond five classes in package
mlscript.ucs.stages. The stages are the following.Transformation: transforming parsed AST to source abstract syntax.Desugaring: generating let bindings and expand patterns.Normalization: removing backtracking and convert core abstract syntax to case branches.PostProcessing: merging case branches with the same scrutinee.CoverageChecking: detecting unreachable cases and duplicated cases.
- Scrutinee Management. We tracks useful information of each scrutinee's patterns by
Context,ScrutineeandPattern.- Each
Ifterm is associated with a uniqueContext. AContextkeeps track of all scrutinee and is in charge of allocating temporary variable names. Scrutineerepresents a term that is being matched in UCS expressions. It keeps the tally of all patterns the scrutinee has been matched against as well as occurrences and sub-scrutinees of each pattern.
- Each
We support two kinds of new patterns in this PR.
- Simple Tuple Pattern is a syntax sugar for selecting fields. It cannot be used together with other patterns.
- Alias Pattern allows people name the entire term represented by a pattern.
PreTyper
PreTypertraverses the AST and associate each variable with symbol it refers to.- A Hierarchy of
Symbols. Each symbol class maintains certain information useful to desugarer. Scopemaintains an association from variables to symbols.
DiffTests
- Local flag
:ducscontrols the newDesugarerright now. It supports sub-flags.:ducs:[stage]where[stage]is one oftransform,desugar,normalize,postprocess,coveragewill display the debug messages from the specified stage.:ducs:[stage].resultshows the pretty-printed outcome of the specified stage.:ducs:desugaredshows the desugared MLscript term.:ducs:A,B,Cshows debug messages for allA,B, andC.
- Global flag
:ShowPreTyperErrorsemits warnings and errors reported byPreTyper.
Others
- The utility trait
Traceableabstracts debugging methods used inTyper, includingtrace,println, etc. - Implement
freeVarsforLocated. The original implementation is broken. But my implementation is not perfect either. - Support parsing
_ then ...andelseinIfOpsApp. - Fix code generation for case expression of
Num,Int,Bool, andStr. - Fix display of escaped characters in
StrLit. - Add utility methods
foldLeftandforeachtoCaseBranchesImpl.
Future Works
This PR does not implement all features, but it lays the foundation for some of the following features.
- Branch deduplication. Currently, we duplicate branches during normalization, which generates many code in some examples (e.g., JSON parser). We can reduce branches by abstracting duplicated branches into local functions.
- Better tuple patterns. As mentioned before, we only implement "simple" tuple patterns. It is not real tuple patterns because the underlying MLscript
CaseOfdoesn't support tuples right now. However, it paves the road for complicated tuple patterns even tuple splices.
Besides, we can still improve location reporting in many places.
🥱 I merged (rather than rebase) new-definition-typing because there are just too many commits. The good news is that the changed test files are all related to UCS now.
Is it on purpose that there are several deleted files in the "migration" commit? https://github.com/hkust-taco/mlscript/pull/194/commits/d6bf3a38642274f7384f78d5a53ac1a69491a6d2
Is it on purpose that there are several deleted files in the "migration" commit? d6bf3a3
Yes, that's on purpose. Actually, there are only two examples, each with two files, which are the ones with NewParser and NewDefs activated respectively. One example had already been migrated in a previous commit, so I deleted the old files.
So can shared/src/test/diff/pretyper/Errors.mls be removed now? It says:
// This file is used to test the error messages of the _unfinished_ `PreTyper`.
// The goal is to document `PreTyper` behavior and to make sure that the error
// messages are clear and helpful. We can delete this file after it is done.
Or is the pretyper still not being used by default?
The PreTyper is being used now. But the error messages and warnings it generates will be hidden, because they are redundant with the information generated by Typer. The purpose of retaining this file is to allow us to observe its changes when improving the PreTyper in the next PR.
It seems there's something wrong in the handling of refined with multiple branches. I just hit this:
module None
class Some[out A](val value: A)
//│ module None
//│ class Some[A](value: A)
// OK
:ducs:desugar.result,postprocess.result
x => if x is
refined(None) then x
//│ Desugared UCS term:
//│ if x*‡ is refined None then x
//│ Post-processed UCS term:
//│ case x*‡ of
//│ refined None*† -> x
//│ forall 'a. (None & 'a) -> (None & 'a)
//│ res
//│ = [Function: res]
// OK
:ducs:desugar.result,postprocess.result
x => if x is
refined(Some) then x
//│ Desugared UCS term:
//│ if x*‡ is refined Some then x
//│ Post-processed UCS term:
//│ case x*‡ of
//│ refined Some*◊ -> x
//│ forall 'a. (Some[anything] & 'a) -> (Some[anything] & 'a)
//│ res
//│ = [Function: res]
// NOT OK
:ducs:desugar.result,postprocess.result
x => if x is
refined(None) then x
refined(Some) then x
//│ Desugared UCS term:
//│ if
//│ x*‡ is refined None then x
//│ x*‡ is refined Some then x
//│ Post-processed UCS term:
//│ case x*‡ of
//│ refined None*† -> x
//│ Some*◊ -> x
//│ (None | Some[anything]) -> Some[nothing]
//│ res
//│ = [Function: res]
I'm fixing not being able to report some unreachable cases and conflicting cases. I'm handling the situation we discussed before: some duplicated branches can also trigger these checks. The solution I take is to mark the duplicated syntax tree nodes, so we will not trigger warnings if these nodes are detected.
The solution I take is to mark the duplicated syntax tree nodes, so we will not trigger warnings if these nodes are detected.
That's a good enough short-term solution, but after this PR is merged we need to think of a more principled and correct way of handling this!
That's a good enough short-term solution,
This solution is more limited than I thought. I expected it to be able to detect all the duplicated cases. Because a split might be concatenated as a fallback in the specialization for different scrutinees, a Boolean flag is not enough. I tried to turn this boolean flag into a map, but I need to modify the map in many places. I haven't been able to make it work well until now, and the code will also become very messy.
I feel it can figure out most of the duplicated cases, but it can't detect some of the covered cases, such as the one below.
fun f(x) = if
x is A and
x is B then 1
x is B then 2
but after this PR is merged we need to think of a more principled and correct way of handling this!
I think a neater approach is to check on the desugared (rather than normalized) syntax tree. Although we can know which branches are removed during specialization, we don't know if they are removed due to duplication and which branches they overlap with because they need non-local information.
Lastly, I've also improved the warning message for different duplicated cases at the same time. There're three types of duplication.
-
Simple duplication (mainly caused by identical patterns)
``` :w fun f(x) = if x is "a" then 1 "b" then 2 "a" then 3 //│ ╔══[WARNING] found a duplicated case //│ ║ l.10: "a" then 3 //│ ║ ^^^ //│ ╟── there is an identical pattern "a" //│ ║ l.8: "a" then 1 //│ ╙── ^^^ //│ fun f: ("a" | "b") -> (1 | 2) ``` -
Duplication caused by inheritance
``` :w fun f(x) = if x is A then 1 x is B then 2 //│ ╔══[WARNING] found a duplicated case //│ ║ l.27: x is B then 2 //│ ║ ^ //│ ╟── the case is covered by pattern A //│ ║ l.26: x is A then 1 //│ ║ ^ //│ ╟── due to the subtyping relation //│ ║ l.7: class B() extends A() //│ ╙── ^^^^^^^^^^^^^^^^^^^^^ //│ fun f: A -> 1 ``` -
Always-true pattern (mainly due to subtyping)
``` :w fun f(x) = if x is B and x is A then 1 p(x) then 2 x is A then 31 x is B then 3 else 4 //│ ╔══[WARNING] the pattern always matches //│ ║ l.67: x is A then 1 //│ ║ ^ //│ ╟── the scrutinee was matched against B //│ ║ l.66: x is B and //│ ║ ^ //│ ╟── which is a subtype of A //│ ║ l.4: class B() extends A() //│ ╙── ^^^^^^^^^^^^^^^^^^^^^ //│ fun f: (B | Object & ~#B) -> (1 | 31 | 4) ``` -
Always-false pattern (mainly due to matching a scrutinee against unrelated constructors)
:w fun f(x) = if x is X and x is Y and x is Z then 1 else 2 //│ ╔══[WARNING] possibly conflicted patterns //│ ║ l.105: x is Y and //│ ║ ^ //│ ╟── the scrutinee was matched against X //│ ║ l.104: x is X and //│ ║ ^ //│ ╟── which is unrelated with Y //│ ║ l.105: x is Y and //│ ╙── ^ //│ fun f: Object -> 2
I think a neater approach is to check on the desugared (rather than normalized) syntax tree. Although we can know which branches are removed during specialization, we don't know if they are removed due to duplication and which branches they overlap with because they need non-local information.
Makes sense. This can go in a future PR!
It seems there's something wrong in the handling of
refinedwith multiple branches.
https://github.com/hkust-taco/mlscript/pull/194#issuecomment-1951634075
BTW was this fixed or at least reproduced in the tests?
BTW was this fixed or at least reproduced in the tests?
Yes, this was fixed.