mlscript New UCS desugarer and rudimentary `PreTyper`

TL;DR

This PR adds new UCS desugarer and a rudimentary implementation of PreTyper.

Features

Desugarer

The new UCS desugarer replaces the old desugarer. The facilities old desugarer were also removed. The new desugarer has following facilities.

Two Sets of Syntax. We use two sets of syntax tree nodes: source abstract syntax and core abstract syntax. Both of them are implement according to our paper precisely in package mlscript.ucs.syntax.
Stage Classes. The new desugarer consists of five stages, which correspond five classes in package mlscript.ucs.stages. The stages are the following.
- Transformation: transforming parsed AST to source abstract syntax.
- Desugaring: generating let bindings and expand patterns.
- Normalization: removing backtracking and convert core abstract syntax to case branches.
- PostProcessing: merging case branches with the same scrutinee.
- CoverageChecking: detecting unreachable cases and duplicated cases.
Scrutinee Management. We tracks useful information of each scrutinee's patterns by Context, Scrutinee and Pattern.
- Each If term is associated with a unique Context. A Context keeps track of all scrutinee and is in charge of allocating temporary variable names.
- Scrutinee represents a term that is being matched in UCS expressions. It keeps the tally of all patterns the scrutinee has been matched against as well as occurrences and sub-scrutinees of each pattern.

We support two kinds of new patterns in this PR.

Simple Tuple Pattern is a syntax sugar for selecting fields. It cannot be used together with other patterns.
Alias Pattern allows people name the entire term represented by a pattern.

PreTyper

PreTyper traverses the AST and associate each variable with symbol it refers to.
A Hierarchy of Symbols. Each symbol class maintains certain information useful to desugarer.
Scope maintains an association from variables to symbols.

DiffTests

Local flag :ducs controls the new Desugarer right now. It supports sub-flags.
- :ducs:[stage] where [stage] is one of transform, desugar, normalize, postprocess, coverage will display the debug messages from the specified stage.
- :ducs:[stage].result shows the pretty-printed outcome of the specified stage.
- :ducs:desugared shows the desugared MLscript term.
- :ducs:A,B,C shows debug messages for all A, B, and C.
Global flag :ShowPreTyperErrors emits warnings and errors reported by PreTyper.

Others

The utility trait Traceable abstracts debugging methods used in Typer, including trace, println, etc.
Implement freeVars for Located. The original implementation is broken. But my implementation is not perfect either.
Support parsing _ then ... and else in IfOpsApp.
Fix code generation for case expression of Num, Int, Bool, and Str.
Fix display of escaped characters in StrLit.
Add utility methods foldLeft and foreach to CaseBranchesImpl.

Future Works

This PR does not implement all features, but it lays the foundation for some of the following features.

Branch deduplication. Currently, we duplicate branches during normalization, which generates many code in some examples (e.g., JSON parser). We can reduce branches by abstracting duplicated branches into local functions.
Better tuple patterns. As mentioned before, we only implement "simple" tuple patterns. It is not real tuple patterns because the underlying MLscript CaseOf doesn't support tuples right now. However, it paves the road for complicated tuple patterns even tuple splices.

Besides, we can still improve location reporting in many places.

Nov 23 '23 20:11 chengluyu

🥱 I merged (rather than rebase) new-definition-typing because there are just too many commits. The good news is that the changed test files are all related to UCS now.

Jan 02 '24 21:01 chengluyu

Is it on purpose that there are several deleted files in the "migration" commit? https://github.com/hkust-taco/mlscript/pull/194/commits/d6bf3a38642274f7384f78d5a53ac1a69491a6d2

Jan 03 '24 01:01 LPTK

Is it on purpose that there are several deleted files in the "migration" commit? d6bf3a3

Yes, that's on purpose. Actually, there are only two examples, each with two files, which are the ones with NewParser and NewDefs activated respectively. One example had already been migrated in a previous commit, so I deleted the old files.

Jan 03 '24 07:01 chengluyu

So can shared/src/test/diff/pretyper/Errors.mls be removed now? It says:

// This file is used to test the error messages of the _unfinished_ `PreTyper`.
// The goal is to document `PreTyper` behavior and to make sure that the error
// messages are clear and helpful. We can delete this file after it is done.

Or is the pretyper still not being used by default?

Feb 02 '24 00:02 LPTK

The PreTyper is being used now. But the error messages and warnings it generates will be hidden, because they are redundant with the information generated by Typer. The purpose of retaining this file is to allow us to observe its changes when improving the PreTyper in the next PR.

Feb 03 '24 05:02 chengluyu

It seems there's something wrong in the handling of refined with multiple branches. I just hit this:


module None
class Some[out A](val value: A)
//│ module None
//│ class Some[A](value: A)


// OK
:ducs:desugar.result,postprocess.result
x => if x is
  refined(None) then x
//│ Desugared UCS term:
//│ if x*‡ is refined None then x
//│ Post-processed UCS term:
//│ case x*‡ of
//│   refined None*† -> x
//│ forall 'a. (None & 'a) -> (None & 'a)
//│ res
//│     = [Function: res]

// OK
:ducs:desugar.result,postprocess.result
x => if x is
  refined(Some) then x
//│ Desugared UCS term:
//│ if x*‡ is refined Some then x
//│ Post-processed UCS term:
//│ case x*‡ of
//│   refined Some*◊ -> x
//│ forall 'a. (Some[anything] & 'a) -> (Some[anything] & 'a)
//│ res
//│     = [Function: res]

// NOT OK
:ducs:desugar.result,postprocess.result
x => if x is
  refined(None) then x
  refined(Some) then x
//│ Desugared UCS term:
//│ if
//│   x*‡ is refined None then x
//│   x*‡ is refined Some then x
//│ Post-processed UCS term:
//│ case x*‡ of
//│   refined None*† -> x
//│   Some*◊ -> x
//│ (None | Some[anything]) -> Some[nothing]
//│ res
//│     = [Function: res]

Feb 19 '24 03:02 LPTK

I'm fixing not being able to report some unreachable cases and conflicting cases. I'm handling the situation we discussed before: some duplicated branches can also trigger these checks. The solution I take is to mark the duplicated syntax tree nodes, so we will not trigger warnings if these nodes are detected.

Apr 28 '24 20:04 chengluyu

The solution I take is to mark the duplicated syntax tree nodes, so we will not trigger warnings if these nodes are detected.

That's a good enough short-term solution, but after this PR is merged we need to think of a more principled and correct way of handling this!

Apr 29 '24 02:04 LPTK

That's a good enough short-term solution,

This solution is more limited than I thought. I expected it to be able to detect all the duplicated cases. Because a split might be concatenated as a fallback in the specialization for different scrutinees, a Boolean flag is not enough. I tried to turn this boolean flag into a map, but I need to modify the map in many places. I haven't been able to make it work well until now, and the code will also become very messy.

I feel it can figure out most of the duplicated cases, but it can't detect some of the covered cases, such as the one below.

fun f(x) = if
  x is A and
    x is B then 1
  x is B then 2

but after this PR is merged we need to think of a more principled and correct way of handling this!

I think a neater approach is to check on the desugared (rather than normalized) syntax tree. Although we can know which branches are removed during specialization, we don't know if they are removed due to duplication and which branches they overlap with because they need non-local information.

Lastly, I've also improved the warning message for different duplicated cases at the same time. There're three types of duplication.

Simple duplication (mainly caused by identical patterns)

 ```
 :w
 fun f(x) = if x is
 	"a" then 1
 	"b" then 2
 	"a" then 3
 //│ ╔══[WARNING] found a duplicated case
 //│ ║  l.10: 	  "a" then 3
 //│ ║        	  ^^^
 //│ ╟── there is an identical pattern "a"
 //│ ║  l.8: 	  "a" then 1
 //│ ╙──     	  ^^^
 //│ fun f: ("a" | "b") -> (1 | 2)
 ```

Duplication caused by inheritance

 ```
 :w
 fun f(x) = if
 	x is A then 1
 	x is B then 2
 //│ ╔══[WARNING] found a duplicated case
 //│ ║  l.27: 	  x is B then 2
 //│ ║        	       ^
 //│ ╟── the case is covered by pattern A
 //│ ║  l.26: 	  x is A then 1
 //│ ║        	       ^
 //│ ╟── due to the subtyping relation
 //│ ║  l.7: 	class B() extends A()
 //│ ╙──     	^^^^^^^^^^^^^^^^^^^^^
 //│ fun f: A -> 1
 ```

Always-true pattern (mainly due to subtyping)

 ```
 :w
 fun f(x) = if
 	x is B and
 		x is A then 1
 		p(x) then 2
 	x is A then 31
 	x is B then 3
 	else 4
 //│ ╔══[WARNING] the pattern always matches
 //│ ║  l.67: 	    x is A then 1
 //│ ║        	         ^
 //│ ╟── the scrutinee was matched against B
 //│ ║  l.66: 	  x is B and
 //│ ║        	       ^
 //│ ╟── which is a subtype of A
 //│ ║  l.4: 	class B() extends A()
 //│ ╙──     	^^^^^^^^^^^^^^^^^^^^^
 //│ fun f: (B | Object & ~#B) -> (1 | 31 | 4)
 ```

Always-false pattern (mainly due to matching a scrutinee against unrelated constructors)

:w
	fun f(x) = if
		x is X and
			x is Y and
				x is Z then 1
		else 2
	//│ ╔══[WARNING] possibly conflicted patterns
	//│ ║  l.105: 	    x is Y and
	//│ ║         	         ^
	//│ ╟── the scrutinee was matched against X
	//│ ║  l.104: 	  x is X and
	//│ ║         	       ^
	//│ ╟── which is unrelated with Y
	//│ ║  l.105: 	    x is Y and
	//│ ╙──       	         ^
	//│ fun f: Object -> 2

May 07 '24 20:05 chengluyu

I think a neater approach is to check on the desugared (rather than normalized) syntax tree. Although we can know which branches are removed during specialization, we don't know if they are removed due to duplication and which branches they overlap with because they need non-local information.

Makes sense. This can go in a future PR!

May 08 '24 04:05 LPTK

It seems there's something wrong in the handling of refined with multiple branches.

https://github.com/hkust-taco/mlscript/pull/194#issuecomment-1951634075

BTW was this fixed or at least reproduced in the tests?

May 08 '24 04:05 LPTK

BTW was this fixed or at least reproduced in the tests?

Yes, this was fixed.

May 09 '24 02:05 chengluyu