RFCs
RFCs copied to clipboard
Proper sum types and structural pattern matching
Abstract
Proper sum types and structural pattern matching for Nim by the introduction of union and extension of case with structural matching and and where clauses. Heavily inspired by Rust, Swift, Racket, Python, and others, and adapted to Nim's particular type system and syntax.
Motivation
I write a lot of Nim code, and I write a lot of Rust code. I find both to be excellent languages with their own unique tradeoffs and specialties. But there is one thing in particular I greatly miss when going from Rust to Nim: and that is algebraic data types and pattern matching.
Algebraic data types (ADTs), for those unfamiliar, are an extremely powerful data structure that are useful for modelling a wide variety of types. They allow, by the composition of product types and sum types (as well as ordinary types), the modelling a wide variety of different data structures.
Nim does have algebraic data types - objects may be composed arbitrarily. Their application to product types is straightforward: product types are directly analogous to structs (or in Nim, standard objects). Sum types complement product types and can be thought of as their dual: they act as a wrapper around different types, letting you interact with their inner structure only by safely pattern matching upon the container. They are best represented by tagged unions, however, which Nim lacks a direct implementation of.
Variant objects are capable of modelling tagged unions and thus sum types. These are useful as, well, variants of objects, but have a number of serious drawbacks when used as sum types that the first portion of this RFC aims to address.
- The names of fields in variant objects must be unique - they cannot be repeated, even across different variants.
- A separate enum must function as a "kind" tag and be discriminated on by case within the object. Having a separate single-use enum is undesirable, and having to discriminate against it is awkward.
- Being forced into an object structure is awkward: the "type wrapper" approach of tagged unions is more ergonomic, especially when there are no shared fields, as is often the case.
As for pattern matching: Nim has the case statement, which alone is sufficient to make use of the inner contents of an sum type (or currently, switch on a kind label). But structural pattern matching is the bread and butter of ADTs: aside from simply being cleaner than nested if statements, it provides guarantees of exhaustiveness not otherwise given by the alternative. It is perhaps best exemplified by the ML family of languages (including Rust): but other languages have picked it up offering their own takes on it, notably Racket, Swift, and Haxe. This RFC pulls from a number of different designs to provide a powerful and Nim-esque model.
Description
This RFC is two-fold. It introduces a new fundamental type, union, with similar semantics to Rust's enum. It also extends Nim's existing case statement with support for structural pattern matching and particularly that of unions/objects/tuples (although it supports lists as well), and introduces a new supplemental where clause in order to cleanly and powerfully do so.
union is a type that functions similarly to enum at a surface level. It takes a list of identifiers as fields/variants/values, and an instance of a union may only ever be one of the fields. However, each field of the union may additionally be of its own entirely separate type (or none). This is best exemplified with the classic "shape" example:
type Shape = union
Point
Line: int
Square: tuple[side: int]
Rectangle: tuple[x, y: int]
Triangle: TriangleKind
# type TriangleKind = enum Scalene, Isosceles, Equilateral
These fields of the unions are unordered, unqualified (via the same mechanism as overloadable enums), and unique. The inner fields are each of distinct types: entries used solely for their label (ex. Point) are implicitly of void type. Instantiating a union, then, is done by picking a particular variant, and providing data for the construction of its associated type (or none) as needed. The syntax of this is again best shown with an example. As a special syntax rule, tuples and objects may drop their extraneous parentheses on construction.
var shape: Shape
shape = Point()
shape = Line(5)
shape = Square(side: 5)
shape = Rectangle(x: 5, y: 6)
shape = Triangle(Scalene)
To support proper and powerful pattern matching, we introduce three new extensions to the language: an extension for existing case/of statements to support (nested) structured types (array, seq, tuple, object, union), an extension for of expressions in regular conditionals to support the same, and a new where expression that functions as a reusable "guard". Just like union, these language features combine to provide power with minimal new syntax. Following existing case semantics, this pattern matching may be exhaustive (to be described later), or non-exhaustive and followed by elif and else statements.
Elements of a tuple may be matched in two ways: by providing a value, which will match all tuples with that value at that location, or by binding them to a new identifier, which will match all tuples with any value at that location, and will then be made available as a variable within the branch. New variables introduced by pattern matching in this fashion follow the mutability of the matched-upon statement they are a part of. As is the case elsewhere, binding an element to the _ identifier discards it.
where guards allow for restricting the matching of a case based on an arbitrary conditional expression. They have access to any bound fields: and so can be used to restrict variables to a value while still allowing access to that value among other, arbitrary conditions. where clauses follow the main match section of an of line and are considered to be attached to the main match rather than the of itself to allow for the reuse of labels to follow existing case semantics: see the example below. Of important note is the intersection between variable bindings and repeated cases: in order for a variable to be accessible within the body of the case, it must be bound (in a consistent location) in every label of that statement. We do not disallow this at a binding level because the bindings are still helpful for where clauses.
The exhaustiveness of a pattern match is determined on a best-effort attempt. Fields of booleans, enums, and unions consist of few possibilities and may be exhaustively matched upon. Guards do not contribute to the exhaustivity of a search as they may contain arbitrary conditional expressions: and so must typically be followed by a case for the same match without the guard, or else.
Named tuples and objects are matched identically to tuples, with one exception: the identifier name when binding a field must be the name of the original field. If this is undesired, the syntax oldident: _ as newident is introduced to allow for explicit rebinding. Unions are matched by specifying their label, and then their inner type in parenthesis. Tuples and objects within a union may elide their parenthesis.
func match(shape: Shape) =
case shape
of Point():
...
of Line(_ as x) where x in 0..4:
...
of Line(_ as x):
...
# matching an existing variable
of existingShape:
...
# matching an existing variable, convolutedly
of _ as x where x == existingShape:
...
# tuples within a union elide their external parenthesis.
of Square(side: _ as x):
assert declared(x)
assert not declared(side)
# Rectangle(x, y) is syntax sugar for Rectangle(x: _ as x, y: _ as y)
# matching multiple expressions follows case semantics
of Rectangle(x, y) where x == y, Rectangle(0, _), Rectangle(_, 0):
# x is not bound in every case! so it is not accessible
...
else:
...
Previous thoughts on matching sequences, arrays, and sets are collapsed in the below block. I no longer think they're terribly important. They are also not updated for the new backwards-compatible _ as x syntax.
Matching sequences, arrays, sets
Matching sequences and arrays is done slightly differently. To distinguish them from tuple and object matching, brackets are used instead of parenthesis. These are not elided within a union type as the parentheses of tuples and objects are. Matching of elements is done similarly to tuples: individual elements of the sequence/array are matched positionally, the variable/value matching distinction remains the same, andwhere guards are again available for conditional treatment of bindings.
As sequences may be of unknown length, and arrays may have uncomfortably long lengths, the .. operator is introduced/reused to denote a range of fields we do not care about matching, with the syntax [first, second, .. last]. As a result, the position of fields may be specified either from the beginning or from the end of the list - it is not possible to match a field an arbitrary distance into a long list within the match pattern itself. Again, however, where guards provide a more generic way to do so.
import std/[sequtils, sugar]
func lists[T](list: openarray[T]) =
case list
of [1, 3, 5]:
...
of [first, second, third .. penultimate, last]:
...
of x where x.len > 10 and x[10] == true:
...
# lists may be arbitrarily nested, just like every structured type
of [[1, 2, 3], [4, 5, 6], [7, 8, 9]]:
...
# match an arbitrary list of integers
of x where typeof(x) is openarray[int]:
...
# match a list of entirely 1s:
of x where x.all(y => (y == 1)):
...
# the following three statements match identically:
of [_, _, _]:
...
of x where x.len == 3:
...
elif list.len == 3:
...
else:
...
Matching sets and tables mostly follows list semantics. Both are wrapped in braces instead of brackets or parentheses. As the elements of sets and tables are never named: the : does not function as it does in tuples and objects, instead, it takes its regular semantics within tables to denote a key-value pair. Of additional note is the .. operator: as sets and tables can be of unknown length it is helpful: however, as they are unordered, the .. operator must come last.
func sets(bitset: set[int8]) =
case bitset
# match the set containing exclusively 1
of {1}:
...
# match a set containing at least 1
of {1, ..}:
...
# match a set containing only two distinct elements
of {x, y}:
...
# match a set containing any two distinct elements fulfilling a condition
of {x, y, ..} where x + y == 10:
...
else:
...
func tables(table: Table[int, string]) =
case table
# match a table containing only these key-value pairs
of {5: "foo", 6: "bar"}:
...
# match a table containing at least these key-value pairs
of {5: "foo", 6: "bar", ..}:
...
# match a table containing any key matching a condition
of {x: y, ..} where x == y.len:
...
While the exhaustivity of pattern matching is important: frequently you want to disregard it, and only worry about a single case. Rust provides the if let construct for this use case, which I find very useful but pretty gross syntactically. Instead, this RFC uses the syntax if shape of Square(side: binding). The bindings become immediately available for use within the body of the if statement and within any parts of the if statement following that clause, as shown by the following examples:
shape = Rectangle(x: 5, y: 6)
if shape of Rectangle(x, y) and x == 5:
echo "y is " & $y
else:
echo "failure."
Following the semantics above, these extended case and if ident of match statements should also allow unwrapping Option[T] types. What this looks like is not yet specified.
Code Examples
Algebraic Data Types
An example that perhaps more showcases the utility of ADTs is that of an abstract syntax tree for the simple lambda calculus.
type Ident = string
type Expr = ref union
Constant: Term
Variable: tuple[id: Ident]
# Annotation: tuple[expr: Expr, kind: Type]
Abstraction: tuple[param: Ident, fn: Expr]
Application: tuple[fn, arg: Expr]
Conditional: tuple[cond, `if`, `else`: Expr]
Note that other structures needed for an implementation of the simple lambda calculus (including a type system) are also efficiently and aesthetically modelled with ADTs.
type Type = union
Empty, Error, Unit, Boolean, Natural, Integer, Float, String
Enum: seq[Type]
Struct: Table[Ident, Type]
Function: tuple[from, to: ref Type]
type Term = union
Unit
Boolean: bool
Natural: uint
Integer: int
Float: float
String: tuple[len: uint, cap: uint, data: seq[uint]]
Enum: tuple[val: uint, data: seq[Type]]
Struct: Table[Ident, Term]
Function: Expr
Pattern Matching
var globalSymbolTable = initTable[Ident, Term]()
func execute(ast: Expr): Term =
case ast
of Constant(_ as term):
return term
of Variable(id):
return globalSymbolTable[id]
# other structures of applications + abstractions are invalid
of Application(Abstraction(param, fn), arg):
globalSymbolTable[param] = execute(arg)
return execute(fn)
of Conditional(cond, `if`: _ as ifClause, `else`: _ as elseClause):
if execute(cond) == Boolean(true):
return execute(ifClause)
else:
return execute(elseClause)
else:
echo "Failed to execute AST! Improper structure"
quit()
Backwards Compatibility
This RFC is almost fully backwards compatible with Nim. It will cause issues with anything using the new union or where keywords, and with any templates or macros directly inspecting case or of.
A previous version of this RFC had an issue around pattern matching: the previous syntax was of x, where x is a newly bound identifier. This conflicted with the existing semantics to resolve constants in those expressions. This was fixed by changing pattern matching to require explicitly rebinding the catchall character _, i.e. _ as name, with syntax sugar in selected cases (named tuples & structs).
Notes
To end with some notes: the general design of where guards is to provide all of the parts of pattern matching that cannot be exhaustive, in one general construct. Hence the use of where for things regarding ranges, lengths, existing variables, generic types (although: type constraints can be exhaustive. this is specific enough that i thought it not worth special syntax). where was also chosen over reusing if because I wanted to emphasize that there is not a corresponding elif/else as it is, already, what amounts to an elif clause within the broader case statement.
I think I like oldident: newident for field renaming. It matches object constructors. oldident as newident is an alternative, however.
I only mentioned it briefly, but because I've seen it discussed in other implementations of pattern matching in Nim: the mutability of the introduced bindings while pattern matching follows from the mutability of the matched-upon object, that is, you're directly mutating the matched-upon object. Other languages follow this pattern and I see no reason why this + variable shadowing if you really want to make it mutable are good enough.
I'm considering allowing guards to be reusable on for blocks a la Swift: with for binding in iterator where condition acting as syntax sugar to the common for binding in iterator: if condition. Is this desired? Is this good or bad for language consistency? Are there other places where clauses could be used?
The else statement and case _ are functionally the same. I believe this is currently the case.
References and Acknowledgements
- https://github.com/nim-lang/RFCs/issues/245 and https://github.com/nim-lang/RFCs/issues/368
- Nim: patty, gara, fusion/matching, fungus
- Python: PEP 635 and PEP 636
- Racket: the
matchstatement - Swift: the
wherestatement - Grain: nested pattern matching
- Rust: the
matchstatement
Many thanks to fungus: which implements a near-identical union and quite similar match already as macros.
Many thanks to patty: particularly for their "does not yet work" section, ha.
Many thanks to Beef, for having good taste and calling my stupid ideas stupid.
Is it really necessary to change the language syntax for this? Maybe it would be better to standardize macros that implement the same thing on the existing syntax?
The current variant objects are really quite unconvinient, and it would be nice to have syntactic sugar over them to create algebraic types.
- Python: Success, used by people, part of StdLib (very recent on very old lang).
- Racket: Success, used by people, part of StdLib.
- Swift: Success, used by people, part of StdLib.
- Grain: Success, used by people, part of StdLib.
- Rust: Success, used by people, part of StdLib.
- Nim: Fail, not widely used by people, not part of StdLib.
🤔
That doesn't mean much beyond the fact that people avoid dependencies for various reasons.
So what does this RFC accomplish that e.g. fusion/matching does not?
You don't need to teach me about ML and Rust and Racket and Foozbaz, I know. You need to tell me why you found the existing solutions based on macros lacking.
fact that people avoid dependencies
We agree.
So what does this RFC accomplish that e.g. fusion/matching does not?
The union type is new. Something similar exists as a macro in beef/fungus, but the implementation differs from how it would be implemented in the compiler for performance and semantic reasons.
fusion/matching also provides comprehensive structural pattern matching, there is no feature discrepancy between the two. This RFC is an attempt to integrate it into the language more: by means of reusing existing case syntax and pulling things that would require specific new syntax and not help with exhaustiveness into where clauses (this is described in more detail in the first paragraph of the notes section above).
After work, I'll write up a detailed comparison between the two because I think it reveals some interesting design decisions. But a place to look for now, if interested, is the difference between this RFC's use of where guards vs. fusion/matching's use of until, all, some, @. Revisiting fusion also reminds me I forgot to describe how tables are matched. It's exactly as you would expect and consistent with tuples etc, I'll add this later too.
Also, a little note on other languages: despite Racket's match statement being the most powerful of the bunch and Python being the closest language to Nim syntactically, I took the least inspiration from them. I find Python's match to be grotesquely complex for relatively little benefit, and think where clauses a la Swift to be a much nicer solution. This proposal is also purposely less powerful than Racket: Racket's ability to match on repetitions of arbitrary nested list expressions requires special syntax and is only particularly helpful to a list-oriented language. Nonetheless, it can be done with where clauses if needed (this is a trend...).
Very nice!
ADT's are to Nim case types what scalars are to vectors - you can express everything an int can express using a seq[int] but doing so is cumbersome in code and misses out on many of the assumptions you can make and therefore on the help that the language and compiler can bring.
It's true that the proposed type sort of is a special case of a case object with a specific form of enum, exactly one discriminator and no other members - but that special case is useful in many constructs which logically express "one and only one of these things" - the rest is followup on top of that concept.
Rather than asking "can macros solve this?", the better question is "can we create better macros with this feature present", and there the answer is undoubtedly yes, because it's such a fundamental construct that appears over and over in many code patterns.
Many things become more simple when you can these assumptions about a type: as mentioned already: resetting or overwriting during serialization, exhaustive case statements, etc.
Here's a simple example which relies on the knowledge that a type takes on exactly one shape: because the type is guaranteed to be exactly one of the enumerated options, we can reason about that option without specifying the full type but still express a meaningful relationship to it. In Result[T, E] this is often useful because you want to be able to express a type that matches any Result[T, X] where X is known but T is not to express the idea that you are working with "the error side of type X of any Result" without knowing the exact type of the result (when returning an error you don't care about the "value" side - this is supplied by the context in which the error is being returned, not the error return itself) - such a type can safely be converted to a "full" Result because it is known per language semantics that Result itself takes one and only one of the given options.
I generally like the idea and I was one of the complainers about the PM situation. But as I said in Discord, if this ever gets accepted, who is going to take ownership and implement it. Especially PM? I really hope that the roadmap is unaltered and IC still comes right after 2.0.
So what does this RFC accomplish that e.g. fusion/matching does not?
At the very least, it will be a workaround for this :P https://github.com/nim-lang/Nim/issues/20435
How would the proposed syntax distinguish between a single field that is a tuple and multiple fields? Using tuples for the latter is not nice imo. I'd prefer something closer to the syntax for constructing/matching the variants (which is also how Rust and Swift do it):
type Shape = union
Point()
Line(int)
Square(side: int)
Rectangle(x, y: int)
Triangle(TriangleKind)
In my unwritten proposal I used this syntax:
type
Tree = ref case kind # also possible without the `ref`
of StringLeaf:
str: string
of IntLeaf:
x: int
of Parent:
kids: seq[Tree]
No new keyword required and the syntax is optimized for pasting into a case statement.
@konsumlamm Currently:
type Shape = union
Rectangle: tuple[x, y: int]
type Shape = union
Rectangle: tuple[tuple[x, y: int]]
I picked this syntax to be as close to Nim's existing semantics as possible while still providing the full power of ADTs. I like your syntax more aesthetically, but IMO it doesn't fit in with tuple / object types as well. Also note that the inner type of Rectangle in both of our examples is tuple[x, y: int]: even if it's not explicitly specified in yours.
@Araq I assume the unwritten proposal was mostly for syntactic sugar over the existing variant object syntax? A benefit of the union proposal and ADTs as a whole is that they let you abstract over arbitrary types, not just named fields of objects. (eg. you don't really need the labels str, x, kids in your Tree implementation, just the types)
eg. you don't really need the labels str, x, kids in your Tree implementation, just the types
That's a good point but easily fixed:
type
Tree = ref case # also possible without the `ref`
of StringLeaf:
string
of IntLeaf:
int
of Parent:
seq[Tree]
The advantages remain: Natural Nim syntax, no new keyword, easily turned into a case statement.
I disagree that that's natural Nim syntax, I find it subjectively very ugly, the lack of a field following case inconsistent, and the raw types in branches weird. I don't think being easily turned into a case statement is a particular advantage, and it really doesn't follow type declaration syntax or semantics (I would be quite thrown off learning Nim for the first time).
I'm not particularly attached to the union keyword but I think introducing a new keyword to make it a first-class type on the level of object / tuple / enum would be worth the tradeoff from a language design perspective.
I suppose we could also do this leaving eg. nested unambiguously available for use elsewhere, though I don't like it.
type Tree = ref nested enum # oops need indirection
StringLeaf: string
IntLeaf: int
Parent: seq[Tree]
I care about keeping the of branches, if you want to change case to sumtype or union I don't mind it. Though both sumtype and union are ugly. Maybe enum?
type
Tree = ref enum # also possible without the `ref`
of StringLeaf:
string
of IntLeaf:
int
of Parent:
seq[Tree]
type Tree = nested enum
StringLeaf: string
IntLeaf: int
Parent: seq[Tree]
is wrong for two reasons: You need the ref indirection in order to nest it. And StringLeaf: string is not a field of type string, it is a branch that supports an unnamed string field.
That kinda goes against the existing semantics of enum being explicitly an enumeration though, and also conflicts a bit semantically (not in syntax) with the existing feature that lets you set labels of an enum to values. My original approach was to overload the enum type but I found it conflicted too much with existing expectations of enum.
Now that I think about it more, I really don't like the idea that you can copy and paste the of branches into a case statement actually: most of the times when I use ADTs in other languages I don't want to handle all the cases, and/or want to handle particular cases differently, via structural pattern matching or guards or what not. The extended examples I picked for this RFC are a bit contrived in that they are for interpreting a language where you have to handle every case without restrictions. I think that'd kind of lead people in the wrong direction wrt. pattern matching.
re: ref, oops. ref indirection is actually supported and necessary in this RFC btw, there's some examples in the collapsed Code Examples block.
And StringLeaf: string is not a field of type string, it is a branch that supports an unnamed string field.
Yeah, even though they're comparable to objects and the syntax looks similar, they're not. This is one of the few parts that differs from existing semantics. I personally think it's fine - concepts do the same thing and introduce different-than-objects but still very consistent syntax within their type declaration.
I think that'd kind of lead people in the wrong direction wrt. pattern matching.
Actually, this is how you ensure correctness, having thought about it, you handle every case explicitly. It's the "oh and in all other cases (else) do ..." that produces bugs. In the Nim compiler and everywhere else.
Maybe case object?
If a non-macro solution for ADTs/Pattern Matching is finally being considered, then my vote goes to a rust/haxe like syntax only. A lot of the syntaxes suggested above are confusing / error prone / ugly just like the existing object variant syntax (imo).
ML style ADTs would be a great idea and I won't quibble about the syntax but I worry that too much conditional freedom on each match arm eg. x where typeof(x) is ... will cripple exhaustiveness checking which is biggest unseen benefit of ADT's. I would be happy with much less pattern matching expressivity in return for the compiler flagging missing patterns. IMO previous efforts like fusion/matching and patty do too much.
Have you looked D's sumtype? It is inspired by ADTs but leans heavily into static introspection and would be a nicer fit for Nim vs. trying to copy Haskell/OCaml.
Have you looked D's sumtype? It is inspired by ADTs but leans heavily into static introspection and would be a nicer fit for Nim vs. trying to copy Haskell/OCaml.
D's sumtype doesn't support nested patterns, which is half the point of pattern matching and would make a lot of things more cumbersome, so I don't think that's an approach we should follow.
D's
sumtypedoesn't support nested patterns, which is half the point of pattern matching and would make a lot of things more cumbersome.l, so I don't think that's an approach we should follow.
It's more cumbersome but when you can only destructure on one thing at a time static analysis is much easier, you can get better IDE support, possibly even exhaustiveness checking, and maybe even case splitting like Idris.
possibly even exhaustiveness checking
Nim's case statement already does exhaustiveness checking and I have no intention of moving to something that lacks it.
Nim's case statement exhaustiveness checking is pretty brittle:
type
AnEnum = enum
A
B
let e = A
case e
of A: echo "A"
else:
case e
of B: echo "B"
Here I get the error:
/tmp/testcase.nim(11, 3) Error: not all cases are covered; missing: {A}
It's not "brittle", it works as intended. I'm also not familiar with any mainstream language that offers exhaustiveness checking over nested case/match statements in the way that you outline.
Not mainstream, and not completely generally but with a ton of effort Haskell does. But the point is by making the pattern matching more "cumbersome" but also more analyzable you can get this but without most of the implementation complexity.
@deech Ah, I'll write up a little thing on this with more detail. But basically: I'm very aware of this, and designed it so that the where statement is used for things that cannot be exhaustively checked.
@metagn I don't love case object - I think that gives the impression that it is fundamentally an object, which it isn't. Maybe case enum though? Though I do much prefer a new keyword.