dotty-feature-requests
Add Records To Dotty
This is a rough sketch of a possible design of records for Dotty.
Types
A single-field record type is of the form (a: T), where a is an identifier and T is a type. It expands into an instance of a trait Labelled$a[T]. We assume that for every field name a present in a program the following trait will be automatically generated:
trait Labelled$a[+T](val a: T)
A multi-field record type (a_1: T_1, ..., a_n: T_n) is equivalent to the intersection of single-field record types (a_1: T_1) & ... & (a_n: T_n). Since & is commutative, the order of fields does not matter.
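For concreteness, here is a minimal sketch (not part of the proposal itself; Person, name and age are made-up example names) of what the two-field record type (name: String, age: Int) would expand to under this scheme:

trait Labelled$name[+T](val name: T)
trait Labelled$age[+T](val age: T)

// (name: String, age: Int) then denotes the intersection of the two single-field record types
type Person = Labelled$name[String] & Labelled$age[Int]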
A row type is a tuple of label/value pairs. Each label is a string literal. Unlike for record types, order of labels does matter in a row type.
The base trait Record is defined as follows:
trait Record { def row: Row }
Here, Row is assumed to be a generic base type of HLists. Record values are instances of Record types that refine the type of row.
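As an illustrative sketch (the proposal leaves the concrete Row type open; here Row is modelled as an ordinary tuple of label/value pairs, and PersonRow/PersonRecord are made-up names), the refined type of a record with fields name and age could look like this:

// "name" and "age" below are string literal types
type PersonRow    = (("name", String), ("age", Int))
type PersonRecord = Record { def row: PersonRow }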
Values
A record value is of the form (a_1 = v_1, ..., a_n = v_n). Assuming the values v_i have types T_i, this is a shorthand for
new Record with
  Labelled$a_1[T_1](v_1) with
  ... with
  Labelled$a_n[T_n](v_n) {
  def row = (("a_1", a_1), ..., ("a_n", a_n))
}
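Spelled out for a concrete (made-up) example, the record value (name = "Ann", age = 42) would desugar to:

new Record with
  Labelled$name[String]("Ann") with
  Labelled$age[Int](42) {
  def row = (("name", name), ("age", age))
}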
TODO: Define equality.
TODO: Define how to create a record value from a generic HList representation - on the JDK it seems we can use Java's Proxy mechanism for this.
Very happy to see an effort happening in this direction in Dotty! Kicking off the discussion here with a few problems that need to be solved. Maybe answers to some of them already exist and it would be good to capture them here.
- If two separate compilation units both define (foo = "bar"), will both have their own trait Labelled$foo defined, or will generation happen once in a global place (e.g. at linking time)? If not in a global place, how will they be compatible with each other? Do we anticipate a solution for this without runtime reflection?
- Is the fact that records are composed from traits in any way relevant to the user, or is it just an implementation detail that is not even supposed to leak? Or in other words, is it just an internal way to encode records using existing Dotty ASTs instead of new constructs?
- How will records behave with regards to subtyping when checking an expected record signature against a given record type? I suppose this just derives from how Dotty treats & intersection types. Are there docs/papers about that aspect yet?
- One very interesting aspect of the ScalaRecords implementation pushed by @vjovanov and others is virtualization, in the sense that records are just a type-safe view into an arbitrary data structure (defaulting to an underlying Map). This does allow using records where runtime performance is critical. Is there a story for that?
- It would be a nice property if Dotty's Record syntax could be used as a surface syntax for different record implementations. This would probably remove implementation-specific complexity from the compiler. It would be especially helpful as records are not a very common language construct in type-safe languages yet and real-world experience is still to be gained. Allowing different implementations would help explore the solution space faster (and for example allow eventual standardization of a solution with practice-proven properties).
As a side note: if Record syntax were realized as a purely syntactic desugaring, a significant difference from existing desugarings like for-comprehensions would probably be that a desugaring is needed for types as well, not only for values.
cc @vjovanov @gzm0
Very interesting! What use-case did you have in mind for this? Improving the fundamental structure, a general building block for things like HLists or improving things like joins in Slick (a bit like these anonymous types in LINQ)? Will Tuples be subsumed by this? How will subtyping work in general between row types and classes?
@cvogt
If two separate compilation units both define (foo = "bar"), will both have their own trait Labelled$foo defined or will generation happen once in a global place (e.g. at linking time)?
I don't think it matters. We can generate the class as often as we like - it will always be the same class. Of course, the compiler can avoid generating if it knows it exists already.
Is the fact that records are composed from traits in any way relevant to the user or just an implementation detail that is not even supposed to leak? Or in other words just an internal way to encode records using existing Dotty ASTs instead of new constructs?
Since the Labelled traits have $'s in their names they are considered an implementation detail. The Record trait matters, though. I.e. you could write
Record & { name: String, age: Int }
and get the capability to enumerate all values via row.
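A minimal compilable sketch of what that capability could look like, modelling Row as scala.Tuple (the proposal leaves the concrete Row type open; printFields is a made-up helper):

trait Record { def row: Tuple }   // Row modelled as an ordinary tuple of (label, value) pairs

def printFields(r: Record): Unit =
  r.row.productIterator.foreach {
    case (label, value) => println(s"$label = $value")
  }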
How will records behave with regards to subtyping when checking an expected record signature against a given record type? I suppose this just derives from how Dotty treats & intersection types.
Exactly.
One very interesting aspect of the ScalaRecords implementation pushed by @vjovanov and others is virtualization in the sense that records are just a type-safe view into an arbitrary data structure (defaulting to an underlying Map). This does allow using records where runtime performance is critical. Is there a story for that?
I think Proxy could be the link here. Also to be able to use the syntax with different implementations.
Last time I checked JDK Proxies were quite slow.
@viktorklang more details on JDK proxy performance: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=7123493
Some small details can be found here: https://github.com/tootedom/concurrent-reflect-proxy The last link additionally provides a faster implementation of Proxy classes, but even those are ridiculously slow.
@odersky I could see Proxy as a building block, if we create a compile-time Proxy. It sounds like a quite different approach to desugaring than for other Scala concepts, though.
@soc The original use case is indeed database tables. It complements tuples, does not subsume them. It also assumes that Tuples get some kind of HList structure.
If two separate compilation units both define (foo = "bar"), will both have their own trait Labelled$foo defined or will generation happen once in a global place (e.g. at linking time)?
I don't think it matters. We can generate the class as often as we like - it will always be the same class. Of course, the compiler can avoid generating if it knows it exists already.
I don't think this would work in environments which enforce Classloader isolation, like OSGi or JEE, and possibly Java 9. If multiple parts of your app contain independently generated copies of these classes, you would either have to export them, which causes conflicts, or make them private, which prevents records from working across modules.
Adding to what @szeiger said: isn't the class file placement problem very similar to that of the discarded idea of adding interfaces to things to make structural types fast?
As an aside, I'm actually skeptical about using records for database joins nowadays. It's simple in relational algebra but not a good fit for a language like Scala. First you have to give up on classes, traits, nesting, and any kind of abstraction that goes above flat records of primitive values. It's not how Scala usually works and even modern SQL databases are more expressive than that.
Then comes the matter of missing values. Inner joins are good for toy implementations, but in the real world you need outer joins, too. If you have nullable primitives like C#, you can at least do:
(v1,...,vn) ⟕ (w1,...,wn) → (v1,...,vn, (nullable w1),...,(nullable wn))
With Option instead of nullable values in Scala, it gets uglier because t → Option[t] is not idempotent:
(v1,...,vn, o1,...,on) ⟕ (w1,...,wn, p1,...,pn) → (v1,...,vn, o1,...,on, Option[w1],...,Option[wn], p1,...,pn)
where o and p are Option types and v and w are non-Option types.
Contrast that with a left outer join in Slick which is simply:
C[v] leftJoin C'[w] on ((c, w) → Boolean) → C[(v, Option[w])]
No distinction between nullable and non-nullable source fields, no need for flat tuples. Semantics are not 100% identical (in Slick you can distinguish between a result row where the right-hand side was missing and one where the right-hand side was matched as all null values) but in practice it usually doesn't matter. You have to give up natural joins but that seems like a small price to pay.
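To make the shape difference concrete, here is a minimal sketch in plain Scala collections (not Slick's actual API; A, B and leftJoin are made-up names):

case class A(a1: Int, a2: String)
case class B(b1: Int, b2: String)

// Left outer join where the whole right-hand row becomes a single Option,
// instead of every right-hand field being wrapped individually.
def leftJoin(as: Seq[A], bs: Seq[B])(on: (A, B) => Boolean): Seq[(A, Option[B])] =
  as.flatMap { a =>
    val matches = bs.filter(b => on(a, b))
    if (matches.isEmpty) Seq((a, None)) else matches.map(b => (a, Some(b)))
  }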
@szeiger I weakened the language to admit other implementation schemes that would not rely on class generation. Classloaders are really a nuisance, it would be so nice to be able to ignore them. Maybe using the upcoming(?) ClassDynamic?
Maybe not that relevant, but I looked into the cases where Scala nowadays uses tuples rather than classes and found some cases where records could make code more readable. Assuming that a record {a: 1, b: 2} is somehow a tuple (1, 2) (as defined by the row function), one could come up with the following ideas:
Pattern matching
Names in pattern matching would be an option. This would be useful in cases where most of the extracted values are not required (currently leading to many underscores) or where one just gets confused by the order of the values.
case class Person(name: String, age: Int, <more fields>)
val persons: List[Person] = ???

persons.collect {
  case Person(age = a) if a < 18 => "Child"
  case Person(age = 18, name = n) => n + "!" // picked only the required fields
  case Person(name, _, _, _) => name         // old syntax should still be valid
}
This is a problem that we currently have in some production code, mostly in tests with Scala Test matchPattern.
Magnet pattern
Records could address the problem in the magnet pattern that named parameters are not supported. (See)
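A hypothetical sketch (using the record syntax proposed in this thread, so not valid Scala today; CompletionMagnet, fromStatusAndBody and complete are made-up names) of how a record could serve as a magnet while keeping parameter names at the call site:

trait CompletionMagnet { def run(): Unit }

implicit def fromStatusAndBody(r: (status: Int, body: String)): CompletionMagnet =
  new CompletionMagnet { def run(): Unit = println(r.status + ": " + r.body) }

def complete(magnet: CompletionMagnet): Unit = magnet.run()

complete((status = 200, body = "OK")) // the field names survive at the call site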
Give tuple fields a concrete name
The next code snippet could be nice for Scala beginners when writing their first for-each loop over a map:
class Map[K, V] extends Iterable[{key: K, value: V}] { … }
val map = new Map[String, String] { … }
map foreach { entry =>
  println(entry.key + ": " + entry.value)
}
Preserving parameter names after tupled
I think the last example is more interesting from a documentation perspective. Currently we lose the parameter names of a method after wrapping it into other functions like scalaz.Memo, leaving the caller guessing which parameter does what.
def div(dividend: Int, divisor: Int) = dividend / divisor
val memoDiv = scalaz.Memo.weakHashMapMemo((div _).tupled)
// memoDiv is a Function1[{dividend: Int, divisor: Int}, Int],
// not just a Function1[(Int, Int), Int]
memoDiv(divisor = 3, dividend = 9)
memoDiv(9, 3)
The last line would require that an (Int, Int) is somehow an {a: Int, b: Int}. This relation shouldn't be transitive (like an implicit conversion), since together with the assumption above that an {a: Int, b: Int} is an (Int, Int) it would imply that an {a: Int, b: Int} is an {x: Int, y: Int}.
It would be great to have something similar for sum types.
For instance, consider the following type hierarchy:
sealed trait Foo
case class Bar(s: String, i: Int) extends Foo
case class Baz(b: Boolean) extends Foo
It could desugar to the following:
sealed trait Foo extends Sum[Either[("Bar".type, Bar), Either[("Baz".type, Baz), Nothing]]]
case class Bar(s: String, i: Int) extends Foo {
  val sum = Left(("Bar", this))
}
case class Baz(b: Boolean) extends Foo {
  val sum = Right(Left(("Baz", this)))
}
Where Sum[A] is defined as follows:
trait Sum[A] {
  def sum: A
}
This would enable generic programming on sum types rather than just records:
trait ToJson[A] {
  def toJson(a: A): Json
}

implicit def toJsonSum[A](implicit
  toJsonA: ToJson[A]
): ToJson[Sum[A]] =
  new ToJson[Sum[A]] {
    def toJson(sumA: Sum[A]) = toJsonA.toJson(sumA.sum)
  }

implicit def toJsonEither[A, B](implicit
  toJsonA: ToJson[A],
  toJsonB: ToJson[B]
): ToJson[Either[(String, A), B]] =
  new ToJson[Either[(String, A), B]] {
    def toJson(eitherAB: Either[(String, A), B]): Json =
      eitherAB match {
        case Left((name, a)) => Json.obj(name -> toJsonA.toJson(a))
        case Right(b) => toJsonB.toJson(b)
      }
  }

implicit def toJsonNothing: ToJson[Nothing] =
  new ToJson[Nothing] {
    def toJson(nothing: Nothing): Json = sys.error("This can not happen")
  }
The reason why Dataframes are so much more pleasant to work with in a language like Python is that typing them strictly is a pain... It would be awesome to be able to keep a notion of column types in a Record-like fashion.
My ultimate dream is something like this (vague pseudo-Scala code):
val df: Dataframe[{name: String, date: LocalDateTime}] = ???

def addDayOfWeek[T <: {name: String, date: LocalDateTime}]
    (df: Dataframe[T]): Dataframe[T | {dayOfWeek: Int}] =
  df append df.map(_.date.getDayOfWeek)

val df2: Dataframe[{name: String, date: LocalDateTime, dayOfWeek: Int}] = addDayOfWeek(df)
where it would have to hold that
{a: A} & {b: B} =:= {a: A, b: B}
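Under the Labelled trait encoding from the original sketch, this equality would hold by construction, since both the multi-field record type and the intersection of single-field record types expand to the same intersection of Labelled traits. A small compilable illustration with concrete (made-up) labels:

trait Labelled$name[+T](val name: T)
trait Labelled$dayOfWeek[+T](val dayOfWeek: T)

// {name: String} & {dayOfWeek: Int} and {name: String, dayOfWeek: Int} both denote
// the intersection below; & is commutative, so the field order is irrelevant.
val ev = summon[
  (Labelled$name[String] & Labelled$dayOfWeek[Int]) =:=
    (Labelled$dayOfWeek[Int] & Labelled$name[String])
]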
I am no expert in this area, but I feel like having first-class support for Record types in the language (compared to having it done by macros in shapeless) might be the way to get closer to that point.
I'd love to contribute. I wish I had these ideas back when I was a student and could do that as a part of some thesis.
I see that Record Types are supposed to be in progress. How far is this from foreseeable reality? Am I breaking any of the assumptions or laws of the DOT calculus?
Apparently, this proposal is dormant and overlaps with structural types (#1886), which are also meant to support the database use case and haven't been mentioned here yet. Is there still a use case for records as a separate feature?
OTOH, further library support for structural types might be needed — see https://github.com/lampepfl/dotty/issues/1886#issuecomment-417354150.
Most likely there will be, to support generic programming.
For the record, I found the following recent work about records for scala/dotty: https://www.youtube.com/watch?v=ntrSagXL200 http://www.csc.kth.se/~phaller/doc/karlsson-haller18-scala.pdf
It would be nice to have it in Dotty.
@Krever, I think the suggestions in that paper are pretty good, but the real draw of records, for me at least, is polymorphic functions and easily merging types and structures. I think the solutions in that paper would not last without the ability to merge two records, and using subtyping for row polymorphism is probably fine and dandy (I've never worked with it so I have no reason to doubt it), but I'm inclined to believe that they would need parametric polymorphism if they want to see any use outside of database access types.