frameless
Named intermediate Datasets
case class Foo(bar: Int, baz: String, bal: Boolean)
val ds: TypedDataset[Foo]
This is nicely named and typed! However, after a select, the names are completely lost:
val ds1: TypedDataset[Tuple2[Int, String]] = ds.select(ds('bar), ds('baz))
The best one can do with the current API is to define a new case class for the intermediate representation, and use .as[] to get a ds1 with useful column names.
Some ideas to work around this issue:
- Use a macro to generate case classes "on the fly", something like this:
  ds.selectNamed(ds('bar), ds('baz))
  // Expands to
  case class FooBarBaz(bar: Int, baz: String)
  ds.selectNamed(ds('bar), ds('baz)).as[FooBarBaz]
- Instead of TupleN, type the resulting Dataset with a shapeless record and update.
Thanks @OlivierBlanvillain, could you provide some historical perspective on the second approach? It seems similar to the old TypedDataFrame description in the docs, but I am not aware of how Spark behaved back then.
I think we don't need anything special from Spark here. If instead of a Tuple2 we use something like the following:
R[Int, "bar"] :: R[String, "baz"] :: HNil
Then it's possible to write a type function from "baz" to 2, and use it to compute ds('_2) from ds("baz"), which means we don't even need to change anything in our runtime. Of course, it's also possible to not rename the columns and keep the original names for one less indirection, but we would still need to synthesise new names for the results of operations: ds.select(ds('a) + ds('b)): TypedDataset[???].
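For illustration, here is a minimal sketch of such a type function from a label to a tuple position. It assumes Scala 2.13 (for literal types such as "baz") and uses small stand-ins for shapeless' HList so that it compiles without dependencies. R, ColumnIndex and tupleColumn are hypothetical names invented for this sketch, not part of frameless or shapeless.

```scala
object RecordIndexSketch {
  // Stand-ins for shapeless' HList types, redefined here only to keep the sketch self-contained.
  sealed trait HList
  sealed trait HNil extends HList
  sealed trait ::[H, T <: HList] extends HList

  // R[V, K]: a column of value type V labelled by the singleton string type K (hypothetical).
  final class R[V, K <: String with Singleton]

  // Evidence that label K occurs in record L, carrying its 1-based position.
  trait ColumnIndex[L <: HList, K <: String with Singleton] {
    def index: Int
    def tupleName: String = s"_$index"
  }

  trait LowPriorityColumnIndex {
    // Fallback case: the label is not at the head, so search the tail and add one.
    implicit def inTail[H, T <: HList, K <: String with Singleton](
        implicit rest: ColumnIndex[T, K]): ColumnIndex[H :: T, K] =
      new ColumnIndex[H :: T, K] { def index: Int = rest.index + 1 }
  }

  object ColumnIndex extends LowPriorityColumnIndex {
    // Base case: the label K is at the head of the record.
    implicit def atHead[V, K <: String with Singleton, T <: HList]: ColumnIndex[R[V, K] :: T, K] =
      new ColumnIndex[R[V, K] :: T, K] { def index: Int = 1 }
  }

  // Resolves a label to the tuple-style column name at compile time.
  def tupleColumn[L <: HList, K <: String with Singleton](
      implicit ci: ColumnIndex[L, K]): String = ci.tupleName

  type Schema = R[Int, "bar"] :: R[String, "baz"] :: HNil

  val barColumn: String = tupleColumn[Schema, "bar"] // "_1"
  val bazColumn: String = tupleColumn[Schema, "baz"] // "_2"
}
```

Asking for a label that is not in the record, for example tupleColumn[Schema, "qux"], fails to compile because no ColumnIndex instance can be found. The runtime still only sees the tuple column names.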
Certainly. I agree that this should not change anything in how we interact with Spark.
Reading the docs for TypedDataFrame again, I see it is unrelated, so I retract my previous comment.
In any case, I am in favour of using shapeless records for this as it looks like a far more maintainable approach for the project, albeit one sacrificing IDE-friendliness and compile times. Both of those might very well change in the future, though.
Regarding naming, I suggest for simplicity that we avoid fresh name synthesis and always use tuple column names unless the user has specifically .alias'd a projection's column.
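A rough sketch of that naming rule, modelled in plain Scala rather than against the real frameless API. Projection and columnNames are hypothetical names used only for illustration, and the optional alias stands in for whatever an .alias call would record.

```scala
object NamingSketch {
  // A projected column; `alias` is Some(name) only when the user explicitly aliased it.
  final case class Projection(expr: String, alias: Option[String])

  // Tuple-style names (_1, _2, ...) by position, overridden only by an explicit alias.
  def columnNames(projections: Seq[Projection]): Seq[String] =
    projections.zipWithIndex.map { case (p, i) =>
      p.alias.getOrElse(s"_${i + 1}")
    }
}
```

Under this rule, the result of ds('a) + ds('b) would simply be named after its position, so no fresh name needs to be synthesised.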