frameless Named intermediate Datasets

Open OlivierBlanvillain opened this issue 6 years ago • 4 comments

case class Foo(bar: Int, baz: String, bal: Boolean)
val ds: TypedDataset[Foo]

This is nicely named and typed! However, after a select, the names are completely lost:

val ds1: TypedDataset[Tuple2[Int, String]] = ds.select(ds('bar), ds('baz))

The best one can do with the current API is to define a new case class for the intermediate representation, and use .as[] to get a ds1 with useful column names.

Some ideas to work around this issue:

Use a macro to generate a case class "on the fly", something like this:

ds.selectNamed(ds('bar), ds('baz))
// Expands to
case class FooBarBaz(bar: Int, baz: String)
ds.select(ds('bar), ds('baz)).as[FooBarBaz]

Instead of TupleN, type the resulting Dataset with a shapeless record and update.

Sep 23 '17 08:09 OlivierBlanvillain

Thanks @OlivierBlanvillain, could you provide some historical perspective on the second approach? It seems similar to the old TypedDataFrame description in the docs but I am not aware of how Spark behaved back then.

Sep 23 '17 08:09 iravid

I think we don't need anything special from Spark here. If instead of a Tuple2 we use something like the following:

R[Int, "bar"] :: R[String, "baz"] :: HNil

Then it's possible to write a type function from "baz" to 2, and use it to compute ds('_2) from ds("baz"), which means we don't even need to change anything in our runtime. Of course, it's also possible to keep the original column names instead of renaming them, for one less indirection, but we would still need to synthesise new names for the results of operations: ds.select(ds('a) + ds('b)): TypedDataset[???].
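
A minimal sketch of such a type function, assuming Scala 2.13 literal types and a hand-rolled HList instead of shapeless so that it compiles on its own. The names R, IndexOf, tupleColumn and Demo are hypothetical and exist only for this illustration; none of them is part of frameless.

```scala
// Hypothetical stand-in for the R[Type, "name"] field used above.
final class R[A, K <: String]

// Minimal heterogeneous list, so the sketch needs no shapeless dependency.
sealed trait HList
final class ::[H, T <: HList] extends HList
sealed trait HNil extends HList

// Type class mapping a column name K to its 1-based position in the list L.
trait IndexOf[L <: HList, K <: String] { def value: Int }

trait LowPriorityIndexOf {
  // K is not the name of the head field: search the tail and add one.
  implicit def tail[H, T <: HList, K <: String](
      implicit rest: IndexOf[T, K]): IndexOf[H :: T, K] =
    new IndexOf[H :: T, K] { val value: Int = rest.value + 1 }
}

object IndexOf extends LowPriorityIndexOf {
  // K is the name of the head field: position 1.
  implicit def head[A, K <: String, T <: HList]: IndexOf[R[A, K] :: T, K] =
    new IndexOf[R[A, K] :: T, K] { val value: Int = 1 }

  // Name of the tuple column that backs the named column K.
  def tupleColumn[L <: HList, K <: String](implicit i: IndexOf[L, K]): String =
    s"_${i.value}"
}

object Demo {
  type Row = R[Int, "bar"] :: R[String, "baz"] :: HNil

  // Resolved at compile time; asking for a name that is not in Row
  // fails with a missing-implicit error instead of a runtime error.
  val barColumn: String = IndexOf.tupleColumn[Row, "bar"]
  val bazColumn: String = IndexOf.tupleColumn[Row, "baz"]
}
```

With something along these lines, a call such as ds("baz") could be rewritten to ds('_2) purely at the type level, leaving the underlying tuple-based Dataset untouched. A shapeless-based version would presumably reuse shapeless's own HList and record machinery rather than redefine them.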

Sep 23 '17 08:09 OlivierBlanvillain

Certainly. I agree that this should not change anything in how we interact with Spark.

Reading again the docs for TypedDataFrame, I see it is unrelated so I retract my previous comment.

In any case, I am in favour of using shapeless records for this as it looks like a far more maintainable approach for the project, albeit one sacrificing IDE-friendliness and compile times. Both of those might very well change in the future, though.

Sep 23 '17 08:09 iravid

Regarding naming, I suggest, for simplicity, avoiding fresh name synthesis and always using tuple column names unless the user has specifically .alias'd a projection's column.

Sep 23 '17 08:09 iravid
