
Named intermediate Datasets

Open OlivierBlanvillain opened this issue 6 years ago • 4 comments

case class Foo(bar: Int, baz: String, bal: Boolean)
val ds: TypedDataset[Foo]

This is nicely named and typed! However, after a select, the names are completely lost:

val ds1: TypedDataset[Tuple2[Int, String]] = ds.select(ds('bar), ds('baz))

The best one can do with the current API is to define a new case class for the intermediate representation, and use .as[] to get a ds1 with useful column names.
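As a rough analogy, the same pattern can be shown on plain Scala collections. This is only a sketch: it uses a Seq in place of a TypedDataset so that it runs without Spark or frameless, and BarBaz and the sample rows are hypothetical names and data made up for illustration.

```scala
object IntermediateCaseClassSketch {
  final case class Foo(bar: Int, baz: String, bal: Boolean)

  // Hypothetical intermediate case class, written by hand for the projection.
  final case class BarBaz(bar: Int, baz: String)

  // Stand-in for a TypedDataset[Foo]; a plain Seq keeps the sketch dependency-free.
  val rows: Seq[Foo] = Seq(Foo(1, "a", true), Foo(2, "b", false))

  // A positional projection: the field names are gone, only _1 and _2 remain.
  val projected: Seq[(Int, String)] = rows.map(foo => (foo.bar, foo.baz))

  // Re-attaching names by mapping each tuple onto the intermediate case class.
  val named: Seq[BarBaz] = projected.map { case (bar, baz) => BarBaz(bar, baz) }
}
```

The cost is visible even at this size: every distinct projection needs its own hand-written case class.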

Some ideas to work around this issue:

  • Use a macro to generate a case classes "on the fly", something like this:

    ds.selectNamed(ds('bar), ds('baz))
    // Expands to
    case class FooBarBaz(bar: Int, baz: String)
    ds.select(ds('bar), ds('baz)).as[FooBarBaz]
    
  • Instead of TupleN, type the resulting Dataset with a shapeless record and update.

OlivierBlanvillain, Sep 23 '17 08:09

Thanks @OlivierBlanvillain, could you provide some historical perspective on the second approach? It seems similar to the old TypedDataFrame description in the docs but I am not aware of how Spark behaved back then.

iravid, Sep 23 '17 08:09

I think we don't need anything special from Spark here. If instead of a Tuple2 we use something like the following:

R[Int, "bar"] :: R[String, "baz"] :: HNil

Then it's possible to write a type function from "baz" to 2, and use it to compute ds('_2) from ds("baz"), which means we don't need to change anything in our runtime. Of course, it's also possible to not rename the columns and keep the original names, for one less indirection, but we would still need to synthesise new names for the results of operations: ds.select(ds('a) + ds('b)): TypedDataset[???].
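To make the name-to-position idea concrete, here is a minimal, library-free sketch of such a type function. It assumes Scala 2.13+ (or Scala 3) for literal string types. The HList encoding, R, ColumnIndex, tupleColumn and Schema are all hypothetical names written for this illustration; none of them are frameless or shapeless API.

```scala
object NamedColumnsSketch {
  // A tiny phantom HList; it only exists at the type level in this sketch.
  sealed trait HList
  sealed trait HNil extends HList
  sealed trait ::[H, T <: HList] extends HList

  // R[A, K]: a column of type A labelled with the literal string type K.
  sealed trait R[A, K <: String with Singleton]

  // Type function: the 1-based position of the column labelled K in schema L.
  trait ColumnIndex[L <: HList, K <: String with Singleton] { def index: Int }

  trait LowPriorityColumnIndex {
    // K is not at the head: look it up in the tail and shift the position by one.
    implicit def there[H, T <: HList, K <: String with Singleton](
        implicit rest: ColumnIndex[T, K]): ColumnIndex[H :: T, K] =
      new ColumnIndex[H :: T, K] { val index: Int = rest.index + 1 }
  }

  object ColumnIndex extends LowPriorityColumnIndex {
    // K labels the head column: position 1.
    implicit def here[A, K <: String with Singleton, T <: HList]: ColumnIndex[R[A, K] :: T, K] =
      new ColumnIndex[R[A, K] :: T, K] { val index: Int = 1 }
  }

  // Resolves a column name to the underlying tuple column name at compile time.
  def tupleColumn[L <: HList, K <: String with Singleton](
      implicit ci: ColumnIndex[L, K]): String = s"_${ci.index}"

  type Schema = R[Int, "bar"] :: R[String, "baz"] :: HNil

  val bazColumn: String = tupleColumn[Schema, "baz"] // "_2"
}
```

Asking for a name that is not in the schema, for example tupleColumn[Schema, "bal"], should fail to compile because no ColumnIndex instance can be found. This sketch only recovers the position; a fuller version would also carry the column's type A in the result.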

OlivierBlanvillain, Sep 23 '17 08:09

Certainly. I agree that this should not change anything in how we interact with Spark.

Reading again the docs for TypedDataFrame, I see it is unrelated so I retract my previous comment.

In any case, I am in favour of using shapeless records for this as it looks like a far more maintainable approach for the project, albeit one sacrificing IDE-friendliness and compile times. Both of those might very well change in the future, though.

iravid, Sep 23 '17 08:09

Regarding naming, for simplicity I suggest avoiding fresh name synthesis and always using tuple column names unless the user has explicitly .alias'd a projection's column.
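A minimal sketch of that naming rule, in plain Scala. Projection and outputNames are hypothetical names used only to illustrate the policy; this is not how frameless represents columns.

```scala
object ColumnNamingSketch {
  // Hypothetical model of a projected column: an expression plus an optional user alias.
  final case class Projection(expr: String, alias: Option[String] = None)

  // Use the alias when the user supplied one, otherwise fall back to the tuple name _N.
  def outputNames(projections: Seq[Projection]): Seq[String] =
    projections.zipWithIndex.map { case (p, i) => p.alias.getOrElse(s"_${i + 1}") }

  val example: Seq[String] =
    outputNames(Seq(Projection("a + b"), Projection("c", Some("total")))) // Seq("_1", "total")
}
```

With this rule, ds.select(ds('a) + ds('b)) would simply be named _1 and no fresh name needs to be invented.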

iravid, Sep 23 '17 08:09