TransmogrifAI
Dataframe Encoders for TransmogrifAI types
Problem
Currently TransmogrifAI implements a number of custom functions to encode/decode TransmogrifAI types to/from Spark DataFrame native types (see FeatureSparkTypes, FeatureTypeSparkConverter and FeatureTypeFactory). This approach requires applying converters every time values are encoded to or decoded from a Spark DataFrame.
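For illustration, a minimal round trip using the public feature-types API (a sketch; Real, toReal and value come from com.salesforce.op.features.types) shows the per-value boxing involved today:

import com.salesforce.op.features.types._

// Reading boxes each raw Spark value into a TransmogrifAI wrapper...
val real: Real = 1.0.toReal // Double => Real (wraps an Option[Double])
// ...and writing unboxes it again, value by value, row by row.
val raw: Option[Double] = real.value // Real => Option[Double]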
Solution
We need a proper implementation of org.apache.spark.sql.Encoder to handle TransmogrifAI types efficiently.
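As a rough sketch of the target shape (my assumption, not the proposed design), a generic kryo-backed Encoder compiles and lets Datasets hold TransmogrifAI types, but it stores each value as an opaque binary column rather than the efficient native layout this issue asks for:

import scala.reflect.ClassTag
import org.apache.spark.sql.{Encoder, Encoders}
import com.salesforce.op.features.types.FeatureType

// Placeholder only: kryo serializes every value into one binary column,
// so Spark cannot prune, push down or optimize on the underlying primitives.
implicit def featureTypeEncoder[T <: FeatureType : ClassTag]: Encoder[T] = Encoders.kryo[T]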
Alternatives
N/A
Additional context
Ideally we should also avoid boxing/unboxing into TransmogrifAI types, but this would require major refactoring. This is up for discussion.
A Spark DataFrame is a Dataset of Row, and an Encoder for Row already exists, so you don't need to define a new Encoder or Decoder for Row.
@liuzhenhai93 yeah, DataFrame encoding is currently working. We would like to have support for the following:
implicit val enc: Encoder[(Real, Text)] = ???
val reals: Dataset[(Real, Text)] = spark.createDataset(Seq(1.0.toReal -> "one".toText))
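Until a native Encoder exists, one hedged stop-gap (assuming kryo's opaque-binary trade-off is acceptable) is to compose encoders explicitly:

import org.apache.spark.sql.{Dataset, Encoder, Encoders}
import com.salesforce.op.features.types._

// assumes an existing SparkSession named spark
implicit val enc: Encoder[(Real, Text)] = Encoders.tuple(Encoders.kryo[Real], Encoders.kryo[Text])
val reals: Dataset[(Real, Text)] = spark.createDataset(Seq(1.0.toReal -> "one".toText))

Each side of the tuple is then stored as a single binary column, which trades away exactly the efficiency this issue is about.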
@tovbinm you can try something like this in Scala:

import spark.implicits._

case class Wrap[T](unwrap: T)

Then whenever you want to use a custom type, put it inside Wrap, like this:

val dataset = spark.createDataset(Seq(Wrap((2.0, "hello"))))
I don't believe this would work (I will check). Ideally I would like to avoid allocating yet another wrapper class, since we already do so (FeatureType is a wrapper around Option, Seq, Map, etc.).
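For context, a sketch of why the wrapper likely falls short (my reading of Spark's product-encoder derivation, not verified here): Wrap only helps when its field types are already encodable, and TransmogrifAI types are not:

import spark.implicits._
import com.salesforce.op.features.types._

case class Wrap[T](unwrap: T)

// (Double, String) fields are Spark-native, so derivation succeeds:
spark.createDataset(Seq(Wrap((2.0, "hello"))))
// Real is not a type Spark's product encoder knows about, so this likely
// fails with a "No Encoder found" error when the encoder is created:
// spark.createDataset(Seq(Wrap(1.0.toReal)))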