
Dataframe Encoders for TransmogrifAI types

tovbinm opened this issue Aug 17, 2018 · 4 comments

Problem: Currently TransmogrifAI implements a bunch of custom functions to encode/decode TransmogrifAI types to/from Spark dataframe native types (see FeatureSparkTypes, FeatureTypeSparkConverter and FeatureTypeFactory). This approach requires applying converters every time values are encoded/decoded to/from a Spark dataframe.

Solution: We need a proper implementation of org.apache.spark.sql.Encoder to handle TransmogrifAI types efficiently.
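For comparison, a stopgap is already possible with Spark's generic Kryo encoder (a minimal sketch, assuming Real from com.salesforce.op.features.types; this is not the efficient, Catalyst-native encoder the issue asks for):

import org.apache.spark.sql.{Encoder, Encoders}
import com.salesforce.op.features.types.Real

// Kryo gives us *an* Encoder today, but it serializes each value into an
// opaque binary column and bypasses Catalyst entirely, which is the opposite
// of the efficient, schema-aware encoder this issue asks for.
implicit val realEnc: Encoder[Real] = Encoders.kryo[Real]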

Alternatives: N/A

Additional context: Ideally we should also avoid boxing/unboxing into TransmogrifAI types, but this would require a major refactoring. This is up for discussion.

tovbinm avatar Aug 17 '18 05:08 tovbinm

A Spark dataframe is a Dataset of Row, and an Encoder for Row already exists, so you don't need to define a new Encoder or Decoder for Row.
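To illustrate the point (a sketch, assuming Spark 2.x, where RowEncoder lives in the internal catalyst.encoders package, and a SparkSession named spark in scope):

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// A DataFrame is just Dataset[Row]; Spark already ships a Row encoder
// driven by an explicit schema.
val schema = StructType(Seq(StructField("real", DoubleType), StructField("text", StringType)))
implicit val rowEnc = RowEncoder(schema)
val rows = spark.createDataset(Seq(Row(1.0, "one")))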

liuzhenhai93 avatar Sep 06 '18 03:09 liuzhenhai93

@liuzhenhai93 yeah, Dataframe encoding currently works. We would like to have support for the following:

implicit val enc: Encoder[(Real, Text)] = ???
val reals: Dataset[(Real, Text)] = spark.createDataset(Seq(1.0.toReal -> "one".toText))
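For reference, the snippet above assumes a SparkSession named spark plus the TransmogrifAI syntax enrichments; roughly (a sketch, assuming the standard import that provides .toReal and .toText):

import org.apache.spark.sql.{Dataset, Encoder, SparkSession}
import com.salesforce.op.features.types._  // provides the .toReal / .toText syntax

val spark = SparkSession.builder().master("local[*]").appName("encoders").getOrCreate()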

tovbinm avatar Oct 26 '18 16:10 tovbinm

@tovbinm you can try it like this in Scala:

import spark.implicits._
case class Wrap[T](unwrap: T)

Then whenever you want to use a custom type, put it inside Wrap like this:

val dataFrame = spark.createDataset(Seq(Wrap((2.0, "hello"))))

gsoni22 avatar Mar 05 '19 11:03 gsoni22

I don't believe this would work (I will check). Ideally I would like to avoid allocating another wrapper class, since we already do so (FeatureType is a wrapper around Option, Seq, Map, etc.).
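A quick way to probe that doubt (a sketch; Wrap is the case class from the previous comment, and the failing line is an expectation based on Catalyst only deriving Product encoders for case classes and other known types, not for FeatureType wrappers like Real):

import spark.implicits._
import com.salesforce.op.features.types._

case class Wrap[T](unwrap: T)

// Works: a tuple of Catalyst-supported primitives inside a case class.
val ok = spark.createDataset(Seq(Wrap((2.0, "hello"))))

// Expected to fail at compile time: Catalyst cannot derive Encoder[Wrap[Real]]
// because Real is neither a case class nor an otherwise supported type -- hence this issue.
// val ko = spark.createDataset(Seq(Wrap(1.0.toReal)))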

tovbinm avatar Mar 06 '19 15:03 tovbinm