sparksql-scalapb
`.as[Foo]` doesn't catch DataFrame/Dataset type differences
Spark 3.1, sparksql-scalapb_2.12:0.11.0
When used as a Dataset, ScalaPB-generated case classes don't catch type mismatches in the logical plan the way ordinary Scala case classes do.
E.g. in spark shell (regular case class):
```scala
scala> case class Foo(i: Int)
defined class Foo

scala> val badData = Seq[Long](5L)
badData: Seq[Long] = List(5)

scala> val bdf = badData.toDF("i")
bdf: org.apache.spark.sql.DataFrame = [i: bigint]

scala> bdf.as[Foo]
org.apache.spark.sql.AnalysisException: Cannot up cast `i` from bigint to int.
...
```
ScalaPB case-class-based encoders don't blow up on `.as`, only when the query runs (e.g. `bdf.as[Foo].head`).
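For contrast, a minimal sketch of the ScalaPB side. This is hypothetical: it assumes a generated message `Foo` compiled from `message Foo { int32 i = 1; }` and this library's `scalapb.spark.Implicits._` encoders.

```scala
import org.apache.spark.sql.SparkSession
import scalapb.spark.Implicits._ // ScalaPB encoders; spark.implicits._ deliberately not imported

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Same incompatible frame as above: column `i` is a bigint, not an int.
val bdf = spark.createDataFrame(Seq(Tuple1(5L))).toDF("i")

val ds = bdf.as[Foo] // no AnalysisException here, unlike the plain case class
ds.head              // the bigint/int mismatch only surfaces at execution time
```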
~(I would post my ScalaPB example, but I'm running into the Spark vs. ScalaPB protobuf version difference, which I usually shade away but which isn't working right now.)~ Ah, this is caused by spark-shell automatically importing the Spark implicits.
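For reference, the shading I normally apply is an sbt-assembly rule along the lines of what the sparksql-scalapb docs suggest (a sketch; the `shadeproto` prefix is just a conventional choice):

```scala
// build.sbt: relocate the protobuf-java classes bundled in the assembly
// so they don't clash with the older protobuf version on Spark's classpath.
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "shadeproto.@1").inAll
)
```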
I forked the test repo and can reproduce the issue:
- 0.11.0 displays the behavior described in this ticket: `val ds: Dataset[Small] = df.as[Small]` does not throw an exception despite the DataFrame being incompatible with the Dataset.
- 1.0.1 does throw an exception, but it's a `java.lang.NoSuchMethodError` (stack trace).
Re: `java.lang.NoSuchMethodError`: that was my mistake of compiling for Spark 3.2 and running on Spark 3.1 :facepalm:
It seems like mismatches between one shape of data and another are caught. E.g. `(x: Int)` -> `(x: Int, y: Int)` emits `org.apache.spark.sql.AnalysisException: cannot resolve 'y' given input columns`.
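A sketch of that shape-mismatch case in spark-shell (`Wide` is a hypothetical target, not from the test repo):

```scala
scala> case class Wide(x: Int, y: Int)   // target has a column the frame lacks
defined class Wide

scala> Seq(1, 2, 3).toDF("x").as[Wide]   // fails eagerly, at .as time
org.apache.spark.sql.AnalysisException: cannot resolve 'y' given input columns: [x];
...
```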