
`.as[Foo]` doesn't catch DataFrame-to-Dataset type differences

Open dnmfarrell opened this issue 3 years ago • 3 comments

Spark 3.1, sparksql-scalapb_2.12:0.11.0

When used as a Dataset, ScalaPB-generated case classes don't catch type mismatches in the logical plan the way ordinary Scala case classes do.

E.g. in spark-shell (regular case class):

```scala
scala> case class Foo(i: Int)
defined class Foo

scala> val badData = Seq[(Long)]((5L))
badData: Seq[Long] = List(5)

scala> val bdf = badData.toDF("i")
bdf: org.apache.spark.sql.DataFrame = [i: bigint]

scala> bdf.as[Foo]
org.apache.spark.sql.AnalysisException: Cannot up cast `i` from bigint to int.
...
```

ScalaPB case-class-based encoders don't blow up on `.as`; the failure only surfaces when the query actually runs (e.g. `bdf.as[Foo].head`).
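For concreteness, a minimal sketch of the ScalaPB side of the comparison (not from the original report: it assumes a hypothetical generated message `Foo` with an `int32 i` field, and that `scalapb.spark.Implicits._` is imported in place of `spark.implicits._`, as sparksql-scalapb expects):

```scala
// Sketch only; requires a Spark session and a ScalaPB-generated Foo(i: Int).
import scalapb.spark.Implicits._   // ScalaPB encoders instead of spark.implicits._

val bdf = Seq(5L).toDF("i")        // column i is bigint, Foo.i is int
val ds  = bdf.as[Foo]              // no AnalysisException here, unlike plain case classes
ds.head                            // failure only surfaces at execution time
```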

— dnmfarrell, Sep 09 '22 20:09

~~(I would post my ScalaPB example but I'm running into the Spark vs. ScalaPB protobuf version conflict, which I usually shade away but isn't working right now.)~~ Ah, this was caused by spark-shell automatically importing the Spark implicits.

— dnmfarrell, Sep 09 '22 20:09

I forked the test repo and can reproduce the issue:

  • 0.11.0 displays the behavior described in this ticket: `val ds: Dataset[Small] = df.as[Small]` does not throw an exception, despite the DataFrame being incompatible with the Dataset.
  • 1.0.1 does throw an exception, but it's a `java.lang.NoSuchMethodError` - stack trace

— dnmfarrell, Sep 12 '22 02:09

Re: `java.lang.NoSuchMethodError` - that was my mistake: I compiled for Spark 3.2 and ran on Spark 3.1 :facepalm:

It seems casts from one shape of data to another are caught, e.g. `(x: Int)` -> `(x: Int, y: Int)` emits `org.apache.spark.sql.AnalysisException: cannot resolve 'y' given input columns`.
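A minimal sketch of that shape-mismatch case (the case class `Wide` and the column names here are illustrative assumptions, not from the original comment):

```scala
// Sketch only; requires a Spark session with spark.implicits._ in scope.
case class Wide(x: Int, y: Int)

val df = Seq(1).toDF("x")   // DataFrame has only column x
df.as[Wide]                 // org.apache.spark.sql.AnalysisException:
                            // cannot resolve 'y' given input columns
```

Missing columns are detected at analysis time even with ScalaPB encoders; it's only the narrowing cast (bigint to int) that slips through to execution.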

— dnmfarrell, Oct 14 '22 14:10