[SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value
What changes were proposed in this pull request?
This modifies `GenerateUnsafeProjection` to wrap projections of non-null fields in try-catch blocks which swallow any `NullPointerException` that is thrown, and instead throw a helpful error message indicating that a null value was encountered where the schema indicated non-null. This new error is added in `QueryExecutionErrors`.
Small modifications are made to a few methods in `GenerateUnsafeProjection` to allow for passing down the path to the projection in question, to give the user a helpful indication of what they need to change. Getting the name of the top-level column seems to require substantial changes since the name is thrown away when the `BoundReference` is created (in favor of an ordinal), so for the top level only an ordinal is given; for nested fields the name is provided. An example error message looks like:
```
java.lang.RuntimeException: The value at <POS_0>.`middle`.`bottom` cannot be null, but a NULL was found. This is typically caused by the presence of a NULL value when the schema indicates the value should be non-null. Check that the input data matches the schema and/or that UDFs which can return null have a nullable return schema.
```
Why are the changes needed?
This is needed to help users decipher the error message; currently a `NullPointerException` without any message is thrown, which provides the user no guidance on what they've done wrong, and typically leads them to believe there is a bug in Spark. See the Jira for a specific example of how this behavior can be triggered and what the exception looks like currently.
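For a rough sense of the failure mode, here is an illustrative sketch of mine (not the Jira repro), assuming a spark-shell session: a UDF declared non-nullable via `asNonNullable()` that nonetheless returns null can, depending on the plan, surface as a bare `NullPointerException` from the generated projection.

```scala
// Hedged illustration (not the Jira example), assuming a spark-shell session:
// a UDF lies about its nullability via asNonNullable() but still returns null.
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

val liesAboutNulls = udf((s: String) => if (s == "b") null else s).asNonNullable()
val df = Seq("a", "b").toDF("s").select(liesAboutNulls(col("s")).as("t"))
// Prior to this change, this may fail with a message-less NullPointerException
// from the generated UnsafeProjection code.
df.show()
```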
Does this PR introduce any user-facing change?
Yes, in the case that a user has a data-schema mismatch, they will now get a much more helpful error message. In other cases, no change.
How was this patch tested?
See tests in `DataFrameSuite` and `GeneratedProjectionSuite`.
cc @shardulm94 (thanks for the push towards constructing the path into a single message instead of using chained exceptions; the additional code changes were minimal and the resulting message is much nicer) and a few folks who have been working on error reporting: @MaxGekk @karenfeng @gengliangwang
Pushed up new commits rebasing on latest changes.
cc @cloud-fan @dongjoon-hyun in case either of you are interested.
Ping again to @MaxGekk @karenfeng @gengliangwang -- any thoughts?
+CC @ueshin, as you reviewed the surrounding code before. Also, CC @HyukjinKwon.
Looks like it makes sense to me ... @MaxGekk @gengliangwang or @cloud-fan
Spark trusts data nullability in many places (expressions, projection generators, optimizer rules, etc.). It's a lot of effort to improve error messages in all of these places when the data does not match the declared nullability. We'd better pick a clear scope here.
AFAIK the common sources of mismatch are data sources and UDFs. We can focus on these two cases only.
For data sources, we can add a Filter node above the data source relation to apply the null check, using the existing `AssertNotNull` expression. For UDFs, we can wrap the UDF expression with `AssertNotNull` to do the null check as well.
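For concreteness, here is a minimal sketch of mine (not from this PR) of what wrapping a nullable column with `AssertNotNull` could look like, assuming a spark-shell session; `AssertNotNull` and the `Column(expr)` constructor are internal/developer APIs, so treat this as illustrative only.

```scala
// Illustrative sketch, assuming a spark-shell session: guard a nullable column
// with the existing AssertNotNull catalyst expression.
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
import spark.implicits._

val df = Seq(Some(1), None).toDF("i")   // "i" is a nullable int column
val guarded = df.select(new Column(AssertNotNull(df("i").expr)).as("i"))
// Should fail at runtime with a descriptive null-check error rather than a bare NPE.
guarded.show()
```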
Thanks for the suggestion @cloud-fan! Good point about there being many places where Spark trusts nullability. Here I am trying to target places where user code could introduce a null. This includes data sources (including in-memory dataset creation like `sparkContext.parallelize()` or `DatasetHolder.toDS()`) and UDFs (including lambdas in calls to DF/DS APIs). User code is inherently untrustworthy, so more checks are warranted. I think any of these places where user code supplies values would be covered by this PR since they all go through projection before being accessed by other codepaths, but LMK if you disagree. I guess one situation could be if the optimizer completely removes some operator because of the nullability, so the data is never even accessed--are you thinking about situations like this?
Unfortunately we cannot use `AssertNotNull` for this; the optimizer will remove it if the input schema indicates the value is non-null:
https://github.com/apache/spark/blob/96831bbb6749910c8f9497c048ba05e6e317649f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L826
And that optimizer rule is there for a good reason; there is nontrivial overhead associated with the null check, as discussed further in my other comment.
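To see the removal described above in action (again just a sketch of mine, assuming a spark-shell session), wrap a column that is already non-nullable and inspect the optimized plan; the assertion should be gone, which is exactly why it cannot serve as a guard for untrusted data.

```scala
// Complementary sketch: AssertNotNull over a non-nullable column should be
// optimized away, so no runtime check ever runs for data the schema claims
// is non-null.
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
import spark.implicits._

val nonNullable = Seq(1, 2, 3).toDF("i")   // primitive Int => non-nullable column
val wrapped = nonNullable.select(new Column(AssertNotNull(nonNullable("i").expr)).as("i"))
// Per the rule linked above, the optimized plan should contain no assertnotnull.
wrapped.queryExecution.optimizedPlan
```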
I made `SparkSession.internalCreateDataFrame` public to easily test the nullability mismatch bug. You can also use a data source to reproduce it.
```scala
val rdd = sc.makeRDD(Seq(InternalRow(null)))
val df = spark.internalCreateDataFrame(rdd, new StructType().add("i", "int", false))
df.show
```
```
+---+
|  i|
+---+
|  0|
+---+
```
If you use the public APIs like `SparkSession.createDataFrame`, they use `RowEncoder`, which does a runtime null check via the `GetExternalRowField` expression. An example:
```scala
val rdd = sc.makeRDD(Seq(Row(null)))
val df = spark.createDataFrame(rdd, new StructType().add("i", "int", false))
df.show
```
```
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: The 0th field 'i' of input row cannot be null.
```
I think this is the right direction to go: for untrusted data, add extra validation even if it has perf overhead. If perf is very critical and we don't want any overhead, we can add a flag to skip this check and trust the data. try-catch NPE seems a bit hacky and can't cover all the cases.
Thanks for your comments @cloud-fan. If it were possible to handle all data types using the exception-catching approach, I would still advocate for it. But you raised a good point that it's not possible to handle primitives in this manner, and this is actually an even more severe issue, because it becomes a correctness issue (NULL silently replaced with 0) instead of just a confusing error message.
My initial next attempt, following your suggestion regarding `AssertNotNull`, was going to be along these lines (a rough sketch of the proposed expression follows after this list):

- Create a new unary expression `UntrustedNullGuard`. Behaviorally, it is identical to `AssertNotNull`, except for maybe a different error message.
- In contrast to `AssertNotNull`, which gets optimized away in cases of non-nullable input, `UntrustedNullGuard` gets optimized away in cases of nullable input.
- We look for untrusted sources (input data sources, UDFs, etc.) and wrap them in `UntrustedNullGuard` without concern for whether the data source returns a null or non-null type--if it's nullable, the check will be optimized away, and if it's not, then we get the guard as expected.
- We can have a flag like `spark.sql.optimizer.guardUntrustedNullability` which defaults to true, but can be set to false by those wishing to indicate to Spark that they really do trust their data sources and don't want the runtime checks added. (The optimizer rule won't apply in this case.)
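As a rough sketch only (this expression does not exist in Spark; the name, the error wording, and the `withNewChildInternal` override for recent Spark versions are all my assumptions), the proposed guard could look something like this:

```scala
// Hypothetical expression sketched from the bullet points above; not part of this PR.
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.DataType

case class UntrustedNullGuard(child: Expression) extends UnaryExpression with CodegenFallback {
  override def dataType: DataType = child.dataType

  // The guard itself never produces null; a companion optimizer rule (not shown)
  // would drop it when the child is declared nullable, per the list above.
  override def nullable: Boolean = false

  override def eval(input: InternalRow): Any = {
    val result = child.eval(input)
    if (result == null) {
      throw new RuntimeException(
        "NULL value found in a source that declared a non-null schema; " +
          "check the input data or the UDF's return schema.")
    }
    result
  }

  override protected def withNewChildInternal(newChild: Expression): UntrustedNullGuard =
    copy(child = newChild)
}
```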
However this approach has some drawbacks. The one I found to be a fatal flaw is that there are situations where the operator itself does some unpacking of the result value and will already replace a NULL value with a default, e.g. here: https://github.com/apache/spark/blob/3f97cd620a61bc2685bf35215185365e63bedc0a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala#L1176-L1180
I don't think there's any way to catch this situation except to modify the codegen for the operators themselves. I've put up an alternate approach demonstrating this in PR #38660; please take a look.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!