[SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value
What changes were proposed in this pull request?
This modifies `GenerateUnsafeProjection` to wrap projections of non-null fields in try-catch blocks which swallow any `NullPointerException` that is thrown, and instead throw a helpful error message indicating that a null value was encountered where the schema indicated non-null. This new error is added in `QueryExecutionErrors`.
Small modifications are made to a few methods in `GenerateUnsafeProjection` to allow for passing down the path to the projection in question, to give the user a helpful indication of what they need to change. Getting the name of the top-level column seems to require substantial changes since the name is thrown away when the `BoundReference` is created (in favor of an ordinal), so for the top level only an ordinal is given; for nested fields the name is provided. An example error message looks like:
```
java.lang.RuntimeException: The value at <POS_0>.`middle`.`bottom` cannot be null, but a NULL was found. This is typically caused by the presence of a NULL value when the schema indicates the value should be non-null. Check that the input data matches the schema and/or that UDFs which can return null have a nullable return schema.
```
Why are the changes needed?
This is needed to help users decipher the error message; currently a `NullPointerException` without any message is thrown, which provides the user no guidance on what they've done wrong, and typically leads them to believe there is a bug in Spark. See the Jira for a specific example of how this behavior can be triggered and what the exception looks like currently.
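For a rough sense of the failure mode, here is an illustrative sketch of mine (not the Jira repro), assuming a spark-shell session: a UDF declared non-nullable via `asNonNullable()` that nonetheless returns null can, depending on the plan, surface as a bare `NullPointerException` from the generated projection.

```scala
// Hedged illustration (not the Jira example), assuming a spark-shell session:
// a UDF lies about its nullability via asNonNullable() but still returns null.
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

val liesAboutNulls = udf((s: String) => if (s == "b") null else s).asNonNullable()
val df = Seq("a", "b").toDF("s").select(liesAboutNulls(col("s")).as("t"))
// Prior to this change, this may fail with a message-less NullPointerException
// from the generated UnsafeProjection code.
df.show()
```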
Does this PR introduce any user-facing change?
Yes, in the case that a user has a data-schema mismatch, they will now get a much more helpful error message. In other cases, no change.
How was this patch tested?
See tests in `DataFrameSuite` and `GeneratedProjectionSuite`.
cc @shardulm94 (thanks for the push towards constructing the path into a single message instead of using chained exceptions; the additional code changes were minimal and the resulting message is much nicer) and a few folks who have been working on error reporting: @MaxGekk @karenfeng @gengliangwang
Pushed up new commits rebasing on latest changes.
cc @cloud-fan @dongjoon-hyun in case either of you are interested.
Ping again to @MaxGekk @karenfeng @gengliangwang -- any thoughts?
+CC @ueshin, as you reviewed the surrounding code before. Also, CC @HyukjinKwon.
Looks like it makes sense to me ... @MaxGekk @gengliangwang or @cloud-fan
Spark trusts data nullability in many places (expressions, projection generators, optimizer rules, etc.). It's a lot of effort to improve error messages in all of these places when the data does not match the declared nullability. We'd better pick a clear scope here.
AFAIK the common sources of mismatch are data sources and UDFs. We can focus on these two cases only.
For data sources, we can add a Filter node above the data source relation to apply the null check, using the existing `AssertNotNull` expression. For UDFs, we can wrap the UDF expression with `AssertNotNull` to do the null check as well.
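For concreteness, here is a minimal sketch of mine (not from this PR) of what wrapping a nullable column with `AssertNotNull` could look like, assuming a spark-shell session; `AssertNotNull` and the `Column(expr)` constructor are internal/developer APIs, so treat this as illustrative only.

```scala
// Illustrative sketch, assuming a spark-shell session: guard a nullable column
// with the existing AssertNotNull catalyst expression.
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
import spark.implicits._

val df = Seq(Some(1), None).toDF("i")   // "i" is a nullable int column
val guarded = df.select(new Column(AssertNotNull(df("i").expr)).as("i"))
// Should fail at runtime with a descriptive null-check error rather than a bare NPE.
guarded.show()
```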
Thanks for the suggestion @cloud-fan! Good point about there being many places where Spark trusts nullability. Here I am trying to target places where user code could introduce a null. This includes data sources (including in-memory dataset creation like `sparkContext.parallelize()` or `DatasetHolder.toDS()`) and UDFs (including lambdas in calls to DF/DS APIs). User code is inherently untrustworthy, so more checks are warranted. I think any of these places where user code supplies values would be covered by this PR since they all go through projection before being accessed by other codepaths, but LMK if you disagree. I guess one situation could be if the optimizer completely removes some operator because of the nullability, so the data is never even accessed--are you thinking about situations like this?
Unfortunately we cannot use `AssertNotNull` for this; the optimizer will remove it if the input schema indicates the value is non-null:
https://github.com/apache/spark/blob/96831bbb6749910c8f9497c048ba05e6e317649f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L826
And that optimizer rule is there for a good reason; there is nontrivial overhead associated with the null check, as discussed further in my other comment.
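To see the removal described above in action (again just a sketch of mine, assuming a spark-shell session), wrap a column that is already non-nullable and inspect the optimized plan; the assertion should be gone, which is exactly why it cannot serve as a guard for untrusted data.

```scala
// Complementary sketch: AssertNotNull over a non-nullable column should be
// optimized away, so no runtime check ever runs for data the schema claims
// is non-null.
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
import spark.implicits._

val nonNullable = Seq(1, 2, 3).toDF("i")   // primitive Int => non-nullable column
val wrapped = nonNullable.select(new Column(AssertNotNull(nonNullable("i").expr)).as("i"))
// Per the rule linked above, the optimized plan should contain no assertnotnull.
wrapped.queryExecution.optimizedPlan
```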
I made `SparkSession.internalCreateDataFrame` public to easily test the nullability mismatch bug. You can also use a data source to reproduce it.
```scala
val rdd = sc.makeRDD(Seq(InternalRow(null)))
val df = spark.internalCreateDataFrame(rdd, new StructType().add("i", "int", false))
df.show
```
```
+---+
|  i|
+---+
|  0|
+---+
```
If you use the public APIs like `SparkSession.createDataFrame`, they use `RowEncoder`, which does a runtime null check via the `GetExternalRowField` expression. An example:
```scala
val rdd = sc.makeRDD(Seq(Row(null)))
val df = spark.createDataFrame(rdd, new StructType().add("i", "int", false))
df.show
```
```
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: The 0th field 'i' of input row cannot be null.
```
I think this is the right direction to go: for untrusted data, add extra validation even if it has perf overhead. If perf is very critical and we don't want any overhead, we can add a flag to skip this check and trust the data. try-catch NPE seems a bit hacky and can't cover all the cases.
Thanks for your comments @cloud-fan. If it were possible to handle all data types using the exception-catching approach, I would still advocate for it. But you raised a good point that it's not possible to handle primitives in this manner, and this is actually an even more severe issue, because it becomes a correctness issue (NULL silently replaced with 0) instead of just a confusing error message.
My initial next attempt, following your suggestion regarding `AssertNotNull`, was going to be along these lines (a rough sketch of the proposed expression follows after this list):

- Create a new unary expression `UntrustedNullGuard`. Behaviorally, it is identical to `AssertNotNull`, except for maybe a different error message.
- In contrast to `AssertNotNull`, which gets optimized away in cases of non-nullable input, `UntrustedNullGuard` gets optimized away in cases of nullable input.
- We look for untrusted sources (input data sources, UDFs, etc.) and wrap them in `UntrustedNullGuard` without concern for whether the data source returns a null or non-null type--if it's nullable, the check will be optimized away, and if it's not, then we get the guard as expected.
- We can have a flag like `spark.sql.optimizer.guardUntrustedNullability` which defaults to true, but can be set to false by those wishing to indicate to Spark that they really do trust their data sources and don't want the runtime checks added. (The optimizer rule won't apply in this case.)
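As a rough sketch only (this expression does not exist in Spark; the name, the error wording, and the `withNewChildInternal` override for recent Spark versions are all my assumptions), the proposed guard could look something like this:

```scala
// Hypothetical expression sketched from the bullet points above; not part of this PR.
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.DataType

case class UntrustedNullGuard(child: Expression) extends UnaryExpression with CodegenFallback {
  override def dataType: DataType = child.dataType

  // The guard itself never produces null; a companion optimizer rule (not shown)
  // would drop it when the child is declared nullable, per the list above.
  override def nullable: Boolean = false

  override def eval(input: InternalRow): Any = {
    val result = child.eval(input)
    if (result == null) {
      throw new RuntimeException(
        "NULL value found in a source that declared a non-null schema; " +
          "check the input data or the UDF's return schema.")
    }
    result
  }

  override protected def withNewChildInternal(newChild: Expression): UntrustedNullGuard =
    copy(child = newChild)
}
```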
However this approach has some drawbacks. The one I found to be a fatal flaw is that there are situations where the operator itself does some unpacking of the result value and will already replace a NULL value with a default, e.g. here: https://github.com/apache/spark/blob/3f97cd620a61bc2685bf35215185365e63bedc0a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala#L1176-L1180
I don't think there's any way to catch this situation except to modify the codegen for the operators themselves. I've put up an alternate approach demonstrating this in PR #38660; please take a look.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!