spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-49893] Respect user schema nullability for file data sources when DSV2 Table is used.

Open urosstan-db opened this issue 1 year ago • 9 comments

What changes were proposed in this pull request?

DataFrameReader has 3 APIs for JSON reading

json(DataSet[String]) json(Rdd) json(filePath)

First two APIs respects provided user schema nullability when spark flag spark.sql.legacy.respectNullabilityInTextDatasetConversion is set to true, but third one does not respect and provided schema nullability is always overriden to true.

E.g. dataFrameReader.json(jsonRDD) and dataFrameReader.json(jsonDataSet) will check mentioned config, but dataFrameReader.json(path) will hit totally different code path, and it will end up in FileTable where dataSchema getter will override fields nullability to true.

Why are the changes needed?

Some users just want to have a validation of data and to get exception when some field is nullable.

Does this PR introduce any user-facing change?

When customer set newly added Spark conf spark.sql.respectUserSchemaNullabilityForFileDataSourceWithFilePath, provided user schema nullability will not be overriden to true anymore. Default value for flag is false.

How was this patch tested?

Using integration test in base JsonSuite class.

Was this patch authored or co-authored using generative AI tooling?

No

urosstan-db avatar Oct 02 '24 09:10 urosstan-db

I will add JIRA item as well in future, if we decide to go with this PR.

urosstan-db avatar Oct 02 '24 09:10 urosstan-db

@MaxGekk What do you think about this change? @gengliangwang

urosstan-db avatar Oct 02 '24 11:10 urosstan-db

cc @HyukjinKwon

MaxGekk avatar Oct 02 '24 13:10 MaxGekk

IIRC, this is a long-standing behavior to avoid unexpected nullability. Could you further explain why do we have to support non-nullable user-provided schema?

cc @cloud-fan who probably has the most context on this one

gengliangwang avatar Oct 02 '24 20:10 gengliangwang

IIRC, this is a long-standing behavior to avoid unexpected nullability. Could you further explain why do we have to support non-nullable user-provided schema?

cc @cloud-fan who probably has the most context on this one

I have to talk with users, but for now, they explicitly wanted to have nullability of provided schema, and I suppose they want exception during parsing if some field is null.

urosstan-db avatar Oct 02 '24 20:10 urosstan-db

When the initial PR merged for json and csv mentioned in the PR (https://github.com/apache/spark/pull/33436), I realised that I underestimated the problem (https://github.com/apache/spark/pull/33436#discussion_r863271189). For reading actual files, there'd be much more cases that it could break. One of the cases were streaming cases that the schema change over time etc.

HyukjinKwon avatar Oct 03 '24 23:10 HyukjinKwon

We tried to fix this few times, e.g., https://github.com/apache/spark/pull/17293 and https://github.com/apache/spark/pull/14124. The breakage was pretty severe

HyukjinKwon avatar Oct 03 '24 23:10 HyukjinKwon

We tried to fix this few times, e.g., #17293 and #14124. The breakage was pretty severe

Thanks a lot @HyukjinKwon, what can we do for user here, they want to get an error when column is null or missing

urosstan-db avatar Oct 04 '24 13:10 urosstan-db

Is this only a problem for file source v2? If yes I think we should just fix file source v2 to respect LEGACY_RESPECT_NULLABILITY_IN_TEXT_DATASET_CONVERSION

cloud-fan avatar Oct 08 '24 13:10 cloud-fan

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Jan 17 '25 00:01 github-actions[bot]