spark [SPARK-49893] Respect user schema nullability for file data sources when DSV2 Table is used.

[SPARK-49893] Respect user schema nullability for file data sources when DSV2 Table is used.

Open urosstan-db opened this issue 4 months ago • 9 comments

What changes were proposed in this pull request?

DataFrameReader has 3 APIs for JSON reading

json(DataSet[String]) json(Rdd) json(filePath)

First two APIs respects provided user schema nullability when spark flag spark.sql.legacy.respectNullabilityInTextDatasetConversion is set to true, but third one does not respect and provided schema nullability is always overriden to true.

E.g. dataFrameReader.json(jsonRDD) and dataFrameReader.json(jsonDataSet) will check mentioned config, but dataFrameReader.json(path) will hit totally different code path, and it will end up in FileTable where dataSchema getter will override fields nullability to true.

Why are the changes needed?

Some users just want to have a validation of data and to get exception when some field is nullable.

Does this PR introduce any user-facing change?

When customer set newly added Spark conf spark.sql.respectUserSchemaNullabilityForFileDataSourceWithFilePath, provided user schema nullability will not be overriden to true anymore. Default value for flag is false.

How was this patch tested?

Using integration test in base JsonSuite class.

Was this patch authored or co-authored using generative AI tooling?

Oct 02 '24 09:10 urosstan-db

spark spark copied to clipboard

[SPARK-49893] Respect user schema nullability for file data sources when DSV2 Table is used.

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

spark
spark copied to clipboard