[SPARK-25050][SQL] Avro: writing complex unions
Add the capability to write complex unions, next to reading them. Complex unions map to struct types where field names are member0, member1, etc. This is consistent with the behavior in SchemaConverters for reading them and when converting between Avro and Parquet.
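The mapping described above can be inspected directly. The following is a minimal sketch, assuming the spark-avro module is on the classpath (for example in a spark-shell started with that package). The record name Example and the field name value are made up for illustration and do not come from this PR.

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.StructType

// Hypothetical Avro record with one field that is a complex union of int and string.
val unionSchemaJson =
  """{
    |  "type": "record",
    |  "name": "Example",
    |  "fields": [
    |    {"name": "value", "type": ["int", "string"]}
    |  ]
    |}""".stripMargin

val parsedAvroSchema: Schema = new Schema.Parser().parse(unionSchemaJson)

// Convert the Avro schema to its Catalyst representation.
val sqlType = SchemaConverters.toSqlType(parsedAvroSchema).dataType.asInstanceOf[StructType]

// The union field is represented as a struct with one field per union branch.
val unionStruct = sqlType("value").dataType.asInstanceOf[StructType]
val memberNames = unionStruct.fieldNames.toSeq

println(memberNames.mkString(", "))
```

With this PR, a struct of the same shape is what the writer expects when the target Avro type is a complex union.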
What changes were proposed in this pull request?
Spark could already read complex unions but could not write them. With this change it can write them as well. If you have a schema with a complex union, the following code now works:
spark
.read.format("avro").option("avroSchema", avroSchema).load(path)
.write.format("avro").option("avroSchema", avroSchema).save("/tmp/b")
Before this patch, the write would fail with an Unsupported Avro UNION type error.
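For readers without an existing Avro file, a write can also be sketched from a hand-built DataFrame. This is an illustrative sketch only, assuming a Spark build that includes this patch and the spark-avro module. The schema, column names, and output path are hypothetical, and the snippet has not been checked against a specific Spark version.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("complex-union-write")
  .getOrCreate()
import spark.implicits._

// Hypothetical target schema: one field holding a union of int and string.
val avroSchema =
  """{"type":"record","name":"Example","fields":[{"name":"value","type":["int","string"]}]}"""

// Each row sets exactly one branch of the union; the other branch is null.
val df = Seq[(Option[Int], Option[String])]((Some(1), None), (None, Some("a")))
  .toDF("i", "s")
  .select(struct(col("i").as("member0"), col("s").as("member1")).as("value"))

// Hypothetical output location.
val out = "/tmp/complex-union-example"
df.write.format("avro").option("avroSchema", avroSchema).mode("overwrite").save(out)

// Read the data back to confirm the round trip.
val roundTrip = spark.read.format("avro").load(out)
```

The struct fields are named member0 and member1 so that they line up with the union branches in declaration order, following the convention stated in the PR description.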
Why are the changes needed?
Fixes SPARK-25050 and brings write support for complex unions in line with the existing read support.
Does this PR introduce any user-facing change?
The behaviour is improved, but as far as I can see this does not affect any customer-facing APIs or documentation.
How was this patch tested?
- Added extra unit tests.
- Updated existing unit tests for improved behaviour.
- Validated manually against an internal corpus of Avro files that they can now be read and written without problems, which was not possible before this patch.
Can one of the admins verify this patch?
@steven-aerts Thanks for working on this feature.
+1 to this PR. The lack of write support for complex union types causes problems for us too. Right now, since the standard Dataframe/Dataset APIs do not support writing out unions with multiple subtypes, we have had to either change the underlying schema, which may be cumbersome in some cases, or use the saveAsNewAPIHadoopFile RDD API, which skips the Catalyst path.
cc: @mridulm
+CC @dongjoon-hyun, @HyukjinKwon who might be able to review this better than me.
@steven-aerts just want to check if you are still working on this?
Is there still something I can/have to do to get this patch submitted?
Unfortunately, none of us (myself, @thejdeep, @robreeves) are committers, just interested parties, so we can't merge. We need a review from a committer as well.
@gengliangwang , @cloud-fan , @dongjoon-hyun , can any of you take a look at this?
@gengliangwang any comments on the latest diff, after @steven-aerts answered your last question? Seems that this PR is in a very healthy state, I would love to see it merged.
Ping @cloud-fan @gengliangwang @dongjoon-hyun , are any of you available to help review?
@steven-aerts @xkrogen Sorry for the late reply. I will take another look later today.
cc @bozhang2820 since you made https://github.com/apache/spark/commit/551b504cfe38d1ab583e617c37e49659edd65c2e
Merging to master/3.4. cc Spark 3.4.0 release manager @xinrong-meng
Thank you all! I'm also supporting @gengliangwang 's backporting decision.