spark [SPARK-25050][SQL] Avro: writing complex unions

trafficstars

Add the capability to write complex unions, next to reading them. Complex unions map to struct types where field names are member0, member1, etc. This is consistent with the behavior in SchemaConverters for reading them and when converting between Avro and Parquet.

What changes were proposed in this pull request?

Spark was able to read complex unions already but not write them. Now it is possible to also write them. If you have a schema with a complex union the following code is now working:

spark
  .read.format("avro").option("avroSchema", avroSchema).load(path)
  .write.format("avro").option("avroSchema", avroSchema).save("/tmp/b")

While before this patch it would throw Unsupported Avro UNION type when writing.

Why are the changes needed?

Fixes SPARK-25050, lines up read and write compatibility.

Does this PR introduce any user-facing change?

The behaviour improved of course, this is as far as I could see not impacting any customer facing API's or documentation.

How was this patch tested?

Added extra unit tests.
Updated existing unit tests for improved behaviour.
Validated manually with an internal corpus of avro files if they now could be read and written without problems. Which was not before this patch.

May 11 '22 07:05 steven-aerts

Can one of the admins verify this patch?

May 11 '22 21:05 AmplabJenkins

@steven-aerts Thanks for working on this feature.

+1 to this PR. The lack of complex union type write support causes us problems too. Right now, since the standard Dataframe/Dataset APIs do not support writing out unions with multiple subtypes, we have been deferring to changing the underlying schema which maybe cumbersome in some cases or having to use the saveAsNewAPIHadoopFile RDD API which skips the Catalyst path.

cc: @mridulm

Aug 11 '22 17:08 thejdeep

+CC @dongjoon-hyun, @HyukjinKwon who might be able to review this better than me.

Aug 11 '22 20:08 mridulm

@steven-aerts just want to check if you are still working on this?

Aug 23 '22 17:08 xkrogen

Is there still something I can/have to do to get this patch submitted?

Dec 02 '22 08:12 steven-aerts

Unfortunately none of I / @thejdeep / @robreeves are committers, just interested parties, so we can't merge. We need a review from a committer as well.

@gengliangwang , @cloud-fan , @dongjoon-hyun , can any of you take a look at this?

Dec 02 '22 17:12 xkrogen

@gengliangwang any comments on the latest diff, after @steven-aerts answered your last question? Seems that this PR is in a very healthy state, I would love to see it merged.

Jan 17 '23 14:01 xkrogen

Ping @cloud-fan @gengliangwang @dongjoon-hyun , are any of you available to help review?

Feb 21 '23 22:02 xkrogen

@steven-aerts @xkrogen Sorry for the late reply. I will take another look later today.

Feb 21 '23 23:02 gengliangwang

cc @bozhang2820 since you made https://github.com/apache/spark/commit/551b504cfe38d1ab583e617c37e49659edd65c2e

Feb 22 '23 06:02 gengliangwang

Merging to master/3.4. cc Spark 3.4.0 release manager @xinrong-meng

Feb 23 '23 21:02 gengliangwang

Thank you all! I'm also supporting @gengliangwang 's backporting decision.

Feb 23 '23 21:02 dongjoon-hyun

spark spark copied to clipboard

[SPARK-25050][SQL] Avro: writing complex unions

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

spark
spark copied to clipboard