spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-25050][SQL] Avro: writing complex unions

Open steven-aerts opened this issue 3 years ago • 4 comments
trafficstars

Add the capability to write complex unions, next to reading them. Complex unions map to struct types where field names are member0, member1, etc. This is consistent with the behavior in SchemaConverters for reading them and when converting between Avro and Parquet.

What changes were proposed in this pull request?

Spark was able to read complex unions already but not write them. Now it is possible to also write them. If you have a schema with a complex union the following code is now working:

spark
  .read.format("avro").option("avroSchema", avroSchema).load(path)
  .write.format("avro").option("avroSchema", avroSchema).save("/tmp/b")

While before this patch it would throw Unsupported Avro UNION type when writing.

Why are the changes needed?

Fixes SPARK-25050, lines up read and write compatibility.

Does this PR introduce any user-facing change?

The behaviour improved of course, this is as far as I could see not impacting any customer facing API's or documentation.

How was this patch tested?

  • Added extra unit tests.
  • Updated existing unit tests for improved behaviour.
  • Validated manually with an internal corpus of avro files if they now could be read and written without problems. Which was not before this patch.

steven-aerts avatar May 11 '22 07:05 steven-aerts

Can one of the admins verify this patch?

AmplabJenkins avatar May 11 '22 21:05 AmplabJenkins

@steven-aerts Thanks for working on this feature.

+1 to this PR. The lack of complex union type write support causes us problems too. Right now, since the standard Dataframe/Dataset APIs do not support writing out unions with multiple subtypes, we have been deferring to changing the underlying schema which maybe cumbersome in some cases or having to use the saveAsNewAPIHadoopFile RDD API which skips the Catalyst path.

cc: @mridulm

thejdeep avatar Aug 11 '22 17:08 thejdeep

+CC @dongjoon-hyun, @HyukjinKwon who might be able to review this better than me.

mridulm avatar Aug 11 '22 20:08 mridulm

@steven-aerts just want to check if you are still working on this?

xkrogen avatar Aug 23 '22 17:08 xkrogen

Is there still something I can/have to do to get this patch submitted?

steven-aerts avatar Dec 02 '22 08:12 steven-aerts

Unfortunately none of I / @thejdeep / @robreeves are committers, just interested parties, so we can't merge. We need a review from a committer as well.

@gengliangwang , @cloud-fan , @dongjoon-hyun , can any of you take a look at this?

xkrogen avatar Dec 02 '22 17:12 xkrogen

@gengliangwang any comments on the latest diff, after @steven-aerts answered your last question? Seems that this PR is in a very healthy state, I would love to see it merged.

xkrogen avatar Jan 17 '23 14:01 xkrogen

Ping @cloud-fan @gengliangwang @dongjoon-hyun , are any of you available to help review?

xkrogen avatar Feb 21 '23 22:02 xkrogen

@steven-aerts @xkrogen Sorry for the late reply. I will take another look later today.

gengliangwang avatar Feb 21 '23 23:02 gengliangwang

cc @bozhang2820 since you made https://github.com/apache/spark/commit/551b504cfe38d1ab583e617c37e49659edd65c2e

gengliangwang avatar Feb 22 '23 06:02 gengliangwang

Merging to master/3.4. cc Spark 3.4.0 release manager @xinrong-meng

gengliangwang avatar Feb 23 '23 21:02 gengliangwang

Thank you all! I'm also supporting @gengliangwang 's backporting decision.

dongjoon-hyun avatar Feb 23 '23 21:02 dongjoon-hyun