[SPARK-48939][AVRO] Support reading Avro with recursive schema reference

Open eason-yuchen-liu opened this issue 1 year ago • 7 comments

What changes were proposed in this pull request?

The built-in Protobuf connector was the first to support recursive schema references. It does so by letting users specify the option "recursive.fields.max.depth" and, at the start of execution, unrolling each recursive field to that depth. This turns the problem of a schema that varies per row into a fixed schema, which Spark supports. Avro can adopt the same approach. This PR adds an option "recursiveFieldMaxDepth" to both the Avro data source and the from_avro function. With this option set, Spark can read Avro data with a recursive schema up to the given depth.

Why are the changes needed?

A recursive reference is a field whose type refers to a named type that is already defined in one of its parent nodes. A simple example is:

{
  "type": "record",
  "name": "LongList",
  "fields" : [
    {"name": "value", "type": "long"},
    {"name": "next", "type": ["null", "LongList"]}
  ]
}

This is an Avro schema written in JSON, and it represents a linked list. Spark currently throws an error on this schema. Many users have schemas like this, so we should support them.
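To make the unrolling idea concrete, here is a minimal pure-Python sketch applied to the LongList schema above. It is not the code in this PR: the function `unroll`, the exception `RecursiveSchemaError`, and the simplified dict output are hypothetical names chosen for illustration. The depth convention is also an assumption (a record may appear at most `max_depth` times on one path, and a field that would exceed the limit is dropped), and the option in Spark may count depth differently.

```python
import json

# The LongList schema from the PR description.
LONG_LIST = json.loads("""
{
  "type": "record",
  "name": "LongList",
  "fields": [
    {"name": "value", "type": "long"},
    {"name": "next", "type": ["null", "LongList"]}
  ]
}
""")


class RecursiveSchemaError(ValueError):
    """Hypothetical error raised when recursion is found but not permitted."""


def unroll(schema, max_depth, named=None, path_counts=None):
    """Expand `schema` into a fixed, reference-free nested dict.

    Assumed convention (illustration only): max_depth <= 0 forbids
    recursion; otherwise a record may appear at most max_depth times on
    a single path, and a field exceeding that is dropped (None).
    """
    named = {} if named is None else named
    path_counts = {} if path_counts is None else path_counts

    if isinstance(schema, list):
        # This sketch handles only nullable unions of the form ["null", T].
        branches = [b for b in schema if b != "null"]
        if len(branches) != 1:
            raise NotImplementedError('sketch handles only ["null", T] unions')
        return unroll(branches[0], max_depth, named, path_counts)

    if isinstance(schema, str):
        if schema not in named:
            return schema  # primitive type name such as "long"
        schema = named[schema]  # reference to a previously defined record

    if schema.get("type") != "record":
        raise NotImplementedError("sketch handles only records and primitives")

    name = schema["name"]
    named[name] = schema
    count = path_counts.get(name, 0)
    if count > 0 and max_depth <= 0:
        raise RecursiveSchemaError(f"recursive reference to {name}")
    if max_depth > 0 and count >= max_depth:
        return None  # depth limit reached: drop this field

    # Copy the counts so sibling fields do not affect each other.
    child_counts = {**path_counts, name: count + 1}
    fields = {}
    for field in schema["fields"]:
        unrolled = unroll(field["type"], max_depth, named, child_counts)
        if unrolled is not None:
            fields[field["name"]] = unrolled
    return fields
```

Under this assumed convention, `unroll(LONG_LIST, 2)` yields a struct with `value` and one nested `next` that contains only `value`, while `unroll(LONG_LIST, 0)` raises. The result no longer depends on the data in any row, which is what makes it usable as a fixed schema.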

Does this PR introduce any user-facing change?

Yes. Previously, Spark threw an error on recursive schemas like the one above. With this change, it still throws the same error by default, but when users set the option to a number greater than 0, the schema is unrolled to that depth.
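From the user's side, opting in could look roughly like the sketch below. The helper `read_recursive_avro` and its parameters are hypothetical; only the "avro" format name and the "recursiveFieldMaxDepth" option come from this PR. It assumes `spark` is an existing SparkSession with the Avro module available, and the accepted values and defaults may differ in the version that was eventually merged.

```python
def read_recursive_avro(spark, path, max_depth):
    """Read Avro files whose schema is recursive, unrolled to max_depth.

    Hypothetical helper for illustration: `spark` is assumed to be a
    SparkSession with the Avro data source on its classpath.
    """
    return (
        spark.read.format("avro")
        # Option proposed in this PR; without it, recursive schemas still fail.
        .option("recursiveFieldMaxDepth", str(max_depth))
        .load(path)
    )
```

Passing the depth as a string keeps the call independent of how the reader converts option values.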

How was this patch tested?

Added new unit tests and integration tests to AvroSuite and AvroFunctionSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

eason-yuchen-liu avatar Jul 19 '24 17:07 eason-yuchen-liu

@WweiL @bogao007 PTAL. Thanks!

eason-yuchen-liu avatar Jul 23 '24 18:07 eason-yuchen-liu

@cloud-fan Could you please review it? Thanks!

eason-yuchen-liu avatar Jul 29 '24 20:07 eason-yuchen-liu

cc @HeartSaVioR and @rangadi

HyukjinKwon avatar Aug 02 '24 01:08 HyukjinKwon

@gengliangwang Would you mind helping review this change, since you've been one of the main reviewers for Avro? I can give it a try, but I don't feel qualified to review and sign off.

HeartSaVioR avatar Aug 06 '24 03:08 HeartSaVioR

Friendly reminder, @gengliangwang

HeartSaVioR avatar Aug 09 '24 13:08 HeartSaVioR

@HeartSaVioR Thanks for the ping. I will find time to review this one soon.

gengliangwang avatar Aug 09 '24 14:08 gengliangwang

Thanks for the review routing! When it's convenient, can people assign this PR to @WweiL and @hkulyc so that we can keep track of reviews? Thanks again!

WweiL avatar Aug 15 '24 00:08 WweiL

With https://github.com/apache/spark/pull/48043 merged, this PR can be closed. Thanks to everyone who worked on or reviewed this PR!

WweiL avatar Sep 24 '24 23:09 WweiL

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Jan 03 '25 00:01 github-actions[bot]