[SPARK-48939][AVRO] Support reading Avro with recursive schema reference
What changes were proposed in this pull request?
The built-in Protobuf connector already supports recursive schema references. It does so by letting users specify the option "recursive.fields.max.depth" and, at the start of execution, unrolling each recursive field to that depth. This turns the problem of a dynamic, per-row schema into a fixed schema, which Spark supports. Avro can adopt a similar method. This PR adds an option "recursiveFieldMaxDepth" to both the Avro data source and the from_avro function. With this option, Spark can support recursive Avro schemas up to a specified depth.
Why are the changes needed?
A recursive reference is the case where the type of a field refers to a type that is already defined in one of its parent nodes. A simple example is:
{
  "type": "record",
  "name": "LongList",
  "fields": [
    {"name": "value", "type": "long"},
    {"name": "next", "type": ["null", "LongList"]}
  ]
}
This is written in the Avro schema DSL and represents a linked-list data structure. Spark currently throws an error on this schema. Many users have schemas like this, so we should support them.
Does this PR introduce any user-facing change?
Yes. Previously, Spark threw an error on recursive schemas like the one above. With this change, it still throws the same error by default, but when users set the option to a number greater than 0, the schema is unrolled to that depth.
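To make the unrolling idea concrete, here is a minimal Python sketch that is independent of Spark. It is not the code in this PR. The function name `unroll`, the plain-dict output, and the exact depth semantics (a record may appear at most `max_depth` times on any path from the root, and a field that would exceed that is dropped) are assumptions made for illustration only. The sketch handles just records, primitives, named references, and nullable unions.

```python
import json

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}

# The LongList schema from the description above.
LONG_LIST = json.loads("""
{
  "type": "record",
  "name": "LongList",
  "fields": [
    {"name": "value", "type": "long"},
    {"name": "next", "type": ["null", "LongList"]}
  ]
}
""")


def unroll(schema, max_depth):
    """Unroll a (possibly recursive) Avro schema into a fixed nested dict.

    Hypothetical semantics, assumed for this sketch: a record may appear at
    most `max_depth` times on a path from the root. With max_depth <= 0,
    any recursive reference raises ValueError.
    """
    named = {}  # record name -> record definition, filled in as records are visited

    def convert(node, seen):
        # `seen` maps a record name to how often it occurs on the current path.
        if isinstance(node, list):
            # Union: this sketch only handles the nullable form ["null", T].
            branches = [b for b in node if b != "null"]
            if len(branches) != 1:
                raise ValueError("only nullable unions are handled in this sketch")
            return convert(branches[0], seen)
        if isinstance(node, str):
            if node in PRIMITIVES:
                return node
            if node not in named:
                raise ValueError(f"unknown type: {node}")
            node = named[node]  # resolve a named reference such as "LongList"
        if node["type"] != "record":
            # For example {"type": "long"}; other complex types are rejected.
            return convert(node["type"], seen)

        name = node["name"]
        named[name] = node
        count = seen.get(name, 0)
        if count > 0 and max_depth <= 0:
            raise ValueError(f"recursive reference to {name}")
        if count > 0 and count >= max_depth:
            return None  # depth budget exhausted: the caller drops this field

        inner = {**seen, name: count + 1}
        fields = {}
        for field in node["fields"]:
            converted = convert(field["type"], inner)
            if converted is not None:
                fields[field["name"]] = converted
        return fields

    return convert(schema, {})
```

Under these assumed semantics, `unroll(LONG_LIST, 2)` keeps one nested `next` level and drops the one below it, while `unroll(LONG_LIST, 0)` raises, which mirrors the default behavior described above. The output is a fixed structure no matter how long the list in a given row is, which is the property that lets a recursive schema fit a static Spark schema.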
How was this patch tested?
Added new unit tests and integration tests to AvroSuite and AvroFunctionsSuite.
Was this patch authored or co-authored using generative AI tooling?
No.
@WweiL @bogao007 PTAL. Thanks!
@cloud-fan Could you please review it? Thanks!
cc @HeartSaVioR and @rangadi
@gengliangwang Would you mind helping review this change, as you've been one of the main reviewers for Avro? I can give it a try, but I don't feel qualified to review and sign off.
Friendly reminder, @gengliangwang
@HeartSaVioR Thanks for the ping. I will find time to review this one soon.
Thanks for the review routing! When it's convenient, can people assign this PR to @WweiL and @hkulyc so that we can keep track of reviews? Thanks again!
With https://github.com/apache/spark/pull/48043 merged, this PR can be closed. Thanks to everyone who worked on or reviewed this PR!
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!