Add support for Java classes with circular references
What changes were proposed in this pull request?
If the target Java data class has a circular reference, Spark fails fast when creating the Dataset or running Encoders.
This PR adds an option that lets developers decide whether to skip the circular field or let the application fail as before.
Why are the changes needed?
If the target Java data class has a circular reference, Spark fails fast when creating the Dataset or running Encoders, with no way for the developer to opt out.
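The failure mode can be reproduced in miniature without Spark. This is a hypothetical sketch (class and method names are mine, not Spark's) of the kind of depth-first bean-property walk that trips over a circular reference, mirroring the condition the bean encoder rejects:

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.HashSet;
import java.util.Set;

public class CircularRefCheck {
    // Self-referential bean, like the failing case in this PR.
    static class Node {
        private Node next;
        public Node getNext() { return next; }
        public void setNext(Node next) { this.next = next; }
    }

    // Plain bean with no cycle; should pass the check.
    static class Flat {
        private int value;
        public int getValue() { return value; }
        public void setValue(int value) { this.value = value; }
    }

    // Walk readable bean properties depth-first; report true if any
    // property type loops back to a class already on the current path.
    static boolean hasCycle(Class<?> cls, Set<Class<?>> path) {
        if (cls.isPrimitive() || cls.getName().startsWith("java.")) return false;
        if (!path.add(cls)) return true;  // already on this path: circular reference
        try {
            for (PropertyDescriptor pd :
                    Introspector.getBeanInfo(cls, Object.class).getPropertyDescriptors()) {
                if (pd.getReadMethod() != null && hasCycle(pd.getPropertyType(), path)) {
                    return true;
                }
            }
        } catch (IntrospectionException e) {
            // cannot introspect this type: treat it as a leaf
        }
        path.remove(cls);  // backtrack so sibling properties may share types
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasCycle(Node.class, new HashSet<>()));  // true
        System.out.println(hasCycle(Flat.class, new HashSet<>()));  // false
    }
}
```

Spark's actual check lives in its bean type inference; the sketch only illustrates why a generated self-referential field makes the traversal non-terminating unless it is detected.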
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Issue: https://issues.apache.org/jira/browse/SPARK-33598
Can one of the admins verify this patch?
Hm, skipping them doesn't seem right either. Not sure if this should be an option; it is just something that doesn't make sense to encode.
If the developer/application is comfortable with a field having a self/circular reference, then from Spark's perspective I think it should let the developer stop the loop gracefully (that is, skip further processing of that field, at the developer's own judgement).
This PR does not force either behavior (stop the whole application immediately, as today, or skip the field); it leaves the choice to the developer. Ultimately, the developers building the application have the best knowledge of how to handle it.
I guess Spark assumed a circular reference must be a mistake by the developer/application. But it can be a valid case, even if a rare one.
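As a hedged illustration of the proposed "skip" behavior (the names and structure here are hypothetical, not the PR's actual code): when a property's type is already on the traversal path, drop that property instead of throwing, and keep the rest of the schema:

```java
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SkipCircularSketch {
    static class Node {
        private String label;
        private Node parent;  // circular field: same type as the enclosing bean
        public String getLabel() { return label; }
        public void setLabel(String l) { label = l; }
        public Node getParent() { return parent; }
        public void setParent(Node p) { parent = p; }
    }

    // Collect top-level property names, silently dropping any property
    // whose type is already on the traversal path (the "skip" option).
    static List<String> fields(Class<?> cls, Set<Class<?>> path) {
        List<String> kept = new ArrayList<>();
        path.add(cls);
        try {
            for (PropertyDescriptor pd :
                    Introspector.getBeanInfo(cls, Object.class).getPropertyDescriptors()) {
                if (pd.getReadMethod() == null) continue;
                if (path.contains(pd.getPropertyType())) continue;  // skip circular field
                kept.add(pd.getName());
            }
        } catch (Exception e) {
            // introspection failed: keep nothing extra
        }
        path.remove(cls);
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(fields(Node.class, new HashSet<>()));  // [label]
    }
}
```

The fail-fast behavior would simply throw where the sketch `continue`s; making that a per-call option is the choice this PR argues for.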
Can you describe a valid use case? I can't think of one. Encoders are used with data classes, bean-like classes
Google Protobuf is an example; it's widely used as a data class. A protobuf class has an attribute called Descriptor (a generated field) which references back circularly.
The current Spark implementation doesn't work with protobuf.
There are some other examples on the issue: https://issues.apache.org/jira/browse/SPARK-33598
Hey all, is there any update on this PR? I am currently running into an issue where I have a JSON schema with circular references. I am using Encoders.bean(classOf[Example.class]) to map the input to the POJO generated from that schema, and it fails because of the circular reference.
Hey all, wondering if this PR (or some similar fix) got merged. I have similar issues with circular references and it would be great to have an option to skip the check. PS: My example is not related to protobuf but to the POJO classes generated from the XSD schemas for FHIR - https://hl7.org/fhir/downloads.html Thanks.
Hi, I am hitting this circular reference problem while processing Kafka Avro messages with Spark 3.3.0.
Exception in thread "main" java.lang.UnsupportedOperationException: Cannot have circular references in bean class, but got the circular reference of class class org.apache.avro.Schema
https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/Schema.java
Hi @srowen, would you consider Avro schemas a valid reason for supporting this feature, given that Avro is pretty popular in general? Thanks
Still seems weird to me. Does this even happen to be 'enough' for the protobuf case, or does this extra unwanted descriptor field add other unneeded columns? Is it 'too much': is it skipping real circular references that matter but can't translate to tabular schemas? Is it solvable by just subclassing the bean class and hiding the undesirable field? How does the circular ref arise in the Avro case; is it different?
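On the subclassing/wrapping idea raised above: a minimal sketch, assuming a protobuf-style generated class (all names here are hypothetical), that maps the circular class to a flat wrapper bean exposing only the encodable fields:

```java
public class WrapperSketch {
    // Stands in for a generated class whose descriptor field loops back.
    static class Generated {
        private String name;
        private Generated descriptor;  // the circular, generated field
        public String getName() { return name; }
        public void setName(String n) { name = n; }
        public Generated getDescriptor() { return descriptor; }
        public void setDescriptor(Generated d) { descriptor = d; }
    }

    // Encodable view: copies only the scalar fields, no circular property,
    // and keeps the no-arg constructor bean encoders typically require.
    public static class FlatView {
        private String name;
        public FlatView() {}
        public FlatView(Generated g) { this.name = g.getName(); }
        public String getName() { return name; }
        public void setName(String n) { name = n; }
    }

    public static void main(String[] args) {
        Generated g = new Generated();
        g.setName("example");
        g.setDescriptor(g);  // circular reference, as in the reported failures
        FlatView v = new FlatView(g);
        System.out.println(v.getName());  // example
    }
}
```

The trade-off is boilerplate per generated class, which is why the thread keeps circling back to a built-in skip option instead.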
I get it; it just seems a bit too hacky as the 'right' solution. Sometimes hacks are worth it.