Add support for Java classes with circular references
What changes were proposed in this pull request?
If the target Java data class has a circular reference, Spark fails fast when creating the Dataset or running Encoders.
This PR adds an option that lets developers decide whether to skip the circular field or let the application fail as before.
Why are the changes needed?
If the target Java data class has a circular reference, Spark fails fast when creating the Dataset or running Encoders, with no way for the developer to opt out.
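The failure mode can be reproduced in miniature without Spark. This is a hypothetical sketch (class and method names are mine, not Spark's) of the kind of depth-first bean-property walk that trips over a circular reference, mirroring the condition the bean encoder rejects:

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.HashSet;
import java.util.Set;

public class CircularRefCheck {
    // Self-referential bean, like the failing case in this PR.
    static class Node {
        private Node next;
        public Node getNext() { return next; }
        public void setNext(Node next) { this.next = next; }
    }

    // Plain bean with no cycle; should pass the check.
    static class Flat {
        private int value;
        public int getValue() { return value; }
        public void setValue(int value) { this.value = value; }
    }

    // Walk readable bean properties depth-first; report true if any
    // property type loops back to a class already on the current path.
    static boolean hasCycle(Class<?> cls, Set<Class<?>> path) {
        if (cls.isPrimitive() || cls.getName().startsWith("java.")) return false;
        if (!path.add(cls)) return true;  // already on this path: circular reference
        try {
            for (PropertyDescriptor pd :
                    Introspector.getBeanInfo(cls, Object.class).getPropertyDescriptors()) {
                if (pd.getReadMethod() != null && hasCycle(pd.getPropertyType(), path)) {
                    return true;
                }
            }
        } catch (IntrospectionException e) {
            // cannot introspect this type: treat it as a leaf
        }
        path.remove(cls);  // backtrack so sibling properties may share types
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasCycle(Node.class, new HashSet<>()));  // true
        System.out.println(hasCycle(Flat.class, new HashSet<>()));  // false
    }
}
```

Spark's actual check lives in its bean type inference; the sketch only illustrates why a generated self-referential field makes the traversal non-terminating unless it is detected.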
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Issue: https://issues.apache.org/jira/browse/SPARK-33598
Can one of the admins verify this patch?
Hm, skipping them doesn't seem right either. Not sure if this should be an option; it is just something that doesn't make sense to encode.
If the developer/application is comfortable with a field having a self/circular reference, then from Spark's perspective I think it should let the developer stop the loop gracefully (that is, skip further processing of that field, at the developer's own judgement).
This PR does not force either behavior (stop the whole application immediately, as today, or skip the field); it leaves the choice to the developer. Ultimately, the developers building the application have the best knowledge of how to handle it.
I guess Spark assumed a circular reference must be a mistake by the developer/application. But it can be a valid case, even if a rare one.
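As a hedged illustration of the proposed "skip" behavior (the names and structure here are hypothetical, not the PR's actual code): when a property's type is already on the traversal path, drop that property instead of throwing, and keep the rest of the schema:

```java
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SkipCircularSketch {
    static class Node {
        private String label;
        private Node parent;  // circular field: same type as the enclosing bean
        public String getLabel() { return label; }
        public void setLabel(String l) { label = l; }
        public Node getParent() { return parent; }
        public void setParent(Node p) { parent = p; }
    }

    // Collect top-level property names, silently dropping any property
    // whose type is already on the traversal path (the "skip" option).
    static List<String> fields(Class<?> cls, Set<Class<?>> path) {
        List<String> kept = new ArrayList<>();
        path.add(cls);
        try {
            for (PropertyDescriptor pd :
                    Introspector.getBeanInfo(cls, Object.class).getPropertyDescriptors()) {
                if (pd.getReadMethod() == null) continue;
                if (path.contains(pd.getPropertyType())) continue;  // skip circular field
                kept.add(pd.getName());
            }
        } catch (Exception e) {
            // introspection failed: keep nothing extra
        }
        path.remove(cls);
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(fields(Node.class, new HashSet<>()));  // [label]
    }
}
```

The fail-fast behavior would simply throw where the sketch `continue`s; making that a per-call option is the choice this PR argues for.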
Can you describe a valid use case? I can't think of one. Encoders are used with data classes, bean-like classes
Google Protobuf is an example; it's widely used as a data class. A protobuf class has an attribute called Descriptor (a generated field) which references back circularly.
The current Spark implementation doesn't work with protobuf.
There are some other examples on the issue: https://issues.apache.org/jira/browse/SPARK-33598
Hey all, is there any update on this PR? I am currently running into an issue where I have a JSON schema with circular references. I am using Encoders.bean(classOf[Example.class]) to map the input to the POJO generated from that schema, and it fails because of the circular reference.
Hey all, wondering if this PR (or some similar fix) got merged. I have similar issues with circular references and it would be great to have an option to skip the check. PS: My example is not related to protobuf but to the POJO classes generated from the XSD schemas for FHIR - https://hl7.org/fhir/downloads.html Thanks.
Hi, I am hitting this circular reference problem while processing Kafka Avro messages with Spark 3.3.0.
Exception in thread "main" java.lang.UnsupportedOperationException: Cannot have circular references in bean class, but got the circular reference of class class org.apache.avro.Schema
https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/Schema.java
Hi @srowen, would you consider Avro schemas a valid reason for supporting this feature, given that Avro is pretty popular in general? Thanks
Still seems weird to me. Does this even happen to be 'enough' for the protobuf case, or does this extra unwanted descriptor field add other unneeded columns? Is it 'too much': is it skipping real circular references that matter but can't translate to tabular schemas? Is it solvable by just subclassing the bean class and hiding the undesirable field? How does the circular ref arise in the Avro case; is it different?
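On the subclassing/wrapping idea raised above: a minimal sketch, assuming a protobuf-style generated class (all names here are hypothetical), that maps the circular class to a flat wrapper bean exposing only the encodable fields:

```java
public class WrapperSketch {
    // Stands in for a generated class whose descriptor field loops back.
    static class Generated {
        private String name;
        private Generated descriptor;  // the circular, generated field
        public String getName() { return name; }
        public void setName(String n) { name = n; }
        public Generated getDescriptor() { return descriptor; }
        public void setDescriptor(Generated d) { descriptor = d; }
    }

    // Encodable view: copies only the scalar fields, no circular property,
    // and keeps the no-arg constructor bean encoders typically require.
    public static class FlatView {
        private String name;
        public FlatView() {}
        public FlatView(Generated g) { this.name = g.getName(); }
        public String getName() { return name; }
        public void setName(String n) { name = n; }
    }

    public static void main(String[] args) {
        Generated g = new Generated();
        g.setName("example");
        g.setDescriptor(g);  // circular reference, as in the reported failures
        FlatView v = new FlatView(g);
        System.out.println(v.getName());  // example
    }
}
```

The trade-off is boilerplate per generated class, which is why the thread keeps circling back to a built-in skip option instead.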
I get it; it just seems a bit too hacky as the 'right' solution. Sometimes hacks are worth it.