[SPARK-46108][SQL] keepInnerXmlAsRaw option for Built-in XML Data Source
What changes were proposed in this pull request?
The built-in XML data source returns the parsed values and schema of inner (nested) elements. However, developers must then perform additional manual operations to convert that data into a structured, tabular format. If nested elements are instead kept as raw XML at each level, they can be converted to a structured, tabular format with methods that already exist (the infer method of XmlInferSchema and the parseColumn method of StaxXmlParser). This PR therefore proposes an option, affecting StaxXmlParser and XmlInferSchema, that keeps inner XML elements in their original (raw) format.
Why are the changes needed?
- We can easily convert internal or nested XML elements into structured or tabular format by using existing methods with this option.
- For complex XML files, we can apply step-by-step conversion to XML elements at each level. In each step, only the element at the next level is considered.
- With step-by-step conversion, we can remove or skip unwanted inner XML elements.
- Element-specific operations can be applied to elements at each level.
- Parent-child relationships between elements can be preserved.
- No additional operations are required to convert the unstructured format to the structured format.
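The level-by-level idea in the list above can be sketched outside Spark. What follows is a minimal, illustrative Python sketch using only the standard library; it is not the PR's implementation and does not use StaxXmlParser or XmlInferSchema. The sample document, the `parse_one_level` helper, and the `_` prefix on attribute names are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; tag and attribute names are illustrative only.
DOC = """<TEAMS>
  <TEAM Name="core">
    <PERSON Name="a" Surname="b" SecId="1"><TASK Id="t1"/><TASK Id="t2"/></PERSON>
    <PERSON Name="c" Surname="d" SecId="2"><TASK Id="t3"/></PERSON>
  </TEAM>
</TEAMS>"""


def parse_one_level(xml_text):
    """Parse exactly one level of XML.

    Attributes become fields (prefixed with "_" so they cannot clash with
    child tag names); each child element is kept as a raw XML string.
    """
    elem = ET.fromstring(xml_text)
    row = {f"_{name}": value for name, value in elem.attrib.items()}
    for child in elem:
        child.tail = None  # do not carry trailing whitespace into the raw string
        raw = ET.tostring(child, encoding="unicode")
        row.setdefault(child.tag, []).append(raw)
    return row


# Step 1: only the top level is parsed; TEAM stays a list of raw XML strings.
teams = parse_one_level(DOC)

# Step 2: descend one level at a time, choosing which children to follow.
team = parse_one_level(teams["TEAM"][0])
people = [parse_one_level(p) for p in team["PERSON"]]

# Step 3: TASK elements are still raw here and can be parsed, or skipped, later.
first_task = parse_one_level(people[0]["TASK"][0])
```

Because every child stays a string until it is explicitly parsed, unwanted subtrees can be dropped without ever being converted, and each parsed row still sits next to its parent's fields.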
"keepInnerXmlAsRaw" option is true:
applying infer and parseColumn methods to PERSON column
applying infer and parseColumn methods to TASK column
"keepInnerXmlAsRaw" option is false:
Does this PR introduce any user-facing change?
Yes. When creating an XML data source, users can set a new option, "keepInnerXmlAsRaw", to keep inner elements in XML format.
How was this patch tested?
New unit tests were added and pass. In addition, the following ran successfully:
$ build/sbt
> project sql
> test
Was this patch authored or co-authored using generative AI tooling?
No
This isn't a minor .. let's file a JIRA.
@HyukjinKwon I've created a subtask (SPARK-46108) under "Built-in XML data source support (SPARK-44265)", named "XML: keepInnerXmlAsRaw option". If it's okay with you, I can update the title of this PR with the JIRA task number.
cc @sandeep-katta @shujingyang-db if you find some time to review.
@ufuksungu Thanks for working on this! I'd like to discuss the proposed use case for this feature. We might have a more straightforward alternative. For instance, in the Person example you mentioned in the PR description, if a user prefers to retain the field as a string, they can declare the field as a string type. They can use from_xml to convert it to a struct type later on. What are your thoughts on this?
@shujingyang-db
Hey! Thanks for the feedback. I might have misunderstood your scenario, so please correct me if I am wrong. keepInnerXmlAsRaw allows the schema of the person field to be either a String or an Array<String>. For the from_xml function, we still need to infer the schema of the person column, so we still need the infer method. After that, from_xml can be used instead of parseColumn, as you said.
@ufuksungu To ensure we're aligned on the use case of this feature, can you please detail a specific use case of this feature? It will be helpful to start with an XML example, apply a spark command, and then compare the expected results with the current implementation. This will give us a clearer picture of the feature's impact.
@shujingyang-db The current implementation provides the StructType and values of the tags nested within the rowTag. My main intention was to hold nested tags as XML strings while reading the XML file, so that flattening or fully parsing them (the Person example in the PR description) could be done dynamically using the existing methods (infer and parseColumn). However, while looking at the built-in functions written for XML, I noticed that the same thing can be done with the to_xml method. With to_xml, you can convert the parsed values (current implementation) to an XML string and then apply the infer and parseColumn methods for dynamic flattening. That gives the same result as mine. The difference between this feature and using to_xml is that this feature extracts the XML string from the very beginning, whereas to_xml adds one extra step: parsing the object from top to bottom. I hope I have conveyed exactly what I wanted to do. Since the same thing can be done with to_xml, this feature seems unnecessary. What do you think? By the way, if you still want to see an example, I can write one.
hello, I know this isn't directly related to this pr but just fyi i'm trying out spark 3.5 in databricks runtime 14.2 and i'm seeing the keepInnerXmlAsRaw = true behavior when using cloud_files(), i.e. the contents of rowTag elements do not get parsed. The read_files() function works fine as expected.
@adriennn Hey, thanks for the heads up. I didn't know about that. This will help to flatten complex XML files dynamically in Spark 3.x versions.
@ufuksungu Thanks for the follow-up! We appreciate your efforts on this feature. In this case, I believe from_xml and to_xml have similar functionality to keepInnerXmlAsRaw.
@adriennn Can you please share your XML example and your code? Would love to understand the issue
@shujingyang-db Thank you for your support and thoughts throughout the PR process. You're right. Given the discussion, I think this request should be closed.
@HyukjinKwon Shall I close the PR? Also, what about SPARK-46108?
@shujingyang-db Shared with you by email.
To preserve the nested XML, simply set the corresponding column type to string in the schema. I don't think this option is necessary.
@srowen, let's say the inner XML contains Name="a" Surname="b" SecId="1" in a PERSON tag, and the root tag is TEAMS. After reading the XML source with "rowTag" set to "TEAMS", you get "{a, 1, b, null}" or something similar for the PERSON tag. If you cast that to a string, it can no longer be parsed into a structured form. Likewise, if you don't cast it to String, after the read you do have a StructType with the related StructFields, but it needs additional transformations. What I was aiming for here is getting structured data by using the XML source's internal functions. Am I missing something in your comment, or is there functionality in the original XML repository that I don't know about?
I read this again and I still don't understand it. Are you trying to parse, or not parse, some subset of the XML? Both are already possible, though, and I can't see a use case for anything else.
@srowen, before this PR I came across a scenario while using the original XML repository (I also know you are one of its contributors). Assume I have some complex XML files with many layers, each presenting different tags. At each tag, I need to apply specific operations or transformations to the related data and save the result somewhere else. Basically: apply an operation at one level, save it, then proceed to the next (inner) tag; some tags can be irrelevant as well, let's say after the 8th level. So I want to apply my operations to the XML level by level (and lazily). To proceed to a nested tag, I need to apply some transformations manually to get it into a structured format. But the repository already has functions (InferSchema.infer and StaxXmlParser.parseColumn) that convert XML to structured data, and to use them I think I need to keep the inner XML in its original format. At the time, I thought it made more sense to use the existing functions, and that is why I opened this PR.
You can preserve subtrees of the XML as strings as I mentioned above. You can further process the fragment with an XML parser, or with from_xml even.
@srowen I understand your point. As far as I understand, from_xml function expects the string in XML format. Is my understanding correct, or am I mistaken? If I have misunderstood, I got what you mean.
For the opposite case: when I read the XML file, the result is as seen in the photo. Is from_xml able to read this kind of format? Or did I do the read incorrectly? The read was defined by setting the rowTag and inferSchema options and then loading the XML file.
Yes, you use from_xml to parse XML that is already a fragment that should become a parsed struct. It does not read already-parsed data; it parses. My premise is that you do not need to parse PERSON; it can be parsed as string. Then you do what you want. Then you parse it later as you want. But I'm still not sure that's what you're talking about, in which case I'm not sure what the use case is.
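The distinction drawn in this comment, between an XML fragment that can still be parsed and already-parsed data that cannot, can be illustrated with a small standard-library Python sketch. It does not call Spark or from_xml; `try_parse` is a hypothetical helper, and the two input strings are taken from the PERSON example earlier in the thread.

```python
import xml.etree.ElementTree as ET

# A PERSON element preserved as a raw XML string (what a string-typed column would hold).
raw_fragment = '<PERSON Name="a" Surname="b" SecId="1"/>'

# The string form of an already-parsed struct, as quoted earlier in the thread.
stringified_struct = "{a, 1, b, null}"


def try_parse(text):
    """Return the element's attributes as a dict, or None if text is not XML."""
    try:
        return dict(ET.fromstring(text).attrib)
    except ET.ParseError:
        return None


parsed_raw = try_parse(raw_fragment)           # attributes recovered from the fragment
parsed_struct = try_parse(stringified_struct)  # None: this string is not XML
```

Only the raw fragment survives a later parse, which is why the field has to be kept as a string at read time if it is going to be parsed afterwards.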
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!