
XML format plugin concatenates attribute values from multiple sub-elements with the same name

Open KendraKrat opened this issue 4 years ago • 4 comments

When an XML element has multiple sub-elements of the same name, and those sub-elements have attributes, the attribute values get concatenated in a way that makes them impossible to separate.

For example, start with the documentation's published "list of books" example. Add three sub-elements named "extra" to one of the books, each having two attributes (name and value). The following is excerpted from an XML file that I have attached.

<book>
  <author>Mark Twain</author>
  <title>The Adventures of Tom Sawyer</title>
  <category>FICTION</category>
  <year>1876</year>
  <extra name="width" value="6"/>
  <extra name="height" value="10"/>
  <extra name="depth" value="2"/>
</book>

The output for this turns into:

+-----------------------------------------------------------------+------------+---------------------------------+-------------+------+-----------------------------------------+
| attributes | author | title | category | year | authors |
+-----------------------------------------------------------------+------------+---------------------------------+-------------+------+-----------------------------------------+
| {"extra_name":"widthheightdepth","extra_value":"6102"} | Mark Twain | The Adventures of Tom Sawyer | FICTION | 1876 | {} |

It shows only one value for the "extra_name" attributes, which is the concatenation of the names "width", "height", and "depth" into "widthheightdepth". Similarly it only shows one value for the "extra_value" attributes, which is the concatenation of the values "6" "10" and "2" into "6102". Unfortunately it's impossible to know how to separate those concatenated strings.
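To make the problem concrete, here is a minimal Python sketch that mimics the behavior described above. It is not Drill's actual reader code; the helper name `concat_attributes` is made up for illustration, and it simply appends each attribute value to a single string per `<tag>_<attribute>` key.

```python
import json
import xml.etree.ElementTree as ET

BOOK_XML = """
<book>
  <author>Mark Twain</author>
  <title>The Adventures of Tom Sawyer</title>
  <category>FICTION</category>
  <year>1876</year>
  <extra name="width" value="6"/>
  <extra name="height" value="10"/>
  <extra name="depth" value="2"/>
</book>
"""


def concat_attributes(book):
    """Hypothetical helper: keep one string per <tag>_<attribute> key,
    appending every new value to it (the behavior reported in this issue)."""
    attributes = {}
    for child in book:
        for attr_name, attr_value in child.attrib.items():
            key = f"{child.tag}_{attr_name}"
            attributes[key] = attributes.get(key, "") + attr_value
    return attributes


book = ET.fromstring(BOOK_XML)
merged = concat_attributes(book)
print(json.dumps(merged, separators=(",", ":")))
# "6102" could have come from 6/10/2, 61/0/2, 6/1/02, and so on:
# once the values are joined without a delimiter, the boundaries are gone.
```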

I would have expected to see something like one of the following for the attributes output instead, so that the different attribute values are separable:

{{"extra_name":"width","extra_value":"6"},{"extra_name":"height","extra_value":"10"},{"extra_name":"depth","extra_value":"2"}}

or

{"extra_name":["width","height","depth"],"extra_value":["6","10","2"]}
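As a rough illustration of the two shapes I have in mind, the following Python sketch builds both from the same element. The function names are hypothetical, and the first shape is written as a JSON array of objects, which is my assumption about how a list of records would be serialized.

```python
import json
import xml.etree.ElementTree as ET

BOOK_XML = """
<book>
  <author>Mark Twain</author>
  <extra name="width" value="6"/>
  <extra name="height" value="10"/>
  <extra name="depth" value="2"/>
</book>
"""


def attributes_as_records(book, tag):
    """One object per repeated sub-element (hypothetical helper)."""
    return [
        {f"{tag}_{name}": value for name, value in child.attrib.items()}
        for child in book.findall(tag)
    ]


def attributes_as_columns(book, tag):
    """One list per attribute name, in document order (hypothetical helper)."""
    columns = {}
    for child in book.findall(tag):
        for name, value in child.attrib.items():
            columns.setdefault(f"{tag}_{name}", []).append(value)
    return columns


book = ET.fromstring(BOOK_XML)
records = attributes_as_records(book, "extra")
columns = attributes_as_columns(book, "extra")
print(json.dumps(records, separators=(",", ":")))
print(json.dumps(columns, separators=(",", ":")))
```

Either shape keeps the values separable; the second one stays closer to the current single `attributes` map.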

Desktop (please complete the following information):

  • OS: Windows 10
  • Browser: N/A
  • Version: 1.19.0

books-multiple-extras.xml.txt

KendraKrat avatar Sep 02 '21 19:09 KendraKrat

@KendraKrat Thanks for reporting this. The issue here is that Drill is using a streaming reader and doesn't know the schema in advance. Drill sees the first field and interprets it as an empty VARCHAR field with two attributes. Then it sees the next field with the same name, extra, and the same attributes, and has no way to determine the intent of the data.

I would actually argue that this isn't a great way to format XML, but often we're stuck with what the data provider gives us, so it's a moot point.

I've thought about adding list support for the XML reader, which would partially address this; however, the real fix would be to add provided schema and XSD support. That way you can explicitly tell Drill what to expect in terms of schema.
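To sketch why a provided schema helps a streaming reader, here is a small Python illustration. It is not Drill code and does not use Drill's schema syntax; the `repeated` set is a stand-in for whatever a provided schema or XSD would declare. Because the reader knows up front that `extra` repeats, it can open a list the first time it sees the element instead of guessing.

```python
import io
import xml.etree.ElementTree as ET

BOOK_XML = """
<book>
  <author>Mark Twain</author>
  <title>The Adventures of Tom Sawyer</title>
  <category>FICTION</category>
  <year>1876</year>
  <extra name="width" value="6"/>
  <extra name="height" value="10"/>
  <extra name="depth" value="2"/>
</book>
"""


def read_book(xml_text, repeated):
    """Stream one <book> and build a row.

    `repeated` is a hypothetical stand-in for a provided schema: the set of
    element names that may occur more than once and should become lists.
    """
    row = {}
    source = io.BytesIO(xml_text.strip().encode("utf-8"))
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "book":
            continue  # the enclosing record itself, not a field
        if elem.tag in repeated:
            # Declared as repeated: append one attribute map per occurrence.
            row.setdefault(elem.tag, []).append(dict(elem.attrib))
        else:
            # Not declared as repeated: treat as a scalar text field.
            row[elem.tag] = elem.text
    return row


row = read_book(BOOK_XML, repeated={"extra"})
print(row["extra"])
```

Without the `repeated` hint, the reader would have to decide on the first `extra` whether it is a scalar or a list, which is the ambiguity described above.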

cgivre avatar Sep 06 '21 14:09 cgivre

@cgivre I completely agree on all points. Schema/XSD support would be best, plus one for that suggestion! I also agree that I don't like the format of this XML example; however, as you pointed out, this is the data I have to work with which was produced elsewhere and I don't have control over the format. :/

KendraKrat avatar Sep 10 '21 20:09 KendraKrat

@KendraKrat I don't know if you're still following this, but I am working on extending the XML reader to accept XSD and provided schemata. I don't think it will make it for this release, but it shouldn't be too long.

cgivre avatar Dec 26 '21 15:12 cgivre

That will also fix the other bug you reported.

cgivre avatar Dec 26 '21 15:12 cgivre