xmlconvert
Reading XML with both attributes and field elements
Hi,
I am trying to use the xmlconvert function xml_to_df to read an XML file that stores properties both in attributes and in child elements of their own. A simplified example:
<xml>
  <country id="US" name="United States of America">
    <property name="currency" value="USD"/>
  </country>
</xml>
I manage to recover the attributes (id, name) of each country, but not to list the currency as well.
To complete my example, I also have nested (hierarchical) XML elements that I am not interested in extracting, but they seem to interfere with the extraction. Extending my previous example:
<xml>
  <country id="US" name="United States of America">
    <property name="currency" value="USD"/>
    <city id="NYC" name="New York">
      <property name="area" value="121260"/>
      <property name="population" value="8175133"/>
    </city>
  </country>
</xml>
Hi,
Thanks for the remarks and the example.
So far, xmlconvert cannot deal with cases in which both the field name and the field's value are attributes.
If you do
xmlconvert::xml_to_df("test.xml", records.tags = "country",
                      fields = "tags", field.names = "name")
then you get:
  currency                New.York
1       NA |property~||property~|
The reason for the 'currency' NA is that in our call of xml_to_df() we say our fields are represented by tags and their names are given by their name attribute. But the tag identified by attribute name = 'currency' has no value. Instead the value is represented by another attribute, value, while xml_to_df() expects the value to be the value of the tag. This results in NA.
The second 'field', New.York, comes from the fact that the city (a direct descendant of our record element country) happens to have a name attribute, as well. So xml_to_df() thinks: Great, here is another field! It then retrieves the 'value' of that 'field', too, and that is the flattened hierarchy below city. This can be avoided by using the only.fields argument to specify the fields that will be extracted:
xmlconvert::xml_to_df("test.xml", records.tags = "country",
                      fields = "tags", field.names = "name",
                      only.fields = "currency")
resulting in:
  currency
1       NA
I know, this is still not what you want, but it is due to the fact that xml_to_df() so far has no way of working with data where a field name/field value combination is represented by two different attributes of the same tag.
But this is an exciting feature that I will add to the next version of xmlconvert which will presumably be released around Christmas / New Year.
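In the meantime, one possible workaround is to bypass xml_to_df() for this file and read the attributes directly with the xml2 package. The following is only a minimal sketch under a few assumptions: xml2 is installed, the document is inlined as a string instead of being read from test.xml, and the variable names (countries, rows, countries_df) are made up for illustration. It is not how xmlconvert works internally.

```r
library(xml2)

# The example document from this thread, inlined so the sketch is self-contained.
doc <- read_xml('<xml>
  <country id="US" name="United States of America">
    <property name="currency" value="USD"/>
    <city id="NYC" name="New York">
      <property name="area" value="121260"/>
      <property name="population" value="8175133"/>
    </city>
  </country>
</xml>')

# Every <country> element is treated as one record.
countries <- xml_find_all(doc, "//country")

rows <- lapply(countries, function(country) {
  # "./property" selects direct children only, so the <property>
  # elements nested under <city> are ignored.
  props <- xml_find_all(country, "./property")

  # Combine the country's own attributes with one field per property:
  # the "name" attribute becomes the column name, "value" the cell value.
  fields <- c(
    list(id = xml_attr(country, "id"), name = xml_attr(country, "name")),
    setNames(as.list(xml_attr(props, "value")), xml_attr(props, "name"))
  )
  as.data.frame(fields, stringsAsFactors = FALSE)
})

# Stack the one-row data frames into a single data frame.
countries_df <- do.call(rbind, rows)
print(countries_df)
```

For the example above this should give a single row with the columns id, name and currency. Note that rbind() assumes every country carries the same set of properties; if that does not hold for the real file, the rows would have to be padded with NA first.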
Best, Joachim
Hi, thank you for your analysis and reply. I am aware that my XML file is far from clean. I had not even realised that storing a value in an attribute rather than as the tag's own value would be a problem.
I would be happy to try and test the new functionality! Best, Pierre