xml2 icon indicating copy to clipboard operation
xml2 copied to clipboard

The GoT challenge

Open jennybc opened this issue 5 years ago • 7 comments

Transferring from Slack convo.

The GoT data in repurrrsive comes from an API but for teaching purposes, after I wrangle it into the list that ships with the package, I transform it back into JSON and XML files that also ship with the package.

The current script below data-raw/ uses an experimental branch of xml2 I created before xml2 had a nice function for this. I implemented as_xml() and the moral equivalent in xml2 today is as_xml_document(). But I can't just drop it in there.

https://github.com/jennybc/repurrrsive/blob/7ecc79f9a6b1925996ec20d78f0291d91451db31/data-raw/got_pov-chars-wrangle.R#L5-L7

https://github.com/jennybc/repurrrsive/blob/7ecc79f9a6b1925996ec20d78f0291d91451db31/data-raw/got_pov-chars-wrangle.R#L116-L120

> got_chars %>%
+   #xml2:::as_xml() %>%
+   as_xml_document() %>% 
+   as.character() %>% ## workaround to get explicit UTF-8 encoding
+   read_xml() %>%
+   write_xml(here("inst", "extdata", "got_chars.xml"))
Error: Root nodes must be of length 1

I haven't looked at this lately but I think there's either a buglet in as_xml_document() or a UI/documentation gap. It would be nice if I could reproduce the XML already in repurrrsive with current xml2 functions. This is a little detective adventure for someone.

jennybc avatar May 03 '19 15:05 jennybc

Progress on this issue could take many different forms, including:

  • Reproduce the problem for yourself and construct a better, smaller reprex and diagnosis. What's different about the GoT example and the working examples and tests for as_xml_document()?
  • Identify the usage problem in the above code and suggest docs or examples to help people avoid same mistake
  • Find an actual bug
  • Fix that bug

jennybc avatar May 03 '19 16:05 jennybc

@jennybc, I'm confused about the labels. Is this issue related to a specific event or anyone can have a go at it anytime?

Bisaloo avatar May 06 '19 09:05 Bisaloo

@Bisaloo you are welcome to look into it anytime if you like!

But if it is not addressed by the next tidy-dev-day we wanted to mark it as one people could try then.

jimhester avatar May 06 '19 11:05 jimhester

I've created a reprex for this issue using the example from the demo of as_xml() @jennybc did back then.

library(xml2)

y_list <- list(
  setNames(letters[1:4], letters[1:4]),
  two = list(
    setNames(1:4 * 1.1, month.abb[1:4]),
    hey = c(TRUE, FALSE)
  ),
  three = NULL
)
attr(y_list$two, "attrib1") <- "hi"
attr(y_list$two, "attrib2") <- "hey"

y_xml <- as_xml_document(y_list)
#> Error: Root nodes must be of length 1

Created on 2019-07-08 by the reprex package (v0.3.0)

R3myG avatar Jul 08 '19 12:07 R3myG

I've created a reprex for this issue using the example from the demo of as_xml() @jennybc did back then.

library(xml2)

y_list <- list(
  setNames(letters[1:4], letters[1:4]),
  two = list(
    setNames(1:4 * 1.1, month.abb[1:4]),
    hey = c(TRUE, FALSE)
  ),
  three = NULL
)
attr(y_list$two, "attrib1") <- "hi"
attr(y_list$two, "attrib2") <- "hey"

y_xml <- as_xml_document(y_list)
#> Error: Root nodes must be of length 1

Created on 2019-07-08 by the reprex package (v0.3.0)

A way around the issue above is by enclosing y_list into a list but then it hits another error

library(xml2)

y_list <- list(
  setNames(letters[1:4], letters[1:4]),
  two = list(
    setNames(1:4 * 1.1, month.abb[1:4]),
    hey = c(TRUE, FALSE)
  ),
  three = NULL
)
attr(y_list$two, "attrib1") <- "hi"
attr(y_list$two, "attrib2") <- "hey"

y_xml <- as_xml_document(list(y_list = y_list))
#> Error in node_append_content(x$node, value): Expecting a single string value: [type=character; extent=4].

Created on 2019-07-08 by the reprex package (v0.3.0)

R3myG avatar Jul 08 '19 13:07 R3myG

After much digging, it turns out as_xml_document() cannot handle the case where the list has content in it while as_xml() does. This is consistent with the documentation of the function:

This turns an R list into the equivalent XML document. Not all R lists will produce valid XML, in particular there can only be one root node and all child nodes need to be named (or empty) lists. R attributes become XML attributes and R names become XML node names.

library(xml2)

y_list_empty <- list(
  one = list(),
  two = list(),
  three = list()
)
attr(y_list_empty$two, "attrib1") <- "hi"
attr(y_list_empty$two, "attrib2") <- "hey"

y_xml_empty <- as_xml_document(list(y_list_empty = y_list_empty))
y_xml_empty
#> {xml_document}
#> <y_list_empty>
#> [1] <one/>
#> [2] <two attrib1="hi" attrib2="hey"/>
#> [3] <three/>

y_list_full <- list(
  one = list(setNames(letters[1:4], letters[1:4])),
  two = list(),
  three = list()
)
attr(y_list_full$two, "attrib1") <- "hi"
attr(y_list_full$two, "attrib2") <- "hey"

y_xml_full <- as_xml_document(list(y_list_full = y_list_full))
#> Error in node_append_content(x$node, value): Expecting a single string value: [type=character; extent=4].

Created on 2019-07-08 by the reprex package (v0.3.0)

That very last case can be handled by as_xml() with the result below:

> as_xml(y_list_full)
{xml_document}
<root>
[1] <one>\n  <elem>\n    <a>a</a>\n    <b>b</b>\n    <c>c</c>\n    <d>d</d>\n  </elem>\n</one>
[2] <two attrib1="hi" attrib2="hey"/>
[3] <three/>

In my opinion, it needs to be discussed whether or not as_xml_document() is supposed to handle this scenario or not. If yes, then its capabilities need to be expanded to match the capabilities of as_xml() from @jennybc or if populating the xml file with elements is the job of another function.

R3myG avatar Jul 08 '19 14:07 R3myG

Thanks @R3myG for pushing this investigation along.

jennybc avatar Jul 08 '19 15:07 jennybc