xml2 read_html removes closing tags from JSON-LD when using a single option
xml2::read_html(x) returns the HTML within a linked data JSON object as expected:
library(xml2)
library(magrittr)
library(rvest)
test_ld <- '<script type="application/ld+json">{"@context":"http://schema.org","@type":"ReproducibleExample", "description":"<p><strong>text within tags</strong>text after closing tag</p>"'
# tags preserved
test_ld %>%
read_html() %>%
html_node('script[type="application/ld+json"]') %>%
as.character()
[1] "<script type=\"application/ld+json\">{\"@context\":\"http://schema.org\",\"@type\":\"ReproducibleExample\", \"description\":\"<p><strong>text within tags</strong>text after closing tag</p>\"</script>"
Here, description contains the HTML <p><strong>text within tags</strong>text after closing tag</p>
But with xml2::read_html(x, options = 'HUGE'), or with any other single option (I've tested 5 or 6), the closing tags are removed from the HTML text in the JSON-LD object.
# tags removed
test_ld %>%
read_html(options = 'HUGE') %>%
html_node('script[type="application/ld+json"]') %>%
as.character()
# tags removed
test_ld %>%
read_html(options = "NOBLANKS") %>%
html_node('script[type="application/ld+json"]') %>%
as.character()
# tags removed (empty option string)
test_ld %>%
read_html(options = '') %>%
html_node('script[type="application/ld+json"]') %>%
as.character()
# all return:
[1] "<script type=\"application/ld+json\">{\"@context\":\"http://schema.org\",\"@type\":\"ReproducibleExample\", \"description\":\"<p><strong>text within tagstext after closing tag\"</script>"
description now becomes <p><strong>text within tagstext after closing tag (both </strong> and </p> are gone)
Setting options is necessary for some of the HTML I'm parsing. Is it possible to use options and preserve properly formatted HTML from a linked data object?
If multiple options are set, the HTML is correct:
test_ld %>%
read_html(options = c("RECOVER", "NOERROR", "NOBLANKS")) %>%
html_node('script[type="application/ld+json"]') %>%
as.character()
# or
test_ld %>%
read_html(options = c("HUGE", "RECOVER")) %>%
html_node('script[type="application/ld+json"]') %>%
as.character()
[1] "<script type=\"application/ld+json\">{\"@context\":\"http://schema.org\",\"@type\":\"ReproducibleExample\", \"description\":\"<p><strong>text within tags</strong>text after closing tag</p>\"</script>"
description is as it should be: <p><strong>text within tags</strong>text after closing tag</p>
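Both working calls above include RECOVER, and so does the read_html() default of c("RECOVER", "NOERROR", "NOBLANKS"), so these examples don't yet separate "more than one option" from "RECOVER is present". Here is a minimal sketch to tell the two apart, reusing the same example string. The helper name keeps_closing_tags and the particular option sets are made up for illustration, and no expected output is shown because it may depend on the installed libxml2 version:

```r
library(xml2)

# Same reproducible example as above.
test_ld <- '<script type="application/ld+json">{"@context":"http://schema.org","@type":"ReproducibleExample", "description":"<p><strong>text within tags</strong>text after closing tag</p>"'

# Hypothetical helper: TRUE if the closing </strong> tag survives parsing
# with the given options, NA if parsing fails outright.
keeps_closing_tags <- function(options) {
  tryCatch({
    doc  <- read_html(test_ld, options = options)
    node <- xml_find_first(doc, "//script")
    grepl("</strong>", as.character(node), fixed = TRUE)
  }, error = function(e) NA)
}

# Option sets chosen to vary "how many options" and "RECOVER present"
# independently of each other.
option_sets <- list(
  default        = c("RECOVER", "NOERROR", "NOBLANKS"),
  recover_only   = "RECOVER",
  huge_only      = "HUGE",
  noblanks_only  = "NOBLANKS",
  two_no_recover = c("NOERROR", "NOBLANKS"),
  huge_recover   = c("HUGE", "RECOVER")
)

result <- vapply(option_sets, keeps_closing_tags, logical(1))
print(result)
```

If recover_only comes back TRUE and two_no_recover comes back FALSE, the number of options is not the deciding factor, and adding "RECOVER" to whatever option is needed would be the thing to try.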
I'm not sure there's much we can do here, but I'm leaving this open because I suspect something is going wrong in the way we pass the options from R to C.
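For context on what passing the options from R to C involves: libxml2 takes its parser options as a single integer bitmask, so the option names have to be mapped to flag values and combined before the C call. The sketch below illustrates that mechanism only and is not xml2's actual code. The flag values are the ones I believe libxml2's headers define, and option_flags, combine_options and has_flag are hypothetical names:

```r
# Assumed flag values from libxml2's HTMLparser.h / parser.h.
option_flags <- c(
  RECOVER   = 1L,       # 1 << 0
  NOERROR   = 32L,      # 1 << 5
  NOWARNING = 64L,      # 1 << 6
  NOBLANKS  = 256L,     # 1 << 8
  HUGE      = 524288L   # 1 << 19
)

# Hypothetical helper: bitwise-OR the flags for the named options.
combine_options <- function(options) {
  unknown <- setdiff(options, names(option_flags))
  if (length(unknown) > 0) {
    stop("Unknown option(s): ", paste(unknown, collapse = ", "))
  }
  Reduce(bitwOr, option_flags[options], 0L)
}

# Hypothetical helper: is the bit for `option` set in `mask`?
has_flag <- function(mask, option) {
  bitwAnd(mask, option_flags[[option]]) != 0L
}

default_mask <- combine_options(c("RECOVER", "NOERROR", "NOBLANKS"))
huge_mask    <- combine_options("HUGE")

print(default_mask)                       # 1 + 32 + 256 = 289
print(has_flag(default_mask, "RECOVER"))  # TRUE
print(has_flag(huge_mask, "RECOVER"))     # FALSE
```

Under this reading, options = 'HUGE' produces a mask without the RECOVER bit, so a single-option call differs from the default in more than just the option that was named. That would be worth ruling out before looking for a bug in the R-to-C hand-off.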