
xml2 read_html removes closing tags from JSON-LD when using a single option

Open sbha opened this issue 3 years ago • 2 comments

xml2::read_html(x) returns the HTML within a linked data JSON object as expected:

library(xml2)
library(magrittr)
library(rvest)

test_ld <- '<script type="application/ld+json">{"@context":"http://schema.org","@type":"ReproducibleExample", "description":"<p><strong>text within tags</strong>text after closing tag</p>"'

# tags preserved
test_ld %>% 
  read_html() %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()

[1] "<script type=\"application/ld+json\">{\"@context\":\"http://schema.org\",\"@type\":\"ReproducibleExample\", \"description\":\"<p><strong>text within tags</strong>text after closing tag</p>\"</script>"

Here description contains the HTML <p><strong>text within tags</strong>text after closing tag</p>, with every closing tag intact.

But with xml2::read_html(x, options = 'HUGE'), or with any other single option (I've tested five or six), the closing tags are stripped from the HTML text inside the JSON-LD object.

# tags removed
test_ld %>% 
  read_html(options = 'HUGE') %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()

# removed
test_ld %>% 
  read_html(options = "NOBLANKS") %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()

# removed
test_ld %>% 
  read_html(options = '') %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()

# all return:
[1] "<script type=\"application/ld+json\">{\"@context\":\"http://schema.org\",\"@type\":\"ReproducibleExample\", \"description\":\"<p><strong>text within tagstext after closing tag\"</script

description now becomes <p><strong>text within tagstext after closing tag

I need to set options for some of the HTML I'm parsing. Is it possible to set options and still preserve well-formed HTML inside a linked data object?

sbha avatar Oct 03 '22 12:10 sbha

If multiple options are set the HTML is correct:

test_ld %>% 
  read_html(options = c("RECOVER", "NOERROR", "NOBLANKS")) %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()

# or
test_ld %>% 
  read_html(options = c("HUGE", "RECOVER")) %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()


[1] "<script type=\"application/ld+json\">{\"@context\":\"http://schema.org\",\"@type\":\"ReproducibleExample\", \"description\":\"<p><strong>text within tags</strong>text after closing tag</p>\"</script>"

description is as it should be <p><strong>text within tags</strong>text after closing tag</p>
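One pattern in the examples so far: both working calls include "RECOVER" (the first is read_html()'s documented default, c("RECOVER", "NOERROR", "NOBLANKS")), while every failing call leaves it out. The sketch below is one way to check whether "RECOVER" is the deciding factor rather than the number of options. It is an assumption to be tested, not a confirmed diagnosis; script_with() and option_sets are hypothetical names introduced only for this example.

```r
library(xml2)

test_ld <- '<script type="application/ld+json">{"@context":"http://schema.org","@type":"ReproducibleExample", "description":"<p><strong>text within tags</strong>text after closing tag</p>"'

# Hypothetical helper: parse test_ld with the given options and return the
# serialized JSON-LD <script> node, or NA if parsing fails outright.
script_with <- function(options) {
  tryCatch({
    doc <- read_html(test_ld, options = options)
    node <- xml_find_first(doc, '//script[@type="application/ld+json"]')
    as.character(node)
  }, error = function(e) NA_character_)
}

# Option sets chosen to separate "how many options" from "is RECOVER present".
option_sets <- list(
  single_recover   = "RECOVER",             # one option, with RECOVER
  multi_no_recover = c("HUGE", "NOBLANKS"), # several options, no RECOVER
  huge_recover     = c("HUGE", "RECOVER")   # combination reported as working
)

results <- vapply(option_sets, script_with, character(1))

# TRUE where the closing </strong> tag survived inside the script text.
has_closing <- grepl("</strong>", results, fixed = TRUE)
names(has_closing) <- names(results)
print(has_closing)
```

If single_recover keeps the tags and multi_no_recover drops them, a practical workaround would be to always add "RECOVER" to whatever other option is needed, for example options = c("RECOVER", "HUGE").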

sbha avatar Oct 03 '22 13:10 sbha

I'm not sure there's much we can do here, but I'm leaving this open because I suspect something is going wrong in the way we pass the options from R to C.

hadley avatar Oct 30 '23 18:10 hadley