[FR] How to create an S3 vector (using `{vctrs}`) whose prototype is an `xml_node` object?

Open ramiromagno opened this issue 1 year ago • 1 comments

I'd like to use an xml_nodeset object as a column in a data frame.

Here are my naive attempts:

library(xml2)
library(tibble)
library(dplyr)

x <- read_xml("<parent><child>1</child><child>2<child>3</child></child><manzz>1</manzz></parent>")
xml_nodeset_obj <- xml_children(x)
xml_nodeset_obj_unclassed <- unclass(xml_nodeset_obj)

# Won't work
tibble(xml_nodeset_obj)

tbl <- tibble(child = xml_nodeset_obj_unclassed)

# Won't work either
tbl |>
  mutate(child = xml_find_all(child, xpath = './child'))

# Post tibble creation cheating won't work either :)
class(tbl$child) <- "xml_nodeset"

After reading one of the errors obtained with the code above:

Error in `vec_size()`:
! `x` must be a vector, not a <xml_nodeset> object.
Run `rlang::last_error()` to see where the error occurred.

I now understand that it might be possible to make an xml_nodeset object behave as a vector by following the instructions provided in the S3 vectors vignette, right?

Should I try to implement this myself in a package of my own, or is this functionality desirable in {xml2}?

My objective is to provide similar functionality to tidyjson but for XML data.
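
As a stopgap, the unclassed node set can be kept as an ordinary list column and processed element by element. The following is only a minimal sketch under that assumption, not a replacement for a proper vctrs class; the column names `name` and `n_child` are made up for illustration.

```r
library(xml2)
library(tibble)

x <- read_xml("<parent><child>1</child><child>2<child>3</child></child><manzz>1</manzz></parent>")

# Dropping the class leaves a bare list of xml_node objects.
nodes <- unclass(xml_children(x))

# A bare list counts as a vector for vctrs, so tibble() accepts it.
vctrs::vec_is(nodes)

tbl <- tibble(child = nodes)

# Apply xml2 functions to each node of the list column in turn.
tbl$name <- vapply(tbl$child, xml_name, character(1))
tbl$n_child <- vapply(
  tbl$child,
  function(node) length(xml_find_all(node, "./child")),
  integer(1)
)
```

The cost of this approach is that every operation has to be wrapped in `vapply()` or `lapply()`, which is exactly what a dedicated S3 vector would hide.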

ramiromagno, Jul 31 '22 15:07

Would you be so kind as to provide feedback on my approach to making an S3 vector out of the xml_node type? Bear with me, as I have only just read the S3 vectors vignette, specifically the part on list-of types.

I am specifying the prototype as structure(logical(), class = 'xml_node'), using logical() as a dummy. I am not sure how to do this properly, given that {xml2} does not (?) provide a function to instantiate an xml_node object.

For some reason the tibble is not showing the elements of column x (see below).

The prefix XXX is a placeholder for a hypothetical R package in which the class XXX_xml_node would be registered.

library(xml2)
library(tibble)
library(vctrs)
#> 
#> Attaching package: 'vctrs'
#> The following object is masked from 'package:tibble':
#> 
#>     data_frame

new_XXX_xml_node <- function(x) {
  vctrs::new_list_of(x,
                     ptype = structure(logical(), class = 'xml_node'),
                     class = "XXX_xml_node")
}

XXX_xml_node <- function(x) {
  new_XXX_xml_node(x)
}

vec_ptype_full.XXX_xml_node <- function(x, ...) "XXX_xml_node"
vec_ptype_abbr.XXX_xml_node <- function(x, ...) "xml_node"

as_XXX_xml_node <- function(x, ...) UseMethod("as_XXX_xml_node")

as_XXX_xml_node.xml_node <- function(x, ...) {
  XXX_xml_node(x)
}

as_XXX_xml_node.xml_nodeset <- function(x, ...) {
  XXX_xml_node(unclass(x))
}

as_XXX_xml_node.xml_document <- function(x, ...) {
  # Convert xml_document to a list of xml_node objects
  xx <- unclass(xml2::xml_children(x))
  XXX_xml_node(xx)
}

format.XXX_xml_node <- function(x, ...) {
  desc <-
    encodeString(vapply(x, as.character, FUN.VALUE = character(1)))
  paste0(substr(desc, 1, 20 - 3), "...")
}

obj_print_data.XXX_xml_node <- function(x, ...) {
  if (length(x) == 0)
    return()
  print(format(x), quote = FALSE)
}

# Example application
x <- read_xml("
                <parent>
                  <child>1</child>
                  <child>2</child>
                  <child>3</child>
                  <child>4</child>
                  <child>
                    <grandchildren>5.1</grandchildren>
                    <grandchildren>5.2</grandchildren>
                    <grandchildren>5.3</grandchildren>
                  </child>
                  <child>
                    <grandchildren>6.1</grandchildren>
                    <grandchildren>6.2</grandchildren>
                    <child>6.2</child>
                  </child>
                </parent>")

as_XXX_xml_node(x)
#> <XXX_xml_node[6]>
#> [1] <child>1</child>...   <child>2</child>...   <child>3</child>...  
#> [4] <child>4</child>...   <child>\\n  <grand... <child>\\n  <grand...

(tbl <- tibble::tibble(x = as_XXX_xml_node(x), i = seq_along(x)))
#> # A tibble: 6 × 2
#>            x     i
#>   <xml_node> <int>
#> 1                1
#> 2                2
#> 3                3
#> 4                4
#> 5                5
#> 6                6
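
One thing that might be worth trying for the empty column, assuming the blank cells come from tibble's pillar-based rendering rather than from the format() method: give the class its own pillar_shaft() method that hands the formatted strings to pillar. This is an untested sketch. The helper names `format_XXX_xml_node` and `pillar_shaft_XXX_xml_node` are hypothetical, and the manual registerS3method() call is only there because the code runs outside a package.

```r
library(xml2)
library(vctrs)
library(tibble)

# Same constructor idea as above; the logical() prototype is still a dummy.
new_XXX_xml_node <- function(x) {
  new_list_of(
    x,
    ptype = structure(logical(), class = "xml_node"),
    class = "XXX_xml_node"
  )
}

# Short text summary of each node, mirroring format.XXX_xml_node above.
format_XXX_xml_node <- function(x) {
  desc <- encodeString(vapply(x, as.character, FUN.VALUE = character(1)))
  paste0(substr(desc, 1, 20 - 3), "...")
}

# Hypothetical pillar_shaft() method: pass the formatted strings to pillar.
pillar_shaft_XXX_xml_node <- function(x, ...) {
  pillar::new_pillar_shaft_simple(format_XXX_xml_node(x), align = "left")
}

# Outside a package the method is registered by hand; a package would
# declare it with an S3method() directive in NAMESPACE instead.
registerS3method(
  "pillar_shaft", "XXX_xml_node", pillar_shaft_XXX_xml_node,
  envir = asNamespace("pillar")
)

doc <- read_xml("<parent><child>1</child><child>2</child></parent>")
col <- new_XXX_xml_node(unclass(xml_children(doc)))
tbl <- tibble(x = col, i = seq_along(col))
print(tbl)
```

If the cells are still blank with this method in place, the cause is somewhere other than the shaft, and the dummy prototype would be the next thing to look at.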

ramiromagno, Aug 03 '22 14:08

Closing in favour of #377. I don't have vctrs loaded in my brain at the moment, so I can't offer any concrete feedback on what you tried, but I think we should just do this right in the package so that you and others don't need to worry about it.

hadley, Oct 30 '23 18:10