
`vec_c` and class `pseries` from pkg `plm`

Open tappek opened this issue 2 years ago • 2 comments

I noticed a strange issue when dplyr is loaded while implementing a subsetting method for class pseries in pkg plm. I think I boiled it down to vctrs::vec_c, hence I post here (I came from this line in dplyr, but it seems irrelevant for the topic: https://github.com/tidyverse/dplyr/blob/04454209ea069939d3335c43846c85c725547a89/R/lead-lag.R#L72; the issue is triggered by dplyr clobbering base R's lag, see https://github.com/tidyverse/dplyr/issues/1586, https://github.com/tidyverse/dplyr/issues/2195, but that is not my point here).

A pseries is built on top of a vector or factor; the class attribute is c("pseries", "<basic_class>") where <basic_class> = numeric, integer, ..., factor. A pseries features an index attribute with as many rows as the pseries has elements (a data.frame with two factors and additional class c("pindex", "data.frame")), which needs to be subset in the same manner.

pseries have been around for long without a subsetting method, so subsetting dispatched to base R subsetting for vectors/factor, thus removing all pseries features. I aim to implement a pseries subsetting method to make the class "more complete" in the sense that subsetting a pseries results in a pseries.
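To make the starting point concrete, here is a minimal sketch in base R only. The object `toy` is a hand-built stand-in for a pseries (it is not created by plm), using the first three values of the example below. With no `[.pseries` method defined, the default `[` keeps the names but drops both the class and the index attribute.

```r
# Toy stand-in for a pseries; built by hand, no plm needed.
index <- data.frame(
  firm = factor(c(1, 1, 1)),
  year = factor(c(1935, 1936, 1937))
)
class(index) <- c("pindex", "data.frame")

toy <- structure(
  c(`1-1935` = 317.6, `1-1936` = 391.8, `1-1937` = 410.6),
  index = index,
  class = c("pseries", "numeric")
)

# No `[.pseries` exists here, so base R's default `[` does the work.
sub <- toy[c(1, 3)]

inherits(sub, "pseries")     # the pseries class is gone
is.null(attr(sub, "index"))  # the index attribute is gone
names(sub)                   # only the names survive
```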

vec_c makes many calls to [.pseries, which seems strange to me, as do the inputs passed to [.pseries. The result also looks like a mixture of the various subsetting calls: the index attribute seems to come from one subsetting step (subset by integer()), but it does not match the numeric part returned, which seems to come from another call to [.pseries (subset by 1:3 for a 3-entry vector, as in the example below).

Here is a reproducible example with dev version 2.4-1.99999 of plm rev 1312 and a reduced and debugging enabled [.pseries method hooked in:

library(plm) # 2.4-1.99999 / rev. 1312 as provided in link above
data("Grunfeld", package = "plm")
pGrunfeld <- pdata.frame(Grunfeld)
pser_num <- pGrunfeld$inv # class is c("pseries", "numeric")

`[.pseries` <- function(x, ...) {
  ## not fully sane, reduced to illustrate
  # debug printing:
  print("[.pseries executed with input:")
  cat("\n")
  print("x = ")
  print(x)
  dots <- list(...)
  cat("\n")
  print("ellipsis: ")
  print(dots)
  cat("\n")
  
  # save index, to be subset and attached later on  
  ix <- attr(x, "index")
  
  # handle names, also used to identify the rows to be subset in the index
  names_orig <- names(x)
  keep_ix_rownr <- seq_along(x) # full length row numbers original pseries
  names(keep_ix_rownr) <- names_orig
  
  if(is.null(names_orig)) {
    # if no names are present, set names as integer sequence to identify
    # rows to keep in index later
    names(x) <- keep_ix_rownr
    names(keep_ix_rownr) <- keep_ix_rownr
  }
  
  # remove pseries features to dispatch to base R subsetting
  attr(x, "index") <- NULL
  class(x) <- setdiff(class(x), "pseries")
  result <- x[...] # actual subsetting
  
  keep_ix_rownr <- keep_ix_rownr[names(result)]
  if(is.null(names_orig)) names(result) <- NULL # if no names were present, null names in result
  
  # Subset index accordingly:
  ix <- ix[keep_ix_rownr, ]
  ix <- droplevels(ix)
  
  # restore pseries features: class and subset index
  class(result) <- c("pseries", class(result))
  attr(result, "index") <- ix
  return(result)
}

# hook in [.pseries, overwriting the one originally in the dev version of the package
assignInNamespace("[.pseries", `[.pseries`, envir = as.environment("package:plm"))

pser_num <- pser_num[1:3] # make short to ease reading

pser_num_vec_c1 <- vctrs::vec_c(pser_num)     # [.pseries executed 6x, strange inputs
pser_num_vec_c2 <- vctrs::vec_c(NA, pser_num) # [.pseries executed even 14x
str(pser_num_vec_c1) # attr. index present but is destroyed (0-row data.frame)

##### str output (stripped)
##### the 0-row data.frame seems to result from a subsetting by integer()
## 'pseries' Named num [1:3] 318 392 411
## - attr(*, "index")=Classes ‘pindex’ and 'data.frame':	0 obs. of  2 variables:
## ..$ firm: Factor w/ 0 levels:       
## ..$ year: Factor w/ 0 levels: 
##  - attr(*, "names")= chr [1:3] "1-1935" "1-1936" "1-1937"

Any ideas?

Another thing I noticed is that vec_c seems too strict.

### This seems too strict:
vctrs::vec_c(1.1, pser_num)
# Error: Can't combine `..1` <double> and `..2` <pseries>.
### ... because:
class(pser_num)    # c("pseries", "numeric")
typeof(pser_num) # double

Sessioninfo:

> devtools::session_info()
- Session info -----------------------------------------------------------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 4.1.1 (2021-08-10)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  German_Germany.1252         
 ctype    German_Germany.1252         
 tz       Europe/Berlin               
 date     2021-09-05                  

- Packages ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 package       * version     date       lib source                         
 assertthat      0.2.1       2019-03-21 [1] CRAN (R 4.1.0)                 
 backports       1.2.1       2020-12-09 [1] CRAN (R 4.1.0)                 
 base64enc       0.1-3       2015-07-28 [1] CRAN (R 4.1.0)                 
 bdsmatrix       1.3-4       2020-01-13 [1] CRAN (R 4.1.0)                 
 boot            1.3-28      2021-05-03 [2] CRAN (R 4.1.1)                 
 broom           0.7.9       2021-07-27 [1] CRAN (R 4.1.0)                 
 cachem          1.0.6       2021-08-19 [1] CRAN (R 4.1.1)                 
 callr           3.7.0       2021-04-20 [1] CRAN (R 4.1.0)                 
 checkmate       2.0.0       2020-02-06 [1] CRAN (R 4.1.0)                 
 cli             3.0.1       2021-07-17 [1] CRAN (R 4.1.0)                 
 cluster         2.1.2       2021-04-17 [2] CRAN (R 4.1.1)                 
 collapse        1.6.5       2021-07-24 [1] CRAN (R 4.1.0)                 
 colorspace      2.0-2       2021-06-24 [1] CRAN (R 4.1.0)                 
 crayon          1.4.1       2021-02-08 [1] CRAN (R 4.1.0)                 
 data.table      1.14.0      2021-02-21 [1] CRAN (R 4.1.0)                 
 DBI             1.1.1       2021-01-15 [1] CRAN (R 4.1.0)                 
 desc            1.3.0       2021-03-05 [1] CRAN (R 4.1.0)                 
 devtools        2.4.2       2021-06-07 [1] CRAN (R 4.1.0)                 
 digest          0.6.27      2020-10-24 [1] CRAN (R 4.1.0)                 
 dplyr           1.0.7       2021-06-18 [1] CRAN (R 4.1.0)                 
 dreamerr        1.2.3       2020-12-05 [1] CRAN (R 4.1.0)                 
 ellipsis        0.3.2       2021-04-29 [1] CRAN (R 4.1.0)                 
 evaluate        0.14        2019-05-28 [1] CRAN (R 4.1.0)                 
 fansi           0.5.0       2021-05-25 [1] CRAN (R 4.1.0)                 
 fastmap         1.1.0       2021-01-25 [1] CRAN (R 4.1.0)                 
 fixest          0.10.0      2021-08-31 [1] Github (lrberge/fixest@9cdd106)
 foreign         0.8-81      2020-12-22 [2] CRAN (R 4.1.1)                 
 Formula         1.2-4       2020-10-16 [1] CRAN (R 4.1.0)                 
 fs              1.5.0       2020-07-31 [1] CRAN (R 4.1.0)                 
 gdata           2.18.0      2017-06-06 [1] CRAN (R 4.1.0)                 
 generics        0.1.0       2020-10-31 [1] CRAN (R 4.1.0)                 
 ggplot2         3.3.5       2021-06-25 [1] CRAN (R 4.1.0)                 
 glue            1.4.2       2020-08-27 [1] CRAN (R 4.1.0)                 
 gridExtra       2.3         2017-09-09 [1] CRAN (R 4.1.0)                 
 gtable          0.3.0       2019-03-25 [1] CRAN (R 4.1.0)                 
 gtools          3.9.2       2021-06-06 [1] CRAN (R 4.1.0)                 
 Hmisc           4.5-0       2021-02-28 [1] CRAN (R 4.1.0)                 
 htmlTable       2.2.1       2021-05-18 [1] CRAN (R 4.1.0)                 
 htmltools       0.5.2       2021-08-25 [1] CRAN (R 4.1.1)                 
 htmlwidgets     1.5.3       2020-12-10 [1] CRAN (R 4.1.0)                 
 jpeg            0.1-9       2021-07-24 [1] CRAN (R 4.1.0)                 
 knitr           1.33        2021-04-24 [1] CRAN (R 4.1.0)                 
 lattice         0.20-44     2021-05-02 [2] CRAN (R 4.1.1)                 
 latticeExtra    0.6-29      2019-12-19 [1] CRAN (R 4.1.0)                 
 lfe             2.8-7       2021-07-31 [1] CRAN (R 4.1.0)                 
 lifecycle       1.0.0       2021-02-15 [1] CRAN (R 4.1.0)                 
 lme4            1.1-27.1    2021-06-22 [1] CRAN (R 4.1.0)                 
 lmtest          0.9-38      2020-09-09 [1] CRAN (R 4.1.0)                 
 magrittr        2.0.1       2020-11-17 [1] CRAN (R 4.1.0)                 
 MASS            7.3-54      2021-05-03 [2] CRAN (R 4.1.1)                 
 Matrix          1.3-4       2021-06-01 [2] CRAN (R 4.1.1)                 
 maxLik          1.5-2       2021-07-26 [1] CRAN (R 4.1.0)                 
 memoise         2.0.0       2021-01-26 [1] CRAN (R 4.1.0)                 
 mice            3.13.0      2021-01-27 [1] CRAN (R 4.1.0)                 
 minqa           1.2.4       2014-10-09 [1] CRAN (R 4.1.0)                 
 miscTools       0.6-26      2019-12-08 [1] CRAN (R 4.1.0)                 
 munsell         0.5.0       2018-06-12 [1] CRAN (R 4.1.0)                 
 nlme            3.1-152     2021-02-04 [2] CRAN (R 4.1.1)                 
 nloptr          1.2.2.2     2020-07-02 [1] CRAN (R 4.1.0)                 
 nnet            7.3-16      2021-05-03 [2] CRAN (R 4.1.1)                 
 numDeriv        2016.8-1.1  2019-06-06 [1] CRAN (R 4.1.0)                 
 pillar          1.6.2       2021-07-29 [1] CRAN (R 4.1.0)                 
 pkgbuild        1.2.0       2020-12-15 [1] CRAN (R 4.1.0)                 
 pkgconfig       2.0.3       2019-09-22 [1] CRAN (R 4.1.0)                 
 pkgload         1.2.1       2021-04-06 [1] CRAN (R 4.1.0)                 
 plm           * 2.4-1.99999 2021-09-04 [1] R-Forge (R 4.1.1)              
 png             0.1-7       2013-12-03 [1] CRAN (R 4.1.0)                 
 prettyunits     1.1.1       2020-01-24 [1] CRAN (R 4.1.0)                 
 processx        3.5.2       2021-04-30 [1] CRAN (R 4.1.0)                 
 ps              1.6.0       2021-02-28 [1] CRAN (R 4.1.0)                 
 purrr           0.3.4       2020-04-17 [1] CRAN (R 4.1.0)                 
 R6              2.5.1       2021-08-19 [1] CRAN (R 4.1.1)                 
 rbibutils       2.2.3       2021-08-09 [1] CRAN (R 4.1.1)                 
 RColorBrewer    1.1-2       2014-12-07 [1] CRAN (R 4.1.0)                 
 Rcpp            1.0.7       2021-07-07 [1] CRAN (R 4.1.0)                 
 RcppArmadillo   0.10.6.0.0  2021-07-16 [1] CRAN (R 4.1.0)                 
 RcppEigen       0.3.3.9.1   2020-12-17 [1] CRAN (R 4.1.0)                 
 Rdpack          2.1.2       2021-06-01 [1] CRAN (R 4.1.0)                 
 remotes         2.4.0       2021-06-02 [1] CRAN (R 4.1.0)                 
 rlang           0.4.11      2021-04-30 [1] CRAN (R 4.1.1)                 
 rmarkdown       2.10        2021-08-06 [1] CRAN (R 4.1.0)                 
 rpart           4.1-15      2019-04-12 [2] CRAN (R 4.1.1)                 
 rprojroot       2.0.2       2020-11-15 [1] CRAN (R 4.1.0)                 
 rsconnect       0.8.24      2021-08-05 [1] CRAN (R 4.1.0)                 
 rstudioapi      0.13        2020-11-12 [1] CRAN (R 4.1.0)                 
 sandwich        3.0-1       2021-05-18 [1] CRAN (R 4.1.0)                 
 scales          1.1.1       2020-05-11 [1] CRAN (R 4.1.0)                 
 sessioninfo     1.1.1       2018-11-05 [1] CRAN (R 4.1.0)                 
 stringi         1.7.4       2021-08-25 [1] CRAN (R 4.1.1)                 
 stringr         1.4.0       2019-02-10 [1] CRAN (R 4.1.0)                 
 survival        3.2-11      2021-04-26 [2] CRAN (R 4.1.1)                 
 testthat        3.0.4       2021-07-01 [1] CRAN (R 4.1.0)                 
 tibble          3.1.4       2021-08-25 [1] CRAN (R 4.1.1)                 
 tidyr           1.1.3       2021-03-03 [1] CRAN (R 4.1.0)                 
 tidyselect      1.1.1       2021-04-30 [1] CRAN (R 4.1.0)                 
 usethis         2.0.1       2021-02-10 [1] CRAN (R 4.1.0)                 
 utf8            1.2.2       2021-07-24 [1] CRAN (R 4.1.0)                 
 vctrs           0.3.8       2021-04-29 [1] CRAN (R 4.1.1)                 
 weights         1.0.4       2021-06-10 [1] CRAN (R 4.1.0)                 
 withr           2.4.2       2021-04-18 [1] CRAN (R 4.1.0)                 
 xfun            0.25        2021-08-06 [1] CRAN (R 4.1.0)                 
 xtable          1.8-4       2019-04-21 [1] CRAN (R 4.1.0)                 
 yaml            2.2.1       2020-02-01 [1] CRAN (R 4.1.0)                 
 zoo             1.8-9       2021-03-09 [1] CRAN (R 4.1.0)

tappek avatar Sep 05 '21 11:09 tappek

With more complex classes like pseries, which have an attribute that must be sliced alongside the core data, package authors generally have to do a little more work to get their class to work correctly with vctrs / the tidyverse.

A lot of this information is in our vignettes: https://vctrs.r-lib.org/articles/s3-vector.html https://vctrs.r-lib.org/articles/type-size.html

vec_slice() is probably a simpler place to start than vec_c(), where you can see that your [ method is being called correctly:

out <- vctrs::vec_slice(pser_num, c(1, 3))
#> [1] "[.pseries executed with input:"
#> 
#> [1] "x = "
#> 1-1935 1-1936 1-1937 
#>  317.6  391.8  410.6 
#> 
#> [1] "ellipsis: "
#> $i
#> [1] 1 3

out
#> 1-1935 1-1937 
#>  317.6  410.6

attributes(out)$index
#>   firm year
#> 1    1 1935
#> 3    1 1937

vec_c() is more complicated. Essentially we get the common type of the inputs, construct an output container based on that common type that has the right length, fill in the data, and then add on any attributes that came with the common type.
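As a rough R-level imitation of those steps (the real work happens in C, and this sketch skips the proxy and restore steps described further down), the following uses only exported vctrs functions on plain integer and double inputs. The variable names are made up for illustration.

```r
library(vctrs)

inputs <- list(1L, c(2.5, 3.5))

# 1. Common type of all inputs (integer and double combine to double).
ptype <- do.call(vec_ptype_common, inputs)

# 2. Output container of the right size, filled with missing values.
sizes <- vapply(inputs, vec_size, integer(1))
out <- vec_init(ptype, sum(sizes))

# 3. Cast each input to the common type and copy it into its slots.
start <- 1L
for (k in seq_along(inputs)) {
  slots <- seq(start, length.out = sizes[[k]])
  out <- vec_assign(out, slots, vec_cast(inputs[[k]], ptype))
  start <- start + sizes[[k]]
}

out
```

For these simple inputs the result should match `vec_c(1L, c(2.5, 3.5))`; for classed inputs such as pseries, the attribute handling discussed below comes on top.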

To get the common type, we take 0-length slices of each input, which is why you are seeing [ being called a few times. With S3 classes that we don't know much about, this is our fallback to obtain a prototype (or ptype) for that input. You can see the ptype with vec_ptype():

# it does retain the pseries class even though it says "numeric(0)"
vctrs::vec_ptype(pser_num)
#> named numeric(0)

attributes(vctrs::vec_ptype(pser_num))$index
#> [1] firm year
#> <0 rows> (or 0-length row.names)

When vec_c() has 1 input, this is the common type, so then we build up an output container from this using vec_init():

out <- vctrs::vec_init(vctrs::vec_ptype(pser_num), 5)
out
#> <NA> <NA> <NA> <NA> <NA> 
#>   NA   NA   NA   NA   NA

attributes(out)$index
#>      firm year
#> NA   <NA> <NA>
#> NA.1 <NA> <NA>
#> NA.2 <NA> <NA>
#> NA.3 <NA> <NA>
#> NA.4 <NA> <NA>

Before filling up this output container, we "proxy" it and all of the inputs. Proxying generates an alternative representation of the container that contains basic atomic R types that are easily fillable at the C level. After filling, we finalize the result by "restoring" the proxy back to the original type.

By default, the proxy doesn't do anything for S3 classes we don't know about, but the restore method will copy over the attributes of the original prototype before it was proxied (because they often are static and don't depend on length).

This restore bit is where the issue is for pseries, since it doesn't know not to copy over the index from the original type. We end up copying over the index from the prototype, which has 0 rows.

ptype <- vctrs::vec_ptype(pser_num)

out <- vctrs::vec_init(ptype, 5)
out <- vctrs::vec_proxy(out)
out
#> <NA> <NA> <NA> <NA> <NA> 
#>   NA   NA   NA   NA   NA
attributes(out)$index
#>      firm year
#> NA   <NA> <NA>
#> NA.1 <NA> <NA>
#> NA.2 <NA> <NA>
#> NA.3 <NA> <NA>
#> NA.4 <NA> <NA>

# do the filling of `vec_c()` here

# now restore, copying over `ptype` attributes to `out`
out <- vctrs::vec_restore(out, ptype)

# this would normally have the data from the filling of `vec_c()`
out
#> <NA> <NA> <NA> <NA> <NA> 
#>   NA   NA   NA   NA   NA

# oh no, 0 row attribute
attributes(out)$index
#> [1] firm year
#> <0 rows> (or 0-length row.names)

Since pseries has an attribute that relies on the length and ordering of the input, we'd generally advise creating a vec_proxy() and vec_restore() method to customize these two steps of the process. The proxy could be a two column data frame, where the first column holds the data and the second column holds the index data frame. That way they get sliced and combined together and you don't have to manage them separately. The restoration method would just move the index column back as an attribute.

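vec_proxy.pseries <- function(x, ...) {
  # strip the class so the data is a bare vector
  x <- unclass(x)
  # move the index out of the attributes ...
  index <- attr(x, "index", exact = TRUE)
  attr(x, "index") <- NULL
  # ... and into a data frame column next to the data
  vctrs::data_frame(x = x, index = index)
}

vec_restore.pseries <- function(x, to, ...) {
  # `x` is the data frame proxy: split it back into data and index
  index <- x$index
  x <- x$x
  # re-attach the index and the pseries class
  attr(x, "index") <- index
  class(x) <- c("pseries", class(x))
  x
}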

ptype <- vctrs::vec_ptype(pser_num)

# notice the proxy is a data frame now
out <- vctrs::vec_init(vctrs::vec_ptype(pser_num), 5)
out <- vctrs::vec_proxy(out)
out
#>    x index.firm index.year
#> 1 NA       <NA>       <NA>
#> 2 NA       <NA>       <NA>
#> 3 NA       <NA>       <NA>
#> 4 NA       <NA>       <NA>
#> 5 NA       <NA>       <NA>

# do the filling of `vec_c()` here

# now restore from data frame back to pseries
out <- vctrs::vec_restore(out, ptype)
out
#>                
#> NA NA NA NA NA

attributes(out)$index
#>      firm year
#> ...1 <NA> <NA>
#> ...2 <NA> <NA>
#> ...3 <NA> <NA>
#> ...4 <NA> <NA>
#> ...5 <NA> <NA>

With a proxy and restore method in place, vec_c() would work properly (the odd row names are a technical detail and could be cleaned up):

vctrs::vec_c(pser_num, pser_num)
#> 1-1935 1-1936 1-1937 1-1935 1-1936 1-1937 
#>  317.6  391.8  410.6  317.6  391.8  410.6

attributes(vctrs::vec_c(pser_num, pser_num))$index
#>       firm year
#> 1...1    1 1935
#> 2...2    1 1936
#> 3...3    1 1937
#> 1...4    1 1935
#> 2...5    1 1936
#> 3...6    1 1937
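One way the odd row names mentioned above might be cleaned up is to reset them inside the restore method. The sketch below is a hypothetical variant of the restore method shown earlier; only the `rownames(index) <- NULL` line is new. It is exercised on a hand-built proxy in base R so it can be tried without plm, and it assumes nothing downstream relies on particular row names in the index, which was not checked here.

```r
# Hypothetical variant of the restore method above; only the row-name reset is new.
vec_restore.pseries <- function(x, to, ...) {
  index <- x$index
  rownames(index) <- NULL  # replace names like "1...1" with 1, 2, 3, ...
  x <- x$x
  attr(x, "index") <- index
  class(x) <- c("pseries", class(x))
  x
}

# Hand-built proxy (base R only) carrying the odd row names seen above.
index <- data.frame(
  firm = factor(c(1, 1)),
  year = factor(c(1935, 1936)),
  row.names = c("1...1", "2...2")
)
proxy <- data.frame(x = c(317.6, 391.8))
proxy$index <- index  # data frame column, as in the proxy above

restored <- vec_restore.pseries(proxy, to = NULL)
rownames(attr(restored, "index"))
```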

DavisVaughan avatar Sep 07 '21 15:09 DavisVaughan

Thank you for the in-depth explanation, very instructive! I read a bit in the vignettes before posting. I reckon one's package would need to hard-depend on vctrs to implement all this? Depending on yet another package is what one would typically avoid. I could imagine working around the dependency by supplying my own generic in the package, but I am not sure whether the double-dispatch mechanism would work.
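For reference, one commonly used way to avoid a hard dependency (an assumption here, not something confirmed in this thread) is to list vctrs under Suggests and register the methods at load time only when vctrs is installed. The helper name `register_vctrs_methods` is hypothetical; a package would call something like it from its `.onLoad()` hook. The sketch covers only the single-dispatch generics vec_proxy and vec_restore and does not address the double-dispatch question.

```r
# Hypothetical helper; a package could call this from .onLoad().
register_vctrs_methods <- function(proxy_method, restore_method) {
  # Do nothing if vctrs is not installed, so vctrs can stay in Suggests.
  if (!requireNamespace("vctrs", quietly = TRUE)) {
    return(invisible(FALSE))
  }
  ns <- asNamespace("vctrs")
  # Register the methods against the generics defined in vctrs.
  registerS3method("vec_proxy", "pseries", proxy_method, envir = ns)
  registerS3method("vec_restore", "pseries", restore_method, envir = ns)
  invisible(TRUE)
}

# Usage with placeholder methods (a real package would pass its own):
registered <- register_vctrs_methods(
  proxy_method = function(x, ...) x,
  restore_method = function(x, to, ...) x
)
```

The return value only reports whether registration took place, so the calling package can continue to work without vctrs.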

I would not expect vec_c to return the "correct" result for a pseries. Intuitively, I assumed vec_c would fall back to base c if an unknown class or a base R class is encountered (in the first or a later entry in the class attribute, as in c("pseries", "numeric")). That is also my reading of ?vec_c: If inputs inherit from a common class hierarchy, vec_c() falls back to base::c() if there exists a c() method implemented for this class hierarchy.

Wouldn't mimicking base R behaviour be what most would assume (as we are used to losing attributes etc.)?

tappek avatar Sep 09 '21 19:09 tappek

Intuitively, I assumed vec_c would fall back to base c if an unknown class or a base R class is encountered (in the first or a later entry in the class attribute, as in c("pseries", "numeric")). That is also my reading of ?vec_c

yup but there is no c() method for pseries. So the common type methods must be implemented.

vctrs:::s3_get_method("pseries", "c")
#> NULL
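For illustration, a c() method for a pseries-like class could look roughly like the sketch below. This `c.pseries` is hypothetical (the check above shows that no such method exists), it assumes every input is a pseries with an index of identical columns, and it is tried on hand-built toy objects in base R. Whether vec_c() would then actually take the base::c() fallback for pseries was not verified here.

```r
# Hypothetical c() method: combine the data and row-bind the index data frames.
c.pseries <- function(...) {
  parts <- list(...)
  index <- do.call(rbind, lapply(parts, function(p) attr(p, "index")))
  values <- unlist(lapply(parts, function(p) {
    attr(p, "index") <- NULL
    unclass(p)
  }))
  structure(values, index = index, class = c("pseries", class(values)))
}

# Toy constructor, only for this example (not part of plm).
make_toy <- function(values, firm, year) {
  index <- data.frame(firm = factor(firm), year = factor(year))
  class(index) <- c("pindex", "data.frame")
  structure(values, index = index, class = c("pseries", "numeric"))
}

a <- make_toy(c(317.6, 391.8), firm = c(1, 1), year = c(1935, 1936))
b <- make_toy(410.6, firm = 1, year = 1937)

combined <- c(a, b)           # dispatches to c.pseries
nrow(attr(combined, "index")) # index grows together with the data
```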

lionel- avatar Oct 03 '22 10:10 lionel-