tibble icon indicating copy to clipboard operation
tibble copied to clipboard

Provide rbind() and cbind() methods

Open krlmlr opened this issue 3 years ago • 14 comments

for R >= 4.0.0.

Closes #34.

krlmlr avatar Jul 21 '21 04:07 krlmlr

After #890, do we still want to use vec_rbind() here? Should rbind() (and also dplyr::bind_rows() for that matter) call dplyr_row_slice() to support data frame classes with variable-length attributes?

krlmlr avatar Jul 31 '21 05:07 krlmlr

Tried that, doesn't work: vec_rbind() does much more, e.g. adjustment of factor levels.

I still wonder what the desired semantics for dplyr_reconstruct() are if nrow(x) != nrow(to) .

krlmlr avatar Jul 31 '21 05:07 krlmlr

Docs say "you need to carefully check and think about it", I guess we're fine here.

krlmlr avatar Jul 31 '21 05:07 krlmlr

Just started the revdeps with

library(revdepcheck)

pkgs <- unique(c(cran_revdeps("tidyr"), cran_revdeps("tibble"), cran_revdeps("dplyr")))
cloud_check(revdep_packages = pkgs)

hadley avatar Aug 02 '21 14:08 hadley

Thanks @hadley!

This shows that:

  1. The following packages are using rbind() or cbind() in a way that is incompatible with vec_rbind() and vec_cbind(): - dat - designr - ExpertChoice - ggiraph - ggiraphExtra - groupr - heemod - neonstore - prettyglm - RKorAPClient - VarBundle - visR - vlda - wpa
  2. All other failing packages might be incompatible with vec_rbind() and vec_cbind(), but they also have failing tests.
  3. We don't know anything yet about the packages we failed to check.

Do we have an easy way to send a message to all affected maintainers?

krlmlr avatar Aug 07 '21 04:08 krlmlr

Before we notify the maintainers, I think it would be worth digging into 2-5 of the failures in more detail to see if there's a common underlying cause.

hadley avatar Aug 16 '21 12:08 hadley

We can find the underlying cause by replacing rbind() with vctrs::vec_rbind() one by one in the packages, same with cbind().

egor rbind() failure has the following underlying reason:

library(tibble)

rbind(tibble(.b = integer()), tibble(a = 1, .b = 2))
#> # A tibble: 1 × 2
#>       a    .b
#>   <dbl> <dbl>
#> 1     1     2
vctrs::vec_rbind(tibble(.b = integer()), tibble(a = 1, .b = 2))
#> # A tibble: 1 × 2
#>      .b     a
#>   <dbl> <dbl>
#> 1     2     1

Created on 2021-10-20 by the reprex package (v2.0.1)

krlmlr avatar Oct 20 '21 09:10 krlmlr

ypr checks for row names in its tests:

library(tibble)

row.names(rbind(tibble(a = 1), data.frame(a = c(x = 2))))
#> [1] "1" "x"
row.names(vctrs::vec_rbind(tibble(a = 1), data.frame(a = c(x = 2))))
#> [1] "...1" "x"

Created on 2021-10-20 by the reprex package (v2.0.1)

krlmlr avatar Oct 20 '21 17:10 krlmlr

twoxtwo binds a vector and a tibble:

library(tibble)

base::rbind.data.frame(c("a", "b"), tibble(a = 1, b = 2))
#> # A tibble: 2 × 2
#>   a     b    
#>   <chr> <chr>
#> 1 a     b    
#> 2 1     2
vctrs::vec_rbind(c("a", "b"), tibble(a = 1, b = 2))
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> # A tibble: 2 × 4
#>   ...1  ...2      a     b
#>   <chr> <chr> <dbl> <dbl>
#> 1 a     b        NA    NA
#> 2 <NA>  <NA>      1     2

Created on 2021-10-20 by the reprex package (v2.0.1)

krlmlr avatar Oct 20 '21 17:10 krlmlr

optimall calls cbind() with a vector and relies on auto-naming:

library(tibble)

x <- 2

base:::cbind.data.frame(tibble(a = 1), x)
#>   a x
#> 1 1 2
vctrs::vec_cbind(tibble(a = 1), x)
#> New names:
#> * `` -> ...2
#> # A tibble: 1 × 2
#>       a  ...2
#>   <dbl> <dbl>
#> 1     1     2

Created on 2021-10-20 by the reprex package (v2.0.1)

krlmlr avatar Oct 20 '21 19:10 krlmlr

Especially the last issue looks like a serious incompatibility:

library(tibble)

x <- 2

base:::cbind.data.frame(tibble(a = 1), x)
#>   a x
#> 1 1 2
vctrs::vec_cbind(tibble(a = 1), x)
#> New names:
#> * `` -> ...2
#> # A tibble: 1 × 2
#>       a  ...2
#>   <dbl> <dbl>
#> 1     1     2

Created on 2021-10-20 by the reprex package (v2.0.1)

I suggest the following:

  • separate the changes to rbind() from the changes to cbind()
  • support fine-grained enabling and disabling of the new functionality
    • cbind() calls tbl_cbind() which is provided by us and exported, if and only if getOption("tibble.cbind_base") is unset or FALSE
    • otherwise cbind() forwards to NextMethod()
    • in tbl_cbind() we check another option, "tibble.cbind_trace", which prints diagnostics on the input and output of these methods
    • similarly for rbind()
  • depending on the new revdepcheck results, we might need to take action and tweak tbl_rbind() and tbl_cbind()
  • we schedule the change for tibble 4.0.0 and warn the maintainers well ahead of the release

This allows us and the maintainers to easier locate the source of a problem: we can gradually replace cbind() by tbl_cbind() and set option(tibble.cbind_base = TRUE), and/or set options(Tibble.cbind_trace = TRUE) to understand problems.

tbl_rbind(), tbl_cbind() and the options are implemented in this PR.

Do we still want to pursue this path?

krlmlr avatar Oct 20 '21 20:10 krlmlr

The story with rbind() and cbind() is growing so complex, I'd almost prefer to just leave them as they are, and advise people to use vec_rbind() and vec_cbind() directly instead. Two reasons I feel this is reasonable:

  • I imagine most data analysis focused users will already be using the tidyverse and bind_rows/cols() directly, so they wouldn't need this
  • I imagine that package developers who might not want to depend on dplyr would be fine depending on vctrs, since tibble already Imports it.

I think the important point is the second one. We now at least have a good answer for people who want to be able to combine tibbles without relying on dplyr (i.e. use vctrs).

DavisVaughan avatar Oct 20 '21 21:10 DavisVaughan

Seems like you could link to vec_rbind() in the tibble::add_row() documentation. i.e. "If you would like to row bind multiple data frames together, see..."

Similar for vec_cbind() and tibble::add_column().

DavisVaughan avatar Oct 20 '21 21:10 DavisVaughan

Agreed, it's a lot of work for limited gain. Depends on how badly we want to support rbind() and cbind(), i.e. if we gain other advantages by overriding these methods for tibbles.

krlmlr avatar Oct 22 '21 06:10 krlmlr