vctrs
vctrs copied to clipboard
Avoid dispatch when manipulating df proxies
I think I need some help here.
I have defined vec_restore() for the tsibble object, which solves the attribute updating issue when vec_slice() a tsibble. But it breaks bind_rows() that uses vec_rbind() internally.
In the tsibble code, I need to make sure that there's no NA in my index column, and validate_index() does the job. I have no idea when NA has been introduced in the vec_restore(). When I debug through my validate_index(), it will not trigger that error because of no NA.
library(tsibble)
library(vctrs)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:tsibble':
#>
#> interval
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
idx_day <- seq.Date(ymd("2017-01-01"), ymd("2017-01-20"), by = 4)
tsbl <- tsibble(
date = idx_day,
value = rnorm(5),
index = date
)
tsbl
#> # A tsibble: 5 x 2 [4D]
#> date value
#> <date> <dbl>
#> 1 2017-01-01 0.965
#> 2 2017-01-05 0.0740
#> 3 2017-01-09 -1.35
#> 4 2017-01-13 -0.257
#> 5 2017-01-17 -0.814
new_data(tsbl)
#> # A tsibble: 1 x 1 [4D]
#> date
#> <date>
#> 1 2017-01-21
vec_rbind(tsbl, new_data(tsbl))
#> Error: Column `date` (index) must not contain `NA`.
Created on 2020-05-29 by the reprex package (v0.3.0)
The vec_restore() implemention sits in this branch https://github.com/tidyverts/tsibble/tree/vctrs-vec-restore
I think this is a bug on our end. We work on the proxy without removing the class, which means that we then dispatch on all the vctrs methods while working with the proxy (e.g. vec_init(proxy).
We can't really remove the class for all proxies because atomic vectors can't be shallow-duplicated (even with altrep on recent R there is some overhead). We should at least do it with data frames though.
The other solution is to duplicate our internal API with proxy variants that never dispatch.
In the mean time I'll look into stripping the class of data frame subclasses in vec_proxy(). This should fix this issue. Hopefully this doesn't impact revdeps of tidyr and dplyr and we'll be able to include this fix in the upcoming release.
@earowang Can you please add this method in your package while we figure this out?
#' @export
vec_proxy.tbl_ts <- function(x, ...) {
new_data_frame(x)
}
This strips all the attributes of your data frame and prevents dispatching again during internal memory manipulations.
@earowang, I've also been working on vctrs/dplyr compatibility for rsample and dials. One thing that I have found there is that the vec_restore() method will have to decide where x is still a valid tsibble or not, and either:
- fall back to a tibble if it isn't valid
- continue restoring back to a tsibble if it is valid
For example, with the current implementation of vec_restore.tsibble in that branch, it is always assumed that you can restore back to a valid tsibble, which isn't always true. Here's an example where vec_slice() makes duplicate rows. The [ implementation knows how to fall back to a tibble, but the vec_restore() method that gets called at the end of vec_slice() currently always assumes the result is a valid tsibble.
# devtools::install_github("tidyverts/tsibble", ref = "vctrs-vec-restore")
library(vctrs)
library(tsibble)
# falls back to tibble b/c of duplicates in index
pedestrian[c(1, 1, 2),]
#> # A tibble: 3 x 5
#> Sensor Date_Time Date Time Count
#> <chr> <dttm> <date> <int> <int>
#> 1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01 0 1630
#> 2 Birrarung Marr 2015-01-01 00:00:00 2015-01-01 0 1630
#> 3 Birrarung Marr 2015-01-01 01:00:00 2015-01-01 1 826
# oops
vec_slice(pedestrian, c(1, 1, 2))
#> # A tsibble: 3 x 5 [1h] <Australia/Melbourne>
#> # Key: Sensor [1]
#> Sensor Date_Time Date Time Count
#> <chr> <dttm> <date> <int> <int>
#> 1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01 0 1630
#> 2 Birrarung Marr 2015-01-01 00:00:00 2015-01-01 0 1630
#> 3 Birrarung Marr 2015-01-01 01:00:00 2015-01-01 1 826
Created on 2020-05-29 by the reprex package (v0.3.0)
Here are the rsample and dials PRs. I've added a lot of notes that you might find useful: https://github.com/tidymodels/rsample/pull/150 https://github.com/tidymodels/dials/pull/114
The general pattern that is emerging would be for you to have a tsibble_reconstructable(x, to) function that holds all of the logic for deciding when x is still a valid tsibble or not (relative to to, which will be the original valid tsibble before any transformation happened). It returns TRUE if you can reconstruct, otherwise it returns FALSE.
Then you have a tsibble_reconstruct() method that looks like:
tsibble_reconstruct(x, to) {
if (tsibble_reconstructable(x, to)) {
# reconstruct the tsibble
} else {
# "upcast" `x` to a bare tibble
}
}
Then you could consistently use this helper in the dplyr 1.0.0 methods and vec_restore(). For dials and rsample it is as simple as this, but it might not be for tsibble:
vec_restore.tsibble(x, to) {
tsibble_reconstruct(x, to)
}
dplyr_reconstruct.tsibble(x, to) {
tsibble_reconstruct(x, to)
}
Much appreciated for all your responses.
@lionel- yep, adding vec_proxy() makes everything work again.
@DavisVaughan Actually a bit of correctness has been sacrificed for performance. My understanding is that vec_restore() is called many times in {vctrs}. If I validate to, it is costly for {tsibble} for little gain.
A lot of the time it is called with empty data (when doing vec_ptype() which calls vec_slice()). Also with the proxy workaround, it will be called fewer times. So it might not be this costly to validate the data.
I think an overall goal for 0.4.0 is to make sure restore is only called on complete data.
I passed R CMD check --as-cran locally on macOS and win-builder devel. When I submitted to CRAN, vec_rbind() errors.
https://win-builder.r-project.org/incoming_pretest/tsibble_0.9.0_20200531_234022/Windows/00check.log
This is very strange. For some reasons it looks like your vec_proxy() method is not registered or picked up properly on this machine?
Which version of vctrs are you using locally? Note that I sent vctrs 0.3.1 to win-builder last week, and it "remembers" the last sent version, even for other packages. So maybe tsibble passes tests with 0.3.1, but not 0.3.0?
oh it would make sense that the proxy thing in vec-rbind does not work on 0.3.0. @DavisVaughan changed the implementation to take the proxy before the loop, rather than inside the loop.
In that case, you'll need to wait before vctrs 0.3.1 is on CRAN. I was holding it for https://github.com/r-spatial/sf/issues/1390, but I might send it anyway since it's not about a regression.
Oh yea, I was using 0.3.1, and thought 0.3.1 was on CRAN. Indeed, it doesn't work with vctrs 0.3.0.
When do you plan to send 0.3.1 for CRAN? I'll have to submit tsibble to fix CRAN check errors with dplyr v1.0.0.
@earowang on CRAN!