dt_separate
Hi Tyson,
I have been playing around a bit with your package. For the most part it works well, and I like that it really only depends on data.table and Rcpp. However, I have found that dt_separate with tidyfast 0.21 from CRAN does not work for my use case (I have not tried the development version yet).
library(tidyfast)
library(data.table)
data <- readRDS(url("https://shiny.rstudio.com/tutorial/written-tutorial/lesson5/census-app/data/counties.rds"))
head(data)
#> name total.pop white black hispanic asian
#> 1 alabama,autauga 54571 77.2 19.3 2.4 0.9
#> 2 alabama,baldwin 182265 83.5 10.9 4.4 0.7
#> 3 alabama,barbour 27457 46.8 47.8 5.1 0.4
#> 4 alabama,bibb 22915 75.0 22.9 1.8 0.1
#> 5 alabama,blount 57322 88.9 2.5 8.1 0.2
#> 6 alabama,bullock 10914 21.9 71.0 7.1 0.2
setDT(data)
# Apply separate
dt_separate(data, name, c("state", "county"))
# Nothing happened: data is unchanged
head(data)
#> name total.pop white black hispanic asian
#> 1: alabama,autauga 54571 77.2 19.3 2.4 0.9
#> 2: alabama,baldwin 182265 83.5 10.9 4.4 0.7
#> 3: alabama,barbour 27457 46.8 47.8 5.1 0.4
#> 4: alabama,bibb 22915 75.0 22.9 1.8 0.1
#> 5: alabama,blount 57322 88.9 2.5 8.1 0.2
#> 6: alabama,bullock 10914 21.9 71.0 7.1 0.2
# tidyr works fine
tidyr::separate(data, name, c("state", "county"))
#> Warning: Expected 2 pieces. Additional pieces discarded in 645 rows [25, 58, 79,
#> 111, 122, 143, 152, 163, 164, 165, 175, 191, 192, 193, 194, 195, 196, 197, 198,
#> 199, ...].
#> state county total.pop white black hispanic asian
#> 1: alabama autauga 54571 77.2 19.3 2.4 0.9
#> 2: alabama baldwin 182265 83.5 10.9 4.4 0.7
#> 3: alabama barbour 27457 46.8 47.8 5.1 0.4
#> 4: alabama bibb 22915 75.0 22.9 1.8 0.1
#> 5: alabama blount 57322 88.9 2.5 8.1 0.2
#> ---
#> 3078: wyoming teton 21294 82.2 1.9 15.0 1.1
#> 3079: wyoming uinta 21118 88.5 2.3 8.8 0.3
#> 3080: wyoming washakie 8533 83.9 2.6 13.6 0.6
#> 3081: wyoming weston 7208 93.8 2.0 3.0 0.3
#> 3082: new mexico 76569 36.2 5.4 58.3 0.5
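# For comparison, the plain data.table idiom gives the result I was expecting.
# (A quick sketch; run on a copy here only so that it does not interfere with
# the calls below. "tmp" is a throwaway name.)
tmp <- copy(data)
tmp[, c("state", "county") := tstrsplit(name, ",", fixed = TRUE)][, name := NULL]
head(tmp)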
# This strangely appears to duplicate the information
dt_separate(data, name, c("state", "county"), immutable = FALSE)
head(data)
#> total.pop white black hispanic asian state county
#> 1: 54571 77.2 19.3 2.4 0.9 alabama,autauga alabama,autauga
#> 2: 182265 83.5 10.9 4.4 0.7 alabama,baldwin alabama,baldwin
#> 3: 27457 46.8 47.8 5.1 0.4 alabama,barbour alabama,barbour
#> 4: 22915 75.0 22.9 1.8 0.1 alabama,bibb alabama,bibb
#> 5: 57322 88.9 2.5 8.1 0.2 alabama,blount alabama,blount
#> 6: 10914 21.9 71.0 7.1 0.2 alabama,bullock alabama,bullock
Created on 2021-08-02 by the reprex package (v0.3.0)
Then, I see you use data.table::copy by default (immutable = TRUE). This appears highly inefficient to me, as copy deep-copies the whole table, whereas base R's copy-on-modify semantics would be much cheaper. More generally, I think there is somewhat of an issue with data.table having propagated the notion that copies in base R are inefficient. They have not been since R 3.5.0, which introduced shallow copies in base R. To prove the point:
library(data.table)
library(collapse)
library(microbenchmark)
dat <- qDT(list(x = rnorm(1e8)))
# Returns a shallow copy, function is written entirely in base R
microbenchmark(ftransform(dat, y = x + 1), times = 5)
#> Unit: milliseconds
#> expr min lq mean median uq
#> ftransform(dat, y = x + 1) 540.7881 563.6548 630.8815 579.5532 615.984
#> max neval
#> 854.4274 5
# Modify by reference
microbenchmark(dat[, y := x + 1], times = 5)
#> Unit: milliseconds
#> expr min lq mean median uq max
#> dat[, `:=`(y, x + 1)] 464.3758 629.7212 664.0154 664.0158 724.2668 837.6971
#> neval
#> 5
# The cost of a shallow copy in base R:
tracemem(dat) # Tracing memory
#> [1] "<0000000007A66F20>"
# This makes a shallow copy
oldClass(dat) <- c("data.table", "data.frame")
#> tracemem[0x0000000007a66f20 -> 0x00000000101ae658]
dat[, y := x + 1] # data.table also detects the copy (gives a warning; I don't know why it is not shown here)
untracemem(dat)
# Let's benchmark this
v <- c("data.table", "data.frame")
microbenchmark(oldClass(dat) <- v)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> oldClass(dat) <- v 1.338 1.339 1.85214 1.785 1.785 22.758 100
# This creates two shallow copies + overallocating 100 columns (to trick data.table into thinking
# the table was not copied, and to be able to add columns by reference into empty column pointers using := afterwards)
tracemem(dat)
#> [1] "<000000001059BB98>"
dat <- qDT(dat)
#> tracemem[0x000000001059bb98 -> 0x00000000101ae118]: qDT_raw alc qDT
#> tracemem[0x00000000101ae118 -> 0x00000000101ae218]: qDT_raw alc qDT
dat[, y := x + 1] # Allows me to do this without a warning.
untracemem(dat)
# Cost:
microbenchmark(dat <- qDT(dat))
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> dat <- qDT(dat) 4.462 4.909 13.88302 5.8015 9.371 655.538 100
# Or better:
alc <- collapse:::alc
microbenchmark(dat <- alc(dat))
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> dat <- alc(dat) 2.231 2.677 3.01686 2.678 2.678 30.791 100
Created on 2021-08-02 by the reprex package (v0.3.0)
In fact, Matt will probably disagree with me for one reason or another, but if I were to redesign data.table with R 3.5.0 in place, I'd get rid of the whole mechanism for avoiding shallow copies (overallocated data.tables, the ".internal.selfref" attribute, etc.) and just focus on avoiding deep copies. I believe the only place where there are significant gains from avoiding shallow copies in R is inside tight loops, such as the example given here of looping over data.frame subsets (and I think even in that case, [[.data.frame itself probably costs much more time than the shallow copies it creates). So in summary: I think doing this without data.table::copy would be much faster at any data size.
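To make this concrete, here is a minimal sketch of an immutable separate without data.table::copy, building on the cheap shallow copy from collapse benchmarked above. (dt_separate_shallow is a hypothetical name, and the tstrsplit call is not tested against edge cases such as rows with fewer pieces than into.)

dt_separate_shallow <- function(dt, col, into, sep = ",") {
  out <- collapse::qDT(dt)   # shallow copy: microseconds, columns shared with dt
  pieces <- data.table::tstrsplit(out[[col]], sep, fixed = TRUE,
                                  keep = seq_along(into))
  out[, (into) := pieces]    # only the new columns are materialized
  out[, (col) := NULL]       # removed from the copy only; dt is untouched
  out[]
}
# e.g. dt_separate_shallow(data, "name", c("state", "county"))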
Finally, talking about loops: I just had a brief glance at the C++ code. In lines 52-62 of fill.cpp, if you have no particular reason to call STRING_ELT every time, I would also create string pointers, SEXP* xin = STRING_PTR(x); SEXP* xout = STRING_PTR(out);, and then index the pointers as in the other loops. STRING_ELT fetches the data pointer and subsets it on every call, so you can simply hoist that step out of the loop with STRING_PTR.
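To illustrate the pattern, here is a minimal, self-contained loop (a hypothetical example, not the actual fill.cpp code; copy_via_ptr is a made-up name, and I keep SET_STRING_ELT on the write side to respect the write barrier):

Rcpp::cppFunction('
SEXP copy_via_ptr(SEXP x) {
  R_xlen_t n = Rf_xlength(x);
  SEXP out = PROTECT(Rf_allocVector(STRSXP, n));
  const SEXP* xin = STRING_PTR(x);    // pointer created once, outside the loop
  for (R_xlen_t i = 0; i < n; ++i) {
    SET_STRING_ELT(out, i, xin[i]);   // no STRING_ELT(x, i) per iteration
  }
  UNPROTECT(1);
  return out;
}
')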