arrow [R] slice_sample returns 0 rows

Describe the bug, including details regarding any error messages, version, and platform.

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, cyl), tf)
open_dataset(tf) %>%
  slice_sample(n = 3) %>%
  collect()
#> # A tibble: 0 × 11
#> # ℹ 11 variables: mpg <dbl>, disp <dbl>, hp <dbl>, drat <dbl>, wt <dbl>,
#> #   qsec <dbl>, vs <dbl>, am <dbl>, gear <dbl>, carb <dbl>, cyl <int>

Created on 2023-11-08 with reprex v2.0.2

Component(s)

R

Nov 08 '23 15:11 thisisnic

I think this is an implementation issue and we need to implement this differently; if I run this code repeatedly, sometimes I do get back a number of rows equal to or fewer than n.

Nov 09 '23 13:11 thisisnic

I have a probably related issue: on a Table with 2922121 rows, slice_sample(n = 100) tends to return the same rows each time, taken from the beginning of the Table.
The row count always respects n.

If I specify the expected row count with a proportion:

nr <- nrow(tbl_df)
slice_sample(tbl_df, prop = 100/nr)

I encounter the above issue (not exactly 100 rows but sometimes fewer or more), but the rows are truly randomized.

Nov 30 '23 10:11 lgaborini

Thanks for the extra information there @lgaborini!

I've looked at this again, and I think it's an unfortunate quirk of the original implementation (i.e. a known issue): we had to implement it a little differently because the C++ random function doesn't work here, e.g. https://github.com/apache/arrow/pull/14361#issue-1403214998.

I've tried setting the min parameter in the internal UDF higher than the default (fewer rows are selected) and lower than the default (the right number of rows is selected, but with a lot of repetition).

There's this line that just takes the first n rows of data, which is probably the source of the lack of randomness. I was wondering if we can call arrange() to order by the random number and then take the top n rows, though I'm not sure if that will actually work or not.
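
The arrange() idea can be sketched on an ordinary in-memory data frame. This is only an illustration: the column name .rand and the use of runif() are assumptions, and whether an equivalent expression can be evaluated inside an Arrow query is exactly the open question.

```r
library(dplyr)

set.seed(42)  # only to make the illustration repeatable

k <- 3  # number of rows wanted

sampled <- mtcars %>%
  mutate(.rand = runif(n())) %>%  # one uniform draw per row
  arrange(.rand) %>%              # sorting on the random key shuffles the rows
  head(k) %>%                     # the first k rows of a shuffled table
  select(-.rand)                  # drop the helper column again
```

Sorting on a random key touches every row, so this returns exactly k rows at the cost of a full sort.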

Dec 12 '23 11:12 thisisnic

> I was wondering if we can call arrange() to order by the random number and then take the top n rows

I think that will work, although I don't know if it will be slower or faster than calling compute() (i.e., get me a Table) and subsetting with integers obtained using sample(seq_len(x$num_rows)). It is essentially the same thing: in order to do an accurate sample, the total number of rows is needed.
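
A minimal sketch of the compute()-then-subset route, reusing the dataset from the reprex at the top of this issue. It assumes the whole dataset fits in memory as a Table; the variable names are just for illustration.

```r
library(arrow)
library(dplyr)

tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, cyl), tf)

# Materialise the dataset as an in-memory Arrow Table
tbl <- open_dataset(tf) %>% compute()

# Exact sampling: draw 3 distinct row positions out of the known total
idx <- sample(seq_len(tbl$num_rows), size = 3)

# Subset the Table by (1-based) row position, then pull into R
sampled <- collect(tbl[idx, ])
```

This always returns exactly the requested number of rows, but only after the full Table has been materialised.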

One can do a streaming (but approximate) sampling, too, which might be useful for non-statistical purposes (e.g., testing on something more realistic than the first n rows of data).
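
A rough sketch of what streaming, approximate sampling could look like. The helper sample_chunk() is hypothetical, and chunks of a data frame stand in for record batches; this is not a description of how arrow implements anything.

```r
set.seed(1)  # only to make the illustration repeatable

# Hypothetical helper: keep each row of a chunk independently with probability `prop`
sample_chunk <- function(chunk, prop) {
  chunk[runif(nrow(chunk)) < prop, , drop = FALSE]
}

# Stand-in for a stream of record batches: mtcars split into chunks of 8 rows
chunks <- split(mtcars, ceiling(seq_len(nrow(mtcars)) / 8))

# Each chunk is sampled on its own, so the total row count is never needed
approx_sample <- do.call(rbind, lapply(chunks, sample_chunk, prop = 0.25))

# The expected size is 0.25 * 32 = 8 rows, but the actual count varies between runs
nrow(approx_sample)
```

Because every row is kept or dropped independently, the result is random across the whole input, but its size is only approximately prop times the row count.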

Jan 03 '24 17:01 paleolimbot

I'm still seeing very non-random sampling with slice_sample() in Arrow 17.0.0. In a 400M row dataset spanning 2023-2024, a 10k row sample consistently does not contain timestamps later than Jan 2024. I'm guessing this is the known issue described above, but if a reprex would be helpful, I can put one together.

As this issue could be dangerous for someone assuming a random sample, should there be a note in the docs, or should slice_sample() be removed until it's fixed?

Dec 06 '24 18:12 blongworth