DataFrames.jl

Sampling GroupedDataFrames (rand)

Open quachpas opened this issue 2 months ago • 5 comments

Hello,

Currently, we cannot sample groups from a GroupedDataFrame directly: calling rand on it throws a MethodError.

julia> using DataFrames

julia> df = DataFrame(rand(100000, 100), :auto);

julia> gdf = groupby(df, :x1);  # code above is from #3102

julia> rand(gdf)  # throws MethodError
Stacktrace

ERROR: MethodError: no method matching Random.Sampler(::Type{TaskLocalRNG}, ::Random.SamplerTrivial{GroupedDataFrame{DataFrame}, Any}, ::Val{1})

Closest candidates are:
  Random.Sampler(::Type{<:AbstractRNG}, ::Random.Sampler, ::Union{Val{1}, Val{Inf}})
   @ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/Random.jl:147
  Random.Sampler(::Type{<:AbstractRNG}, ::Any, ::Union{Val{1}, Val{Inf}})
   @ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/Random.jl:183
  Random.Sampler(::Type{<:AbstractRNG}, ::BitSet, ::Union{Val{1}, Val{Inf}})
   @ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/generation.jl:450
  ...

Stacktrace:
 [1] Random.Sampler(T::Type{TaskLocalRNG}, sp::Random.SamplerTrivial{GroupedDataFrame{DataFrame}, Any}, r::Val{1})
   @ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/Random.jl:147
 [2] Random.Sampler(rng::TaskLocalRNG, x::Random.SamplerTrivial{GroupedDataFrame{DataFrame}, Any}, r::Val{1})
   @ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/Random.jl:139
 [3] rand(rng::TaskLocalRNG, X::Random.SamplerTrivial{GroupedDataFrame{DataFrame}, Any})
   @ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/Random.jl:255
 [4] rand(rng::TaskLocalRNG, X::GroupedDataFrame{DataFrame})
   @ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/Random.jl:255
 [5] rand(X::GroupedDataFrame{DataFrame})
   @ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/Random.jl:260
 [6] top-level scope
   @ REPL[228]:3

One way to work around the MethodError is to sample group indices and then index into the GroupedDataFrame:

julia> df = DataFrame(rand(100000, 100), :auto);

julia> gdf = groupby(df, :x1);

julia> indices = rand(1:length(gdf), 10^6);  # many more index lookups than groups; code above is from #3102

julia> getindex.(Ref(gdf), indices)  # sampling works

Code: #3102

What would be needed to support rand on a GroupedDataFrame? Or is it undesirable to do so?
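
For illustration, here is a minimal sketch of one way the hook could look, following the pattern in the Random stdlib documentation of adding a rand method for Random.SamplerTrivial. This is an assumption about a possible design, not how DataFrames.jl does or would implement it. Defining the method outside the package is type piracy, so it is only meant to show the shape of the interface. The toy columns g and v are made up for the example.

```julia
using DataFrames, Random

# Sketch only: a rand method for the trivial sampler wrapping a GroupedDataFrame.
# Defined outside DataFrames.jl this is type piracy; shown purely to illustrate the hook.
function Random.rand(rng::AbstractRNG, sp::Random.SamplerTrivial{<:GroupedDataFrame})
    gdf = sp[]                             # unwrap the GroupedDataFrame
    return gdf[rand(rng, 1:length(gdf))]   # pick one group uniformly (errors if there are no groups)
end

# Toy data (hypothetical columns): 4 groups of 5 rows each.
df = DataFrame(g = repeat(1:4, 5), v = 1:20)
gdf = groupby(df, :g)

one_group = rand(gdf)      # a single group, returned as a SubDataFrame
many = rand(gdf, 3)        # three draws with replacement, collected in a Vector
```

The element type of the vector returned by rand(gdf, 3) follows eltype(gdf), which the stack trace above suggests is Any, so a real implementation would probably also want a more specific eltype.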

versioninfo and package version

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Environment:
  JULIA_REVISE_POLL = 1
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 

(env) pkg> status DataFrames
Status `~/Project.toml`
  [a93c6f00] DataFrames v1.6.1

EDIT: reproducible on v1.7.0 (main)

quachpas, Apr 17 '24 08:04