DataFrames.jl
Sampling GroupedDataFrames (rand)
Hello,
Currently, we cannot sample from a GroupedDataFrame directly: calling rand on it throws a MethodError.
julia> df = DataFrame(rand(100000, 100), :auto);
gdf = groupby(df, :x1);
# Code above from #3102
rand(gdf) # MethodError
Stacktrace
ERROR: MethodError: no method matching Random.Sampler(::Type{TaskLocalRNG}, ::Random.SamplerTrivial{GroupedDataFrame{DataFrame}, Any}, ::Val{1})
Closest candidates are:
Random.Sampler(::Type{<:AbstractRNG}, ::Random.Sampler, ::Union{Val{1}, Val{Inf}})
@ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/Random.jl:147
Random.Sampler(::Type{<:AbstractRNG}, ::Any, ::Union{Val{1}, Val{Inf}})
@ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/Random.jl:183
Random.Sampler(::Type{<:AbstractRNG}, ::BitSet, ::Union{Val{1}, Val{Inf}})
@ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/generation.jl:450
...
Stacktrace:
[1] Random.Sampler(T::Type{TaskLocalRNG}, sp::Random.SamplerTrivial{GroupedDataFrame{DataFrame}, Any}, r::Val{1})
@ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/Random.jl:147
[2] Random.Sampler(rng::TaskLocalRNG, x::Random.SamplerTrivial{GroupedDataFrame{DataFrame}, Any}, r::Val{1})
@ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/Random.jl:139
[3] rand(rng::TaskLocalRNG, X::Random.SamplerTrivial{GroupedDataFrame{DataFrame}, Any})
@ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/Random.jl:255
[4] rand(rng::TaskLocalRNG, X::GroupedDataFrame{DataFrame})
@ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/Random.jl:255
[5] rand(X::GroupedDataFrame{DataFrame})
@ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/Random.jl:260
[6] top-level scope
@ REPL[228]:3
One way to work around that MethodError is to sample group indices and then index into the GroupedDataFrame:
julia> df = DataFrame(rand(100000, 100), :auto);
gdf = groupby(df, :x1);
julia> indices = rand(1:length(gdf), 10^6) # Many more draws than there are groups.
# Code above is from #3102
julia> getindex.(Ref(gdf), indices) # Sample works
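For convenience, the workaround can be wrapped in a small helper. This is only a sketch: `sample_groups` is a hypothetical name, not something DataFrames.jl provides, and the tiny `df` below is made up so the example runs quickly. Passing the RNG explicitly makes the draws reproducible.

```julia
using DataFrames, Random

# Hypothetical helper (not part of DataFrames.jl):
# draw `n` groups from `gdf` with replacement.
function sample_groups(rng::AbstractRNG, gdf::GroupedDataFrame, n::Integer)
    indices = rand(rng, 1:length(gdf), n)  # sample group positions
    return [gdf[i] for i in indices]       # each element is a SubDataFrame
end

# Small illustrative data: 4 groups of 3 rows each.
df = DataFrame(g = repeat(1:4, inner = 3), v = 1:12)
gdf = groupby(df, :g)

samples = sample_groups(MersenneTwister(1), gdf, 5)
```

Each sampled element is a view into the parent data frame, so no rows are copied.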
Code: #3102
What would be needed to support rand on a GroupedDataFrame? Or is it undesirable to do so?
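As a starting point for discussion, here is a minimal sketch of what I think hooking into the Random API could look like. It relies on the generic fallback that wraps the collection in a `Random.SamplerTrivial`, and only adds the `rand` method for that sampler. Note that defining this outside DataFrames.jl is type piracy, so this is meant as an illustration rather than a recommended user-level fix, and whether a sampled element should be a SubDataFrame is an assumption on my part.

```julia
using DataFrames, Random

# Sketch only: defined outside DataFrames.jl, this is type piracy.
# `sp[]` returns the GroupedDataFrame wrapped by the trivial sampler.
function Random.rand(rng::AbstractRNG, sp::Random.SamplerTrivial{<:GroupedDataFrame})
    gdf = sp[]
    return gdf[rand(rng, 1:length(gdf))]  # pick one group uniformly at random
end

# Small illustrative data: 4 groups of 3 rows each.
df = DataFrame(g = repeat(1:4, inner = 3), v = 1:12)
gdf = groupby(df, :g)

one_group = rand(gdf)     # a single SubDataFrame
many_groups = rand(gdf, 5) # a Vector with 5 sampled groups
```

With this method in place, the array form `rand(gdf, n)` goes through the generic `rand!` loop in Random, so no extra method is needed for it. The element type of the result is `Any`, as the `SamplerTrivial{GroupedDataFrame{DataFrame}, Any}` in the stacktrace suggests.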
versioninfo and package version
julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Environment:
JULIA_REVISE_POLL = 1
JULIA_EDITOR = code
JULIA_NUM_THREADS =
(env) pkg> status DataFrames
Status `~/Project.toml`
[a93c6f00] DataFrames v1.6.1
EDIT: reproducible on v1.7.0 (main)