DataFrames.jl
Support Functors as Functions in columns transformation
This issue concerns the transformation dispatch mechanism, which doesn't recognize functors as Functions, as discussed on Discourse.
I have a use case where I use Functors as pre-trained features transformations. In such context, defining those structs as sub-types of Function doesn’t seem a natural choice as a system.
Here’s a functor that applies learned normalization:
using DataFrames
using Statistics: mean, std
struct Normalizer
    μ
    σ
end
Normalizer(x::AbstractVector) = Normalizer(mean(x), std(x))
function (m::Normalizer)(x::Real)
    return (x - m.μ) / m.σ
end
function (m::Normalizer)(x::AbstractVector)
    return (x .- m.μ) ./ m.σ
end
df = DataFrame(:v1 => rand(5), :v2 => rand(5))
feat_names = names(df)
norms = map((feat) -> Normalizer(df[:, feat]), feat_names)
The following doesn’t work:
transform(df, feat_names .=> norms .=> feat_names)
ERROR: LoadError: ArgumentError: Unrecognized column selector: "v1" => (Normalizer(0.5407170762469404, 0.1599492895436335) => "v1")
However, somewhat surprisingly, using ByRow does work:
transform(df, feat_names .=> ByRow.(norms) .=> feat_names)
5×2 DataFrame
Row │ v1 v2
│ Float64 Float64
─────┼───────────────────────
1 │ 0.0386826 0.479449
2 │ 0.919179 -1.61432
3 │ 1.05579 0.584841
4 │ -0.930937 0.854153
5 │ -1.08272 -0.304124
So to use the vectorized form, it seems like a mapping of the Functors into Functions is required:
norms_f = map(f -> (x) -> f(x), norms)
transform(df, feat_names .=> norms_f .=> feat_names)
5×2 DataFrame
Row │ v1 v2
│ Float64 Float64
─────┼───────────────────────
1 │ 0.0386826 0.479449
2 │ 0.919179 -1.61432
3 │ 1.05579 0.584841
4 │ -0.930937 0.854153
5 │ -1.08272 -0.304124
I can see that this remapping is a fairly simple way to circumvent the functor limitation. Yet isn't it counterintuitive that the functor works with ByRow but not in the vectorized case? Although dispatch happens differently under ByRow, from a user's perspective, recognizing functors as Functions in transform would be the most natural way to handle them, in my opinion.
The challenge is that we already have quite a complex system of rules for how these transformations are interpreted; see:
julia> using DataFrames
julia> methods(DataFrames.normalize_selection)
# 14 methods for generic function "normalize_selection":
[1] normalize_selection(idx::DataFrames.AbstractIndex, sel::Union{AbstractString, Signed, Symbol, Unsigned}, renamecols::Bool)
[2] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{typeof(nrow), <:AbstractString}, renamecols::Bool)
[3] normalize_selection(idx::DataFrames.AbstractIndex, sel::Colon, renamecols::Bool)
[4] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{typeof(nrow), Symbol}, renamecols::Bool)
[5] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{<:Union{AbstractString, Signed, Symbol, Unsigned}, <:Pair{<:Union{Function, Type}, <:Union{AbstractString, Symbol}}}, renamecols::Bool)
[6] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{<:Union{AbstractString, Signed, Symbol, Unsigned}, <:AbstractString}, renamecols::Bool)
[7] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{<:Union{AbstractString, Signed, Symbol, Unsigned}, Symbol}, renamecols::Bool)
[8] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{<:Union{AbstractString, Signed, Symbol, Unsigned}, <:Union{Function, Type}}, renamecols::Bool)
[9] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{<:Union{AbstractString, Signed, Symbol, Unsigned}, <:Union{AbstractVector{Symbol}, AbstractVector{<:AbstractString}}}, renamecols::Bool)
[10] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{<:Any, <:Pair{<:Union{Function, Type}, <:Union{AbstractVector{Symbol}, AbstractString, DataType, Function, Symbol, AbstractVector{<:AbstractString}}}}, renamecols::Bool)
[11] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{<:Any, <:Union{Function, Type}}, renamecols::Bool)
[12] normalize_selection(idx::DataFrames.AbstractIndex, sel::typeof(nrow), renamecols::Bool)
[13] normalize_selection(idx::DataFrames.AbstractIndex, sel::Union{Function, Type}, renamecols::Bool)
[14] normalize_selection(idx::DataFrames.AbstractIndex, sel, renamecols::Bool)
and it is quite tricky to mess with them. I will think of what can be done here.
@nalimilan - do you have any opinion here?
I'm also hesitant in general to accept objects of any type as it can create ambiguities, but I have to admit not supporting non-Function functors is a bit annoying. In theory, we could consider that any type which isn't known to be an index is a function or functor, right? The main risk would be if some types can be both, but that's not too likely hopefully.
I have a use case where I use Functors as pre-trained features transformations. In such context, defining those structs as sub-types of Function doesn’t seem a natural choice as a system.
@jeremiedb "Natural" is very hard to define. Any particular reason why you wouldn't want your functors to inherit from Function? The reasons I can see are 1) you cannot inherit from two different types, and 2) by default, the compiler only specializes on Function arguments when they are called (i.e. accessing fields is not enough), though you can force specialization by having a type parameter.
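Point 2) can be sketched roughly as follows. `Scaler`, `apply_map`, and `apply_map_spec` are hypothetical names introduced only for illustration. The sketch assumes the usual Julia heuristic that a Function argument which is merely passed through to another function may not trigger specialization, and that adding a type parameter to the method signature requests it:

```julia
# Hypothetical functor that subtypes Function.
struct Scaler <: Function
    factor::Float64
end
(s::Scaler)(x) = s.factor * x

# `f` is only passed through to `map`, never called directly here,
# so Julia may choose not to specialize this method on typeof(f).
apply_map(f, xs) = map(f, xs)

# The type parameter F asks the compiler to specialize on typeof(f).
apply_map_spec(f::F, xs) where {F} = map(f, xs)

apply_map(Scaler(2.0), [1.0, 2.0])       # [2.0, 4.0]
apply_map_spec(Scaler(2.0), [1.0, 2.0])  # [2.0, 4.0]
```

Both versions return the same result; the difference, if any, is only in what the compiler specializes on.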
"Natural" is very hard to define.
Sorry for the vague wording. I had 1) in mind, that is having a type hierarchy of transformation functions such as:
abstract type Projector end
struct Normalizer <: Projector
    μ
    σ
end
struct Quantilizer <: Projector
    quantiles
end
Yes, but I assume that @nalimilan wants to understand why not have Projector <: Function?
Oh, I just didn't realize it could make sense! But you're right: by declaring abstract type Projector <: Function end, it works.
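A minimal sketch of that change, using the Projector and Normalizer definitions from this thread. The `Float64` field types are an assumption added here for concreteness, and the DataFrames call is left out so that the block only needs the Statistics standard library:

```julia
using Statistics: mean, std

# Declaring the abstract type as a subtype of Function marks every
# concrete Projector as callable in the type hierarchy.
abstract type Projector <: Function end

struct Normalizer <: Projector
    μ::Float64   # assumed field type, untyped in the original post
    σ::Float64
end

# Learn the normalization parameters from a vector.
Normalizer(x::AbstractVector) = Normalizer(mean(x), std(x))

(m::Normalizer)(x::Real) = (x - m.μ) / m.σ
(m::Normalizer)(x::AbstractVector) = (x .- m.μ) ./ m.σ

v = [1.0, 2.0, 3.0]
n = Normalizer(v)

n isa Function   # true
n(v)             # [-1.0, 0.0, 1.0]
```

Since `n isa Function` now holds, a pair such as `"v1" => n => "v1"` should match the `Union{Function, Type}` methods of `normalize_selection` listed earlier, which is consistent with the vectorized `transform` call working after this change.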
Defining the functors as subtypes of Function is a minimal modification, so it seems like a legitimate trick. Perhaps it just needs a disclaimer somewhere.
so it seems like a legitimate trick
For me (and I guess also for @nalimilan) this is natural. That way you signal, through the type hierarchy, that your object is callable.
Note that this is not unique to DataFrames.jl. Actually, 143 methods in Base Julia rely on the fact that some object is callable; to name some common ones: replace!, findfirst (and similar), etc.
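As an illustration of the Base behavior mentioned above, here is a small sketch with two hypothetical functors, `AboveThreshold` and `Capper`. Both subtype Function, so they are accepted by `findfirst` and `replace!`, whose relevant methods are declared for Function (or `Union{Function, Type}`) arguments:

```julia
# Hypothetical predicate functor.
struct AboveThreshold <: Function
    t::Float64
end
(p::AboveThreshold)(x) = x > p.t

# Hypothetical transformation functor.
struct Capper <: Function
    cap::Float64
end
(c::Capper)(x) = min(x, c.cap)

xs = [0.1, 0.7, 0.3, 0.9]

# Index of the first element satisfying the predicate.
first_idx = findfirst(AboveThreshold(0.5), xs)   # 2

# Replace each element x by Capper(0.5)(x), working on a copy of xs.
capped = replace!(Capper(0.5), copy(xs))         # [0.1, 0.5, 0.3, 0.5]
```

Without the `<: Function` declaration, these calls would be expected to fail with a MethodError, since no method of these functions would match the functor's type.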