DataFrames.jl

Support Functors as Functions in columns transformation

Open jeremiedb opened this issue 3 years ago • 6 comments

This issue relates to the transformation dispatch mechanism, which doesn't recognize Functors as Functions, as discussed on Discourse.

I have a use case where I use Functors as pre-trained feature transformations. In that context, defining those structs as subtypes of Function doesn’t seem like a natural design choice.

Here’s a functor that applies learned normalization:

using DataFrames
using Statistics: mean, std

struct Normalizer
    μ
    σ
end

Normalizer(x::AbstractVector) = Normalizer(mean(x), std(x))

function (m::Normalizer)(x::Real)
    return (x - m.μ) / m.σ
end

function (m::Normalizer)(x::AbstractVector)
    return (x .- m.μ) ./ m.σ
end

df = DataFrame(:v1 => rand(5), :v2 => rand(5))
feat_names = names(df)
norms = map((feat) -> Normalizer(df[:, feat]), feat_names)

The following doesn’t work:

transform(df, feat_names .=> norms .=> feat_names)
ERROR: LoadError: ArgumentError: Unrecognized column selector: "v1" => (Normalizer(0.5407170762469404, 0.1599492895436335) => "v1")

However, somewhat surprisingly, using ByRow does work:

transform(df, feat_names .=> ByRow.(norms) .=> feat_names)
5×2 DataFrame
 Row │ v1          v2        
     │ Float64     Float64
─────┼───────────────────────
   1 │  0.0386826   0.479449
   2 │  0.919179   -1.61432
   3 │  1.05579     0.584841
   4 │ -0.930937    0.854153
   5 │ -1.08272    -0.304124

So to use the vectorized form, it seems each Functor has to be wrapped in an anonymous function:

norms_f = map(f -> (x) -> f(x), norms)
transform(df, feat_names .=> norms_f .=> feat_names)
5×2 DataFrame
 Row │ v1          v2        
     │ Float64     Float64
─────┼───────────────────────
   1 │  0.0386826   0.479449
   2 │  0.919179   -1.61432
   3 │  1.05579     0.584841
   4 │ -0.930937    0.854153
   5 │ -1.08272    -0.304124

I can see that this remapping is a fairly simple way to work around the functor limitation. Yet isn’t it counterintuitive that the Functor works under ByRow but not in the vectorized case? Although dispatch happens differently under ByRow, from a user perspective, having transform recognize Functors as Functions would be the most natural way to handle them, in my opinion.

jeremiedb • Jan 07 '22 17:01

The challenge is that we already have quite a complex system of rules for how these transformations are interpreted, see:

julia> using DataFrames

julia> methods(DataFrames.normalize_selection)
# 14 methods for generic function "normalize_selection":
[1] normalize_selection(idx::DataFrames.AbstractIndex, sel::Union{AbstractString, Signed, Symbol, Unsigned}, renamecols::Bool)
[2] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{typeof(nrow), <:AbstractString}, renamecols::Bool)
[3] normalize_selection(idx::DataFrames.AbstractIndex, sel::Colon, renamecols::Bool)
[4] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{typeof(nrow), Symbol}, renamecols::Bool)
[5] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{<:Union{AbstractString, Signed, Symbol, Unsigned}, <:Pair{<:Union{Function, Type}, <:Union{AbstractString, Symbol}}}, renamecols::Bool)
[6] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{<:Union{AbstractString, Signed, Symbol, Unsigned}, <:AbstractString}, renamecols::Bool)
[7] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{<:Union{AbstractString, Signed, Symbol, Unsigned}, Symbol}, renamecols::Bool)
[8] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{<:Union{AbstractString, Signed, Symbol, Unsigned}, <:Union{Function, Type}}, renamecols::Bool)
[9] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{<:Union{AbstractString, Signed, Symbol, Unsigned}, <:Union{AbstractVector{Symbol}, AbstractVector{<:AbstractString}}}, renamecols::Bool)
[10] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{<:Any, <:Pair{<:Union{Function, Type}, <:Union{AbstractVector{Symbol}, AbstractString, DataType, Function, Symbol, AbstractVector{<:AbstractString}}}}, renamecols::Bool)
[11] normalize_selection(idx::DataFrames.AbstractIndex, sel::Pair{<:Any, <:Union{Function, Type}}, renamecols::Bool)
[12] normalize_selection(idx::DataFrames.AbstractIndex, sel::typeof(nrow), renamecols::Bool)
[13] normalize_selection(idx::DataFrames.AbstractIndex, sel::Union{Function, Type}, renamecols::Bool)
[14] normalize_selection(idx::DataFrames.AbstractIndex, sel, renamecols::Bool)

and it is quite tricky to mess with them. I will think of what can be done here.

@nalimilan - do you have any opinion here?

bkamins • Jan 07 '22 21:01

I'm also hesitant in general to accept objects of any type as it can create ambiguities, but I have to admit not supporting non-Function functors is a bit annoying. In theory, we could consider that any type which isn't known to be an index is a function or functor, right? The main risk would be if some types can be both, but that's not too likely hopefully.

I have a use case where I use Functors as pre-trained feature transformations. In that context, defining those structs as subtypes of Function doesn’t seem like a natural design choice.

@jeremiedb "Natural" is very hard to define. Any particular reason why you wouldn't want your functors to inherit from Function? The reasons I can see are 1) you cannot inherit from two different types, 2) by default, the compiler only specializes on Function arguments when they are called (i.e. accessing fields is not enough), though you can force specialization by having a type parameter.
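A minimal sketch of point 2, for illustration only. The names `Scaler`, `passthrough` and `passthrough_specialized` are hypothetical and not part of DataFrames.jl. It shows the usual pattern of adding a type parameter so that a method is specialized on the concrete functor type even when the argument is only passed along rather than called directly:

```julia
# Hypothetical functor that subtypes Function (illustrative names only).
struct Scaler <: Function
    μ::Float64
    σ::Float64
end

# Make Scaler instances callable on a single number.
(s::Scaler)(x::Real) = (x - s.μ) / s.σ

# `f` is only forwarded to `map` here, never called in this method body.
# For arguments that are Functions, Julia's heuristics may skip specialization
# in such pass-through methods.
passthrough(f, xs) = map(f, xs)

# Adding a type parameter asks the compiler to specialize on typeof(f).
passthrough_specialized(f::F, xs) where {F} = map(f, xs)

s = Scaler(1.0, 2.0)
r1 = passthrough(s, [1.0, 3.0, 5.0])
r2 = passthrough_specialized(s, [1.0, 3.0, 5.0])
```

Both calls return the same values; the difference is only in how the compiler may treat the `f` argument, which matters mostly for performance-sensitive code.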

nalimilan • Jan 08 '22 16:01

"Natural" is very hard to define.

Sorry for the vague wording. I had 1) in mind, that is having a type hierarchy of transformation functions such as:

abstract type Projector end

struct Normalizer <: Projector
    μ
    σ
end

struct Quantilizer <: Projector
    quantiles
end

jeremiedb • Jan 08 '22 21:01

Yes, but I assume that @nalimilan wants to understand why not have Projector <: Function?

bkamins • Jan 08 '22 21:01

Oh, I just didn't realize that could make sense! But you're right: with abstract type Projector <: Function end, it works. Defining the Functors as subtypes of Function is a minimal modification, so it seems like a legitimate trick; perhaps it just needs some disclaimer somewhere.
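For reference, a sketch of what that looks like end to end, reusing the Normalizer from the original post with small fixed columns instead of rand so the result is easy to check. The concrete field types are an assumption added here for the example:

```julia
using DataFrames
using Statistics: mean, std

# The whole hierarchy of transformations is declared callable via Function.
abstract type Projector <: Function end

struct Normalizer <: Projector
    μ::Float64   # field types are an assumption made for this sketch
    σ::Float64
end

# "Fit" the normalizer on a column.
Normalizer(x::AbstractVector) = Normalizer(mean(x), std(x))

# Vectorized application to a whole column.
(m::Normalizer)(x::AbstractVector) = (x .- m.μ) ./ m.σ

df = DataFrame(:v1 => [1.0, 2.0, 3.0], :v2 => [10.0, 20.0, 60.0])
feat_names = names(df)
norms = map(feat -> Normalizer(df[:, feat]), feat_names)

# The functors are now accepted directly, without the anonymous-function wrapper.
out = transform(df, feat_names .=> norms .=> feat_names)
```

With these inputs, v1 has mean 2 and standard deviation 1, so the transformed v1 column is -1, 0, 1.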

jeremiedb • Jan 08 '22 21:01

so it seems like a legitimate trick

For me (and I guess also @nalimilan) this is natural. Through the type hierarchy, you then signal that your object is callable.

Note that this is not unique to DataFrames.jl. Actually, 143 methods in Base Julia rely on the fact that some object is callable; to name some common ones: replace!, findfirst (and similar).
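A small Base-only sketch of the same effect with findfirst. The types `IsBig` and `IsBigFn` are hypothetical and exist only for this illustration; the failing case assumes a Julia version where the predicate method of findfirst is annotated with `::Function`:

```julia
# Callable struct that is NOT a subtype of Function.
struct IsBig
    threshold::Int
end
(p::IsBig)(x) = x > p.threshold

# The same predicate, declared as a Function subtype.
struct IsBigFn <: Function
    threshold::Int
end
(p::IsBigFn)(x) = x > p.threshold

xs = [1, 5, 10]

# Dispatches to the predicate method of findfirst.
idx = findfirst(IsBigFn(3), xs)

# The plain functor is expected to find no matching method (assumption:
# findfirst's predicate argument is typed as Function in this Julia version).
plain_failed = try
    findfirst(IsBig(3), xs)
    false
catch err
    err isa MethodError
end
```

Here idx is 2, since 5 is the first element greater than 3.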

bkamins • Jan 08 '22 22:01