DataFrames.jl Support type-based column selectors

Open wolthom opened this issue 2 years ago • 17 comments

Currently, when applying a transformation to all columns of a specific type (or subtypes of an abstract type), a pattern such as transform(df, names(df, Number) .=> f) is used. Ideally, this could be achieved with a column-selector, e.g. transform(df, Cols(Number) .=> f).

Although this is a minor convenience feature, it may make the column-selector API (even) more consistent, and users would not have to repeat the name of the data frame multiple times.
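For reference, a minimal sketch of the pattern that works today. The data frame and the `double` function are made up for illustration; `names(df, Number)` and the broadcasted `.=>` are the existing DataFrames.jl pieces described above.

```julia
using DataFrames

# Illustrative data: one string column and two numeric columns.
df = DataFrame(id = ["a", "b", "c"], x = [1, 2, 3], y = [0.5, 1.5, 2.5])

# Hypothetical transformation, used only for this example.
double(v) = 2 .* v

# names(df, Number) returns the names of the columns whose element type
# is a subtype of Number.
numeric_cols = names(df, Number)

# Broadcasting `.=>` builds one `column => double` pair per selected column;
# transform keeps the original columns and adds x_double and y_double.
result = transform(df, numeric_cols .=> double)
```

Note that `df` appears twice in the `transform` call, which is the repetition the proposed `Cols(Number)` selector would remove.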

Apr 01 '22 20:04 wolthom

Yes - I just need to think about whether there are any corner cases that would lead to problems. We could even potentially allow df[:, Number] if it does not lead to problems.

Apr 01 '22 20:04 bkamins

OK - now I remember why we do not have this.

Except for the names function, all other column selectors currently get resolved in the context of AbstractIndex, not AbstractDataFrame (i.e. they only have access to column names and do not have access to column contents).

So adding the requested functionality would require a significant redesign. This is of course doable.

@nalimilan - what do you think?

Apr 02 '22 10:04 bkamins

I agree it would be nice to be able to do transform(df, Cols(Number) .=> f) at least. But yeah the implementation may not be trivial. (This was discussed briefly at https://github.com/JuliaData/DataFrames.jl/pull/2400.)

Apr 06 '22 20:04 nalimilan

@nalimilan - I can do it. The only issue is that the PR might end up being 1000 lines and touch many files so it will be hard to review (not sure yet - maybe it will be easier). Essentially we need to drop using AbstractIndex almost everywhere and instead pass around AbstractDataFrame. This is challenging because we need to correctly handle all types that DataFrames.jl defines (as deep down they all use AbstractIndex somewhere).

In other words, the original design of DataFrames.jl assumes such functionality will not be needed (AbstractIndex is not aware of column element types), so we need to change a fundamental element of the design here.

Apr 06 '22 21:04 bkamins

@nalimilan - let us make a decision if we:

1. add it in the 1.4 release;
2. postpone it to later releases for a decision;
3. keep things as they are (i.e. require the names(df, "type") syntax).

I would like to finalize the scope of 1.4 release so that we can have it before JuliaCon.

May 08 '22 12:05 bkamins

I am moving it to the 1.5 release for a decision.

Jun 07 '22 07:06 bkamins

I was thinking about it. The issue is that AbstractIndex was designed as: https://github.com/JuliaData/DataFrames.jl/blob/b240458aca1681e74a94e979a0141b2b16f1a3e0/src/other/index.jl#L1

so it - by design - only supports name lookup.

Now the issue is that to create a DataFrame, we have to construct its index first. So we cannot even naturally have a back-reference to a data frame in the index.

In summary, this means a major redesign of DataFrame, SubDataFrame, DataFrameRow, Index, and SubIndex if we wanted to allow for such a change. One particular consequence is that the 1.5 release would be incompatible with the 1.4 release at the binary level (and people often serialize/jld data frames).

@nalimilan - the question is if we want to do it.

An alternative would be to special case such selector before passing it to index, but this will lead to ugly design (in many places we will have to apply a patch that is hard to maintain).
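To make the "special case before passing it to the index" idea concrete, here is a rough sketch written outside the package. `TypeCols`, `resolve`, and `select_resolved` are hypothetical names invented for this example and are not part of DataFrames.jl; only `names(df, T)` and `select` are existing API.

```julia
using DataFrames

# Hypothetical selector that carries a type; NOT part of DataFrames.jl.
struct TypeCols{T} end
TypeCols(::Type{T}) where {T} = TypeCols{T}()

# Hypothetical helper: resolve the selector to plain column names while the
# data frame (and therefore the column element types) is still available.
resolve(df::AbstractDataFrame, ::TypeCols{T}) where {T} = names(df, T)
# Every other selector is passed through unchanged.
resolve(df::AbstractDataFrame, sel) = sel

# Hypothetical wrapper: after resolution, the index only ever sees names.
select_resolved(df::AbstractDataFrame, sel) = select(df, resolve(df, sel))

df = DataFrame(id = ["a", "b"], x = [1, 2], y = [0.5, 1.5])
out = select_resolved(df, TypeCols(Number))
```

The maintenance concern raised above is that a `resolve`-like step would have to be repeated at every entry point that accepts a column selector, not just in a single wrapper as in this sketch.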

Dec 23 '22 20:12 bkamins

After more thinking I am giving it a 1.x milestone. Maybe we will add it at some point, but it is not likely we will do it fast. For now users need to use names or work with eachcol to filter on element type of a column.
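A small sketch of both workarounds, assuming a made-up data frame. It uses only `names`, `eachcol`, and `select` with a Boolean mask, all of which exist in DataFrames.jl.

```julia
using DataFrames

# Illustrative data; column z allows missing values.
df = DataFrame(id = ["a", "b"], x = [1, 2], z = [1.5, missing])

# Workaround 1: names with a type tests eltype(col) <: T, so z
# (element type Union{Missing, Float64}) is matched only when Missing is allowed.
strict = names(df, Number)
lenient = names(df, Union{Missing, Number})

# Workaround 2: build a Boolean mask from eachcol. The predicate can look at
# the element type or at the values themselves.
type_mask = [eltype(c) <: Union{Missing, Number} for c in eachcol(df)]
has_missing = [any(ismissing, c) for c in eachcol(df)]

numeric_df = select(df, type_mask)
```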

Feb 05 '23 08:02 bkamins

In this issue let us track all requests for basing column selection on column values (as column element type is just a special case).

In this post I discuss the choice in more detail.

If you feel we should add this functionality please vote up: 👍. If you feel it is OK not to have a special syntax for it please vote down: 👎.

Thank you!

Aug 17 '23 21:08 bkamins

My two cents about why I wouldn't recommend adding a new method:

1. The operation can be implemented in other ways already. The more methods there are to implement the same feature, the harder it is to read code written by third parties. This affects new users, who would become really confused about which methods to learn when they're learning the language.
2. Somewhat related to 1, adding new syntax for DataFrames.jl forces new users to learn syntax specific to DataFrames.jl (even if it's just to read other people's code). This is problematic if they're learning the Julia language in general.
3. From what's described, the implementation doesn't seem so easy and there are some issues involved. In a context where it's not trivial, I think implementing other features would be more beneficial. For example, any performance improvement seems more beneficial than implementing one more method for the same operation (e.g., I read somewhere about an improvement of groupby operations when there are a lot of small groups).

Aug 18 '23 15:08 alfaromartino

I appreciate the thoughtful examples in the blog post! With the examples you’ve given there, I think I should be able to wrap this functionality within TidierData.jl. The only piece I’m concerned about is making sure I escape the data frame in the right place since I have a bunch of functions that parse and modify the expression along the way. Will let you know if I run into roadblocks.

Aug 22 '23 16:08 kdpsingh

Looks like an interesting feature. I like it being an explicit functionality, as it makes it easier to find in the documentation. I was not able to find examples of value-based column selection in the DataFrames.jl documentation.

If there is no performance benefit of

select(df, Cols(startswith("a")) .& Vals(x -> any(ismissing(x))))

over

select(df, [startswith(string(n), "a") && any(ismissing, c)
                   for (n,c) in pairs(eachcol(df))])

perhaps it might as well be done by a macro in DataFramesMeta?

Aug 24 '23 12:08 tp2750

When working with PCA or cor(Matrix), it is better to keep only the columns of a Number type. How should I define the supertype?

using Pipe, Tidier

df = load_csv("airbnb_nyc_2019", false)
type_df = @pipe describe(df) |> select(_, [:variable, :eltype])
int_df = @chain type_df begin
    @filter(isa(eltype, Union{Type{Int64}, Type{Float64}}))
end

Is there a better way to define this type than @filter(isa(eltype, Union{Type{Int64}, Type{Float64}}))?

Dec 15 '23 23:12 math4mad

Hi @math4mad,

Thanks for the question. Just to clarify, are you asking:

In general, how to identify super types?
Or how to get this code to work in TidierData.jl?
Or how to only select columns containing integers/floats in either TidierData.jl or DataFrames.jl?

Or all of the above?

That may help with tailoring the reply a bit better. Thanks!

Dec 16 '23 03:12 kdpsingh

just select columns containing Numerical super-type

Dec 16 '23 12:12 math4mad

just select columns containing Numerical super-type

Do you mean to select all columns (denoted col below) for which:

1. eltype(col) <: Number
2. all(x -> x isa Number, col)
3. eltype(col) <: Union{Missing, Number}
4. all(x -> x isa Union{Missing, Number}, col)

(I am listing the four most common cases you might want to select.)
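The difference between these four conditions can be seen on a small made-up data frame. The `matching` helper below is a hypothetical name introduced only for this sketch; everything else is existing Julia and DataFrames.jl functionality.

```julia
using DataFrames

# Illustrative columns chosen so that the four conditions disagree.
df = DataFrame(
    a = [1, 2],              # eltype Int
    b = Any[1, 2.0],         # eltype Any, but every value is a Number
    c = [1, missing],        # eltype Union{Missing, Int}
    d = Any[1, missing],     # eltype Any, values are numbers or missing
    e = ["x", "y"],          # eltype String
)

# Hypothetical helper: names of the columns for which `pred(col)` is true.
matching(df, pred) = [n for (n, col) in zip(names(df), eachcol(df)) if pred(col)]

case1 = matching(df, col -> eltype(col) <: Number)
case2 = matching(df, col -> all(x -> x isa Number, col))
case3 = matching(df, col -> eltype(col) <: Union{Missing, Number})
case4 = matching(df, col -> all(x -> x isa Union{Missing, Number}, col))
```

Conditions based on `eltype` only inspect the declared element type, while the `all(...)` conditions inspect every value, so they cost more on long columns but also catch loosely typed columns such as `b` and `d`.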

Dec 16 '23 17:12 bkamins

For now, I think it would be option 2.

Dec 17 '23 04:12 math4mad
