DataFrames.jl
Support type-based column selectors
Currently, when applying a transformation to all columns of a specific type (or subtypes of an abstract type), a pattern such as transform(df, names(df, Number) .=> f) is used.
Ideally, this could be achieved with a column selector, e.g. transform(df, Cols(Number) .=> f).
While this is a minor convenience feature, it may make the column-selector API (even) more consistent, and users would not have to repeat the name of the data frame multiple times.
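For reference, a minimal sketch of the pattern that works today. The data frame and the function double are made up for illustration; only names, transform, and the broadcasted pair syntax come from the discussion.

```julia
using DataFrames

# Illustrative data; column names are assumptions, not taken from the issue.
df = DataFrame(id = ["a", "b", "c"], x = [1, 2, 3], y = [1.5, 2.5, 3.5])

# Hypothetical transformation used only for this example.
double(v) = 2 .* v

# names(df, Number) returns the names of columns whose eltype is a subtype of Number.
num_cols = names(df, Number)

# Broadcasting .=> builds one `column => function` pair per selected column.
out = transform(df, num_cols .=> double)
```

Note that df has to be written twice, which is the repetition the proposed Cols(Number) selector would remove.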
Yes - I just need to think about whether there are any corner cases that would lead to problems. We could even potentially allow df[:, Number] if it does not lead to problems.
OK - now I remember why we do not have this.
Except for the names function, all other column selectors are currently resolved in the context of AbstractIndex, not AbstractDataFrame (i.e. they have access only to column names, not to column contents).
So adding the requested functionality would require a significant redesign. This is of course doable.
@nalimilan - what do you think?
I agree it would be nice to be able to do transform(df, Cols(Number) .=> f) at least. But yeah, the implementation may not be trivial. (This was discussed briefly at https://github.com/JuliaData/DataFrames.jl/pull/2400.)
@nalimilan - I can do it. The only issue is that the PR might end up being 1000 lines and touch many files, so it will be hard to review (not sure yet - maybe it will be easier). Essentially we need to stop using AbstractIndex almost everywhere and instead pass around AbstractDataFrame. This is challenging because we need to correctly handle all types that DataFrames.jl defines (as deep down they all use AbstractIndex somewhere).
In other words, the original design of DataFrames.jl assumes such functionality will not be needed (AbstractIndex is not aware of column element types), so we would need to change a fundamental element of the design here.
@nalimilan - let us decide whether we:
- add it in the 1.4 release.
- postpone it to later releases for a decision.
- keep things as they are (i.e. require the names(df, "type") syntax).
I would like to finalize the scope of the 1.4 release so that we can have it before JuliaCon.
I am moving it to the 1.5 release for a decision.
I was thinking about it. The issue is that AbstractIndex was designed as:
https://github.com/JuliaData/DataFrames.jl/blob/b240458aca1681e74a94e979a0141b2b16f1a3e0/src/other/index.jl#L1
so it - by design - only supports name lookup.
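To make the constraint concrete, here is a toy sketch of a name-only index. ToyIndex and resolve are hypothetical names invented for this illustration; they are not the DataFrames.jl internals and only mimic the idea that an index knows names and positions but not column contents.

```julia
# Hypothetical toy types; these are NOT the DataFrames.jl internals.
struct ToyIndex
    lookup::Dict{Symbol, Int}   # column name => position
    names::Vector{Symbol}       # position => column name
end

ToyIndex(names::Vector{Symbol}) =
    ToyIndex(Dict(n => i for (i, n) in enumerate(names)), names)

# Name-based selectors can be resolved from the index alone.
resolve(idx::ToyIndex, name::Symbol) = idx.lookup[name]
resolve(idx::ToyIndex, r::Regex) =
    [i for (i, n) in enumerate(idx.names) if occursin(r, String(n))]

# A type-based selector needs the columns themselves, which the index does not hold,
# so in this sketch it has to take the column vectors as its first argument.
resolve(columns::Vector{<:AbstractVector}, ::Type{T}) where {T} =
    [i for (i, c) in enumerate(columns) if eltype(c) <: T]

idx = ToyIndex([:id, :x1, :x2])
columns = AbstractVector[["a", "b"], [1, 2], [1.5, 2.5]]

by_name = resolve(idx, :x1)
by_regex = resolve(idx, r"^x")
by_type = resolve(columns, Number)
```

The last method cannot be written against ToyIndex alone, which mirrors why a type-based selector would require passing the data frame (or its columns) through the selection code.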
Now the issue is that to create a DataFrame, we have to construct its index first. So we cannot even naturally have a back-reference to a data frame in the index.
In summary, this means that a major redesign of DataFrame, SubDataFrame, DataFrameRow, Index, and SubIndex would be needed if we wanted to allow for such a change. One particular consequence is that the 1.5 release would be incompatible with the 1.4 release at the binary level (and people often serialize/jld data frames).
@nalimilan - the question is if we want to do it.
An alternative would be to special-case such a selector before passing it to the index, but this would lead to an ugly design (in many places we would have to apply a patch that is hard to maintain).
After more thinking I am giving it a 1.x milestone. Maybe we will add it at some point, but it is not likely we will do it fast. For now users need to use names or work with eachcol to filter on the element type of a column.
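A short sketch of both workarounds, assuming a small made-up data frame. It also shows that matching on Number alone skips columns that allow missing values.

```julia
using DataFrames

# Illustrative data; the column names are assumptions.
df = DataFrame(name = ["a", "b"], n = [1, 2], score = [0.5, missing])

# Type-based selection through names: matches on the column's eltype.
strict = names(df, Number)                   # skips score, whose eltype allows missing
lenient = names(df, Union{Missing, Number})  # also accepts columns that allow missing

# The same filter written with eachcol, which exposes the column contents.
via_eachcol = [String(n) for (n, c) in pairs(eachcol(df)) if eltype(c) <: Number]
```

The eachcol form is more verbose, but the condition can be any predicate on the column, not only a check of its element type.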
In this issue let us track all requests for basing column selection on column values (as column element type is just a special case).
In this post I discuss the choice in more detail.
If you feel we should add this functionality please vote up: 👍. If you feel it is OK not to have a special syntax for it please vote down: 👎.
Thank you!
My two cents about why I wouldn't recommend adding a new method:
1. The operation can already be implemented in other ways. The more methods there are to implement the same feature, the harder it is to read code written by third parties. This affects new users, who would become really confused about which methods to learn when they're learning the language.
2. Somewhat related to 1, adding new syntax for DataFrames forces new users to learn syntax specific to DataFrames (even if it's just to read other people's code). This is problematic if they're learning the Julia language in general.
3. From what's described, the implementation doesn't seem so easy and there are some issues involved. In a context where it's not trivial, I think implementing other features would be more beneficial. For example, any performance improvement seems more beneficial than implementing one more method for the same thing (e.g., I read somewhere about an improvement of groupby operations when there are a lot of small groups).
I appreciate the thoughtful examples in the blog post! With the examples you’ve given there, I think I should be able to wrap this functionality within TidierData.jl. The only piece I’m concerned about is making sure I escape the data frame in the right place since I have a bunch of functions that parse and modify the expression along the way. Will let you know if I run into roadblocks.
Looks like an interesting feature. I like it being an explicit functionality, as it makes it easier to find in the documentation. I was not able to find examples of value-based column selection in the DataFrames.jl documentation.
If there is no performance benefit of
select(df, Cols(startswith("a")) .& Vals(x -> any(ismissing(x))))
over
select(df, [startswith(string(n), "a") && any(ismissing, c)
for (n,c) in pairs(eachcol(df))])
perhaps it might as well be done by a macro in DataFramesMeta?
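As a sketch of what the second form does today, assuming a made-up data frame: Vals in the first form is treated here as proposed syntax rather than an existing selector, so only the eachcol-based version is shown. It collects names instead of a Boolean mask, which is an equivalent way to express the same selection.

```julia
using DataFrames

# Illustrative data; the column names are assumptions.
df = DataFrame(a1 = [1, missing], a2 = [1, 2], b1 = [missing, "x"])

# Combine a name-based condition with a value-based one.
keep = [n for (n, c) in pairs(eachcol(df))
        if startswith(string(n), "a") && any(ismissing, c)]

out = select(df, keep)
```

Only a1 both starts with "a" and contains a missing value, so it is the only column kept.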
When working with PCA or cor(Matrix), it is better to have only columns of a Number type. How should the supertype be defined?
using Pipe, Tidier
df = load_csv("airbnb_nyc_2019", false)
type_df = @pipe describe(df) |> select(_, [:variable, :eltype])
int_df = @chain type_df begin
    @filter(isa(eltype, Union{Type{Int64}, Type{Float64}}))
end
Is there a better way to define the type used in @filter(isa(eltype,Union{Type{Int64},Type{Float64}}))?
Hi @math4mad,
Thanks for the question. Just to clarify, are you asking:
- In general, how to identify super types?
- Or how to get this code to work in TidierData.jl?
- Or how to only select columns containing integers/floats in either TidierData.jl or DataFrames.jl?
Or all of the above?
That may help with tailoring the reply a bit better. Thanks!
I just want to select the columns whose values belong to a numeric supertype.
Do you mean to select all columns (denoted col below) for which:
1. eltype(col) <: Number
2. all(x -> x isa Number, col)
3. eltype(col) <: Union{Missing, Number}
4. all(x -> x isa Union{Missing, Number}, col)
(I am listing the four most common cases you might want to select.)
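The four conditions can select different columns. The sketch below uses a made-up data frame and a hypothetical helper, cols_where, to show which columns each condition picks.

```julia
using DataFrames

# Illustrative data chosen so that the four conditions disagree.
df = DataFrame(
    ints    = [1, 2, 3],             # eltype Int
    floats  = [1.0, missing, 3.0],   # eltype Union{Missing, Float64}
    anynum  = Any[1, 2.5, 3],        # eltype Any, but every value is a Number
    anymiss = Any[1, missing, 3],    # eltype Any, values are Number or missing
    strs    = ["a", "b", "c"],
)

# Hypothetical helper: names of the columns that satisfy a predicate on the column.
cols_where(pred, df) = [String(n) for (n, c) in pairs(eachcol(df)) if pred(c)]

case1 = cols_where(c -> eltype(c) <: Number, df)
case2 = cols_where(c -> all(x -> x isa Number, c), df)
case3 = cols_where(c -> eltype(c) <: Union{Missing, Number}, df)
case4 = cols_where(c -> all(x -> x isa Union{Missing, Number}, c), df)
```

The eltype checks look only at the declared element type, while the all checks inspect every value, so they cost more on long columns but also accept loosely typed columns such as anynum.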
For now I think it would be option 2.
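A minimal sketch of option 2 applied to the correlation use case mentioned earlier. The data and column names are made up for illustration and are not taken from the dataset in the question.

```julia
using DataFrames, Statistics

# Illustrative data; the column names are assumptions.
df = DataFrame(price = [100, 150, 200], nights = [1.0, 2.0, 3.0],
               city = ["NYC", "NYC", "NYC"])

# Option 2: keep columns in which every value is a Number (checked value by value).
numeric = [n for (n, c) in pairs(eachcol(df)) if all(x -> x isa Number, c)]

# Convert the selected columns to a matrix and compute pairwise correlations.
m = Matrix{Float64}(df[:, numeric])
c = cor(m)
```

Columns containing missing values are excluded by this check, so the resulting matrix can be passed to cor directly.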