DataFrames.jl
DataFrames.jl copied to clipboard
Consider allowing transformations in sort
It would have to be carefully thought about as sort
has a complex API, but we could allow things like:
sort(df, :x => x -> x^2)
(so to allow to sort not only on columns but also on transformations)
This would be great to have.
We still have the problem with GroupedDataFrame
iterating through groups, so we get a conflict with Base
.
sort
is defined only for AbstractDataFrame
now only.
But because a GroupedDataFrame
iterates through groups, to keep a contract with Base
, sort
should sort groups.
Yes, but no one asks for this functionality so it is better to leave it out (so that we have a free hand to work out the details later when we have real use-cases; I want to avoid adding functionality that no one uses for two reasons: a) we are not sure the design is OK, b) if it is not used then it means it is not tested). We have sort
kwarg to sort by grouping columns and I guess it covers most of practical use cases.
Okay. If we implement this then I would deprecate @orderby
and scrap plans for the GroupedDataFrame
case.
I would still want where
to perform transformations by groups. That seems more essential. Thankfully there is no conflict with base.
As mentioned in DataFramesMeta here. For consistency with Base
and filter
it might make sense to sort
transformations to operate on rows. And sort
to operate on grouped data frames by sorting groups.
Then we can use orderby
to act on columns, the where
to sort
s filter
.
I'm not sure this is something we need to implement in the near future but its a logically consistent framework imo.
Given https://github.com/JuliaData/DataFrames.jl/pull/2652 we can support any transformation conforming to select
. In this select
transformations would operate on whole columns. Would this be OK?
CC @nalimilan
one problem is that this would create a new column in the data frame, right?
Is there some extra step we can use to delete the created column?
EDIT: It looks like it doesn't do that in unique
which is good.
It would not. We would do sortperm
of a temporary data frame
I say go for it. This will bring more consistency with the API and allow DataFramesMeta to deprecate @orderby
in favor of @sort
.
I think it would be nice to support this, but there's one gotcha: filter(:x => f, df)
currently applies operations row-wise. So we again face the same debate as at https://github.com/JuliaData/DataFrames.jl/issues/2323. Most of the time when sorting you'd apply a transformation to each row independently, so row-wise makes sense. But it could be confusing to remember that most functions take whole columns, but not sort
and filter
. This discussion also potentially affects map
if we want to support it one day.
I agree that filter
is problematic.
However, for sort
I find it natural that you pass what you want to sort, like now you pass :x
to sort on :x
, so whole-column approach is OK for me.
Also the difference is that for filter
to be fast we should not allocate whole temporary columns, which is - largely - not a problem for sort
. So, in summary, filter
and map
seem different (however, I find it disturbing that we use a different rule there; for filter
it is unfortunately probably too late to disallow the syntax).
Also it just occurred to me that filter
is different from all other transformations currently allowing selector => function
that that option is a first argument to the function (while in all other it is a last one). With map
it would also be first. While with sort
it will be last.
Also the difference is that for
filter
to be fast we should not allocate whole temporary columns, which is - largely - not a problem forsort
.
Are you sure? filter
does something which is almost exactly equivalent to ByRow
, since we need to allocate a Boolean column anyway to do the indexing. So a column-wise approach would be basically the same in terms of performance AFAICT.
Also it just occurred to me that
filter
is different from all other transformations currently allowingselector => function
that that option is a first argument to the function (while in all other it is a last one). Withmap
it would also be first. While withsort
it will be last.
Right. Though that's quite subtle and likely to confuse users.
Anyway, I'm OK to add column-wise transformations to sort
. But maybe we could also deprecate filter(:x => f, df)
in favor of subset(df, :x => ByRow(f))
just in case we don't want to keep it in the end. There's no conflict and no performance impact by default, so we could keep it for a long time, but at least we wouldn't commit to consider it as a recommended syntax until 2.0.
since we need to allocate a Boolean column anyway to do the indexing
Yes, I meant that it does not create intermediate temporary columns with data (but this is probably secondary, so we can ignore my comment)
deprecate
filter(:x => f, df)
Iet us discuss it in https://github.com/JuliaData/DataFrames.jl/issues/2654