DataFrames.jl icon indicating copy to clipboard operation
DataFrames.jl copied to clipboard

Consider allowing transformations in sort

Open bkamins opened this issue 3 years ago • 15 comments

It would have to be carefully thought about as sort has a complex API, but we could allow things like:

sort(df, :x => x -> x^2)

(so to allow to sort not only on columns but also on transformations)

bkamins avatar Oct 16 '20 09:10 bkamins

This would be great to have.

We still have the problem with GroupedDataFrame iterating through groups, so we get a conflict with Base.

pdeffebach avatar Oct 16 '20 17:10 pdeffebach

sort is defined only for AbstractDataFrame now only.

bkamins avatar Oct 16 '20 19:10 bkamins

But because a GroupedDataFrame iterates through groups, to keep a contract with Base, sort should sort groups.

pdeffebach avatar Oct 16 '20 19:10 pdeffebach

Yes, but no one asks for this functionality so it is better to leave it out (so that we have a free hand to work out the details later when we have real use-cases; I want to avoid adding functionality that no one uses for two reasons: a) we are not sure the design is OK, b) if it is not used then it means it is not tested). We have sort kwarg to sort by grouping columns and I guess it covers most of practical use cases.

bkamins avatar Oct 16 '20 19:10 bkamins

Okay. If we implement this then I would deprecate @orderby and scrap plans for the GroupedDataFrame case.

I would still want where to perform transformations by groups. That seems more essential. Thankfully there is no conflict with base.

pdeffebach avatar Oct 16 '20 19:10 pdeffebach

As mentioned in DataFramesMeta here. For consistency with Base and filter it might make sense to sort transformations to operate on rows. And sort to operate on grouped data frames by sorting groups.

Then we can use orderby to act on columns, the where to sorts filter.

I'm not sure this is something we need to implement in the near future but its a logically consistent framework imo.

pdeffebach avatar Oct 17 '20 21:10 pdeffebach

Given https://github.com/JuliaData/DataFrames.jl/pull/2652 we can support any transformation conforming to select. In this select transformations would operate on whole columns. Would this be OK?

CC @nalimilan

bkamins avatar Mar 11 '21 19:03 bkamins

one problem is that this would create a new column in the data frame, right?

Is there some extra step we can use to delete the created column?

EDIT: It looks like it doesn't do that in unique which is good.

pdeffebach avatar Mar 11 '21 20:03 pdeffebach

It would not. We would do sortperm of a temporary data frame

bkamins avatar Mar 11 '21 21:03 bkamins

I say go for it. This will bring more consistency with the API and allow DataFramesMeta to deprecate @orderby in favor of @sort.

pdeffebach avatar Mar 11 '21 21:03 pdeffebach

I think it would be nice to support this, but there's one gotcha: filter(:x => f, df) currently applies operations row-wise. So we again face the same debate as at https://github.com/JuliaData/DataFrames.jl/issues/2323. Most of the time when sorting you'd apply a transformation to each row independently, so row-wise makes sense. But it could be confusing to remember that most functions take whole columns, but not sort and filter. This discussion also potentially affects map if we want to support it one day.

nalimilan avatar Mar 11 '21 21:03 nalimilan

I agree that filter is problematic.

However, for sort I find it natural that you pass what you want to sort, like now you pass :x to sort on :x, so whole-column approach is OK for me.

Also the difference is that for filter to be fast we should not allocate whole temporary columns, which is - largely - not a problem for sort. So, in summary, filter and map seem different (however, I find it disturbing that we use a different rule there; for filter it is unfortunately probably too late to disallow the syntax).

bkamins avatar Mar 11 '21 22:03 bkamins

Also it just occurred to me that filter is different from all other transformations currently allowing selector => function that that option is a first argument to the function (while in all other it is a last one). With map it would also be first. While with sort it will be last.

bkamins avatar Mar 12 '21 09:03 bkamins

Also the difference is that for filter to be fast we should not allocate whole temporary columns, which is - largely - not a problem for sort.

Are you sure? filter does something which is almost exactly equivalent to ByRow, since we need to allocate a Boolean column anyway to do the indexing. So a column-wise approach would be basically the same in terms of performance AFAICT.

Also it just occurred to me that filter is different from all other transformations currently allowing selector => function that that option is a first argument to the function (while in all other it is a last one). With map it would also be first. While with sort it will be last.

Right. Though that's quite subtle and likely to confuse users.

Anyway, I'm OK to add column-wise transformations to sort. But maybe we could also deprecate filter(:x => f, df) in favor of subset(df, :x => ByRow(f)) just in case we don't want to keep it in the end. There's no conflict and no performance impact by default, so we could keep it for a long time, but at least we wouldn't commit to consider it as a recommended syntax until 2.0.

nalimilan avatar Mar 12 '21 13:03 nalimilan

since we need to allocate a Boolean column anyway to do the indexing

Yes, I meant that it does not create intermediate temporary columns with data (but this is probably secondary, so we can ignore my comment)

deprecate filter(:x => f, df)

Iet us discuss it in https://github.com/JuliaData/DataFrames.jl/issues/2654

bkamins avatar Mar 12 '21 15:03 bkamins