DataFrames.jl Sort on function of rows


Open jlumpe opened this issue 4 years ago • 6 comments

It would be nice if there were a function to sort rows by some function of an entire DataFrameRow instead of by individual column values. An example implementation would be:

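using DataFrames

# Permute the rows of `df` in place by applying `p` to every column.
# Note: `DataFrames._columns` is internal API and may change.
function permuterows!(df::DataFrame, p)
    for col in DataFrames._columns(df)
        permute!(col, p)
    end
    return df
end

# Sort the rows of `df` in place. Keyword arguments (`by`, `lt`, `rev`, ...)
# are forwarded to `sortperm` and operate on whole `DataFrameRow`s.
function sortrows!(df::DataFrame; kw...)
    p = sortperm(eachrow(df); kw...)
    permuterows!(df, p)
    return df
end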
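
A non-mutating variant that stays within public API could be sketched as follows. The helper name `sortrows_by` and the sample data are hypothetical, not part of DataFrames.jl; the keys are computed once per row and then handed to `sortperm`.

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): sort rows by a key
# computed from each whole DataFrameRow and return a new data frame.
function sortrows_by(f, df::AbstractDataFrame; kw...)
    rowkeys = [f(row) for row in eachrow(df)]  # one key per row
    p = sortperm(rowkeys; kw...)               # permutation that sorts the keys
    return df[p, :]                            # reordered copy of the rows
end

df = DataFrame(a = [3, 1, 2], b = [0, 5, 0])
sorted = sortrows_by(row -> row.a + row.b, df)
sorted_rev = sortrows_by(row -> row.a + row.b, df; rev = true)
```

Keyword arguments such as `rev` or `lt` are forwarded to `sortperm`, so they apply to the computed keys rather than to individual cells.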

Jun 26 '20 21:06 jlumpe

Thank you for raising this.

I would even say that the current implementation is somewhat inconsistent with the rest of the API, as it was established before we decided that a data frame is a collection of rows. Ideally, in my opinion, lt and by would work exactly as you propose (but currently they receive a single element instead of the whole row).

I am not sure what would be best to do. For now I am marking it as post-1.0 functionality.

Jun 27 '20 07:06 bkamins

As @nalimilan pointed out we can use the pattern from filter in the future, e.g.

sort(df, by = AsTable(:) => fun)

We would only have to sort out how to combine it then with lt, but it should be doable.
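
To make the filter pattern concrete, here is a small sketch. The `sort(df, by = AsTable(:) => fun)` form above is only a proposal, so the sort is emulated here with `combine` and `sortperm`; the function `fun` and the sample data are made up for illustration.

```julia
using DataFrames

df = DataFrame(a = [3, 1, 2], b = [0, 5, 0])
fun(nt) = nt.a + nt.b   # receives each row as a NamedTuple

# Existing pattern in filter: the predicate sees the whole row.
kept = filter(AsTable(:) => (nt -> fun(nt) > 2), df)

# Emulating the proposed sort: compute one key per row with the same
# source => function pattern, then reorder the rows by that key.
key = combine(df, AsTable(:) => ByRow(fun) => :key).key
sorted = df[sortperm(key), :]
```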

Jun 27 '20 11:06 bkamins

Since this is (probably) the only place where a function is applied to each cell in DataFrames, apart from broadcast, it would make sense to deprecate the current behavior so that we can use the standard pattern used for select and co.

Jun 27 '20 15:06 nalimilan

I was thinking about it. Maybe a less breaking option would be to leave by and lt "as is" (so with them we treat a data frame as a matrix, like in broadcasting), but add byrow and ltrow kwargs. byrow would process the row as a whole, and ltrow would take the output of byrow for comparison. Then we would disallow mixing by/lt and byrow/ltrow. Also, if one uses byrow/ltrow, then order would be disallowed, and we would allow AsTable when specifying rows so that a NamedTuple is passed to byrow for speed.
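
A rough sketch of how these rules might behave, written as a wrapper function. Everything named here is hypothetical: `sort` has no `byrow` or `ltrow` keywords, and `sort_sketch` only imitates the proposed semantics, including the ban on mixing the two keyword families.

```julia
using DataFrames

# Hypothetical wrapper imitating the proposed keywords.
function sort_sketch(df::AbstractDataFrame;
                     by = nothing, lt = nothing,
                     byrow = nothing, ltrow = nothing)
    cellwise = by !== nothing || lt !== nothing
    rowwise = byrow !== nothing || ltrow !== nothing
    if cellwise && rowwise
        throw(ArgumentError("cannot mix by/lt with byrow/ltrow"))
    end
    if rowwise
        f = something(byrow, identity)
        rowkeys = [f(row) for row in eachrow(df)]  # byrow sees the whole row
        # ltrow compares the outputs of byrow
        p = sortperm(rowkeys; lt = something(ltrow, isless))
        return df[p, :]
    end
    # Otherwise fall back to the existing cell-wise behavior.
    return sort(df; by = something(by, identity), lt = something(lt, isless))
end

df = DataFrame(name = ["b", "a", "c"], score = [2, 3, 1])
by_score_desc = sort_sketch(df; byrow = r -> r.score, ltrow = (x, y) -> x > y)
```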

Jun 29 '20 14:06 bkamins

That would be the only place where we support a "*row" argument, so that would be weird.

Actually I realize that it's very likely that people only use that feature when sorting on a single column, right? To limit breakage, we could treat sort(df, cols, by=f) as sort(df, cols, by=cols => f), i.e. f would be called as f.(df[:, cols]...). That would be consistent both with the current API when cols is a single column, and with the general pattern used in select, etc. Though of course we would need a new syntax for when cols refers to multiple columns. Not sure how common it is to use it with a custom by or lt.
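
For the single-column case, the equivalence can be illustrated with the current API. This is a small sketch with made-up data; `abs` stands in for `f`.

```julia
using DataFrames

df = DataFrame(x = [-3, 1, -2], y = ["a", "b", "c"])

# Current API: `by` is applied to each value of the sorting column.
current = sort(df, :x, by = abs)

# The same ordering written as a broadcast over the column, which is what
# `by = cols => f` would reduce to when `cols` is a single column.
p = sortperm(abs.(df.x))
manual = df[p, :]
```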

Jul 07 '20 15:07 nalimilan

+1 to this. I think it would be a great feature to have. I don't fully understand why we can't do

sort(df, :x => fun1, :y => fun2)

where it sorts by the output of fun1(df.x) first and then by the output of fun2(df.y).
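
For comparison, the existing `order` wrapper already accepts per-column keywords. One caveat: its `by` is applied to each value of the column, not to the whole column as `fun1(df.x)` would be. A minimal sketch with made-up data:

```julia
using DataFrames

df = DataFrame(x = [-1, 2, 1, -2], y = [1, 2, 3, 4])

# Sort by abs(x) first, then break ties by y in descending order.
sorted = sort(df, [order(:x, by = abs), order(:y, rev = true)])
```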

Jul 20 '20 18:07 pdeffebach
