DataFrames.jl
Sort on function of rows
It would be nice if there was a function to sort rows by some function of an entire DataFrameRow
instead of by individual column values. An example implementation would be:
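# Reorder every column of df in place according to the permutation p.
# Note: DataFrames._columns is an internal, non-public accessor.
function permuterows!(df::DataFrame, p)
    for col in DataFrames._columns(df)
        permute!(col, p)
    end
    return df
end

# Sort the rows of df in place, comparing whole DataFrameRow objects.
# Keyword arguments (by, lt, rev, ...) are forwarded to sortperm.
function sortrows!(df::DataFrame; kw...)
    p = sortperm(eachrow(df); kw...)
    permuterows!(df, p)
    return df
end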
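Until something like this exists, a similar effect can be had with public API only. This is a minimal sketch, not the proposed implementation: `rowkey` is a hypothetical key function made up for illustration, and the result is a new data frame rather than an in-place sort.

```julia
using DataFrames

df = DataFrame(a = [3, 1, 2], b = [1, 5, 1])

# Hypothetical row-level key: the sum of both columns.
rowkey(row) = row.a + row.b

# Compute one key per row, then take the permutation that sorts the keys.
p = sortperm([rowkey(row) for row in eachrow(df)])

# Index with the permutation to get a sorted copy.
sorted = df[p, :]
```

The keys here are 4, 6 and 3, so the third row comes first, followed by the first and then the second.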
Thank you for raising this.
I would even say that the current implementation we have is somewhat inconsistent with the rest of the API, as it was established before a data frame was decided to be a collection of rows. Ideally lt and by would work exactly as you propose in my opinion (but they just take one element instead of the whole row).
I am not sure what would be best to do. For now I mark it as post 1.0 functionality.
As @nalimilan pointed out, we can use the pattern from filter in the future, e.g.
sort(df, by = AsTable(:) => fun)
We would then only have to sort out how to combine it with lt, but it should be doable.
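For reference, here is roughly what the filter pattern looks like today, together with one way to emulate a row-wise sort key using the same `AsTable(:) => fun` form. This is only a sketch: `fun` and the temporary column name `:_key` are made up for illustration, and `sort` itself does not accept this form.

```julia
using DataFrames

df = DataFrame(a = [3, 1, 2], b = [1, 5, 1])

# Existing filter pattern: the function receives each row as a NamedTuple.
kept = filter(AsTable(:) => nt -> nt.a + nt.b > 3, df)

# Hypothetical row-level key function.
fun(nt) = nt.a + nt.b

# Emulate a row-wise sort: materialize the key as a temporary column,
# sort on it, then drop it again.
tmp = transform(df, AsTable(:) => ByRow(fun) => :_key)
sorted = select(sort(tmp, :_key), Not(:_key))
```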
Since this is (probably) the only place where a function is applied to each cell in DataFrames, apart from broadcast, it would make sense to deprecate the current behavior so that we can use the standard pattern used for select and co.
I was thinking about it. Maybe a less breaking option would be to leave by and lt "as is" (so with them we treat the data frame as a matrix, like in broadcasting), but add byrow and ltrow kwargs. byrow would process the row as a whole, and ltrow would take the output of byrow for comparison. Then we would disallow mixing by/lt with byrow/ltrow. Also, if one uses byrow/ltrow then order would be disallowed, and we would allow AsTable when specifying rows, so that a NamedTuple is passed to byrow for speed.
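To make the intended semantics concrete, here is a rough emulation with Base's sortperm. The byrow and ltrow kwargs do not exist in DataFrames; the two functions below are hypothetical stand-ins, passed as the ordinary by and lt of sortperm over a vector of rows.

```julia
using DataFrames

df = DataFrame(a = [3, 1, 2], b = [1, 5, 1])

# Hypothetical stand-ins for the proposed kwargs:
# byrow maps a whole row to a key, ltrow compares two such keys.
byrow(row) = row.a - row.b
ltrow(k1, k2) = abs(k1) < abs(k2)

# sortperm applies `by` to each row and compares the results with `lt`.
p = sortperm(collect(eachrow(df)); by = byrow, lt = ltrow)
sorted = df[p, :]
```

The keys are 2, -4 and 1, compared by absolute value, so the rows end up in the order 3, 1, 2.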
That would be the only place where we support a "*row" argument, so that would be weird.
Actually I realize that it's very likely that people only use that feature when sorting on a single column, right? To limit breakage, we could treat sort(df, cols, by=f) as sort(df, cols, by=cols => f), i.e. f would be called as f.(df[:, cols]...). That would be consistent both with the current API when cols is a single column, and with the general pattern used in select, etc. Though of course we would need a new syntax for when cols refers to multiple columns. Not sure how common it is to use it with a custom by or lt.
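A small sketch of the single-column case, where both readings agree. The first call uses the existing by keyword, which is applied to each cell. The second part spells out the proposed reading by hand, assuming f is broadcast over the selected column; the `by = cols => f` form itself is not an existing syntax.

```julia
using DataFrames

df = DataFrame(x = [-3, 1, -2])

# Current behavior: `by` is applied to every cell of the sorting column.
current = sort(df, :x, by = abs)

# Proposed reading for a single column, written out manually:
# broadcast f over the column and sort on the resulting keys.
f = abs
keys = f.(df.x)
proposed = df[sortperm(keys), :]
```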
+1 to this. I think it would be a great feature to have. I don't fully understand why we can't do
sort(df, :x => fun1, :y => fun2)
where it sorts by the output of fun1(df.x) first and then by fun2(df.y) second.
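For what it's worth, something close to this can already be written with order, which takes a per-column by. A sketch, with fun1 and fun2 as placeholder functions chosen for illustration; note that here they are applied to each element of the column, not to the column as a whole.

```julia
using DataFrames

df = DataFrame(x = [-2, 1, 2, -1], y = [1, 2, 0, 0])

# Placeholder key functions, applied element by element.
fun1 = abs
fun2 = v -> -v

# Sort by fun1 of x first, then break ties by fun2 of y.
sorted = sort(df, [order(:x, by = fun1), order(:y, by = fun2)])
```

Rows with the same abs(x) are ordered by descending y, since fun2 negates the value.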