DataFrames.jl sort missing data placement

sort missing data placement

Open chris-b1 opened this issue 4 years ago • 7 comments

Currently, missing is treated as the largest value in a sort. pandas has a na_position kwarg that lets you specify how missing data should be ordered, by default placing it last, regardless of ascending or descending sort.

Whether that default is correct (I like it, but could be familiarity) I do think having an option on sort/order to control missing placement would be nice.

DataFrames 0.21

julia> df = DataFrame(:a => [1.0, 9.0, 2.0, missing])
4×1 DataFrame
│ Row │ a        │
│     │ Float64? │
├─────┼──────────┤
│ 1   │ 1.0      │
│ 2   │ 9.0      │
│ 3   │ 2.0      │
│ 4   │ missing  │

julia> sort(df, :a)
4×1 DataFrame
│ Row │ a        │
│     │ Float64? │
├─────┼──────────┤
│ 1   │ 1.0      │
│ 2   │ 2.0      │
│ 3   │ 9.0      │
│ 4   │ missing  │

julia> sort(df, :a, rev=true)
4×1 DataFrame
│ Row │ a        │
│     │ Float64? │
├─────┼──────────┤
│ 1   │ missing  │
│ 2   │ 9.0      │
│ 3   │ 2.0      │
│ 4   │ 1.0      │

pandas

df = pd.DataFrame({'a': [1.0, 9.0, 2.0, np.NaN]})

df
Out[42]: 
     a
0  1.0
1  9.0
2  2.0
3  NaN

df.sort_values('a')
Out[43]: 
     a
0  1.0
2  2.0
1  9.0
3  NaN

df.sort_values('a', ascending=False)
Out[44]: 
     a
1  9.0
2  2.0
0  1.0
3  NaN

df.sort_values('a', ascending=False, na_position='first')
Out[45]: 
     a
3  NaN
1  9.0
2  2.0
0  1.0

May 22 '20 11:05 chris-b1

@nalimilan - what do you think? We could add this, but it is against the design in Base.

May 22 '20 11:05 bkamins

Got it, yeah not sure how hard this should be special cased. For what it's worth below is an isless comparer that got me to what I needed (sort with rev=true and missing at the end)

isless_missingsmallest(::Missing, ::Any) = true
isless_missingsmallest(::Any, ::Missing) = false
isless_missingsmallest(::Missing, ::Missing) = false
isless_missingsmallest(x, y) = isless(x, y)

julia> sort(df, :a; lt=isless_missingsmallest)
4×1 DataFrame
│ Row │ a        │
│     │ Float64? │
├─────┼──────────┤
│ 1   │ missing  │
│ 2   │ 1.0      │
│ 3   │ 2.0      │
│ 4   │ 9.0      │

julia> sort(df, :a; lt=isless_missingsmallest, rev=true)
4×1 DataFrame
│ Row │ a        │
│     │ Float64? │
├─────┼──────────┤
│ 1   │ 9.0      │
│ 2   │ 2.0      │
│ 3   │ 1.0      │
│ 4   │ missing  │

May 22 '20 13:05 chris-b1

Do you have an actual use case for sorting missings first? I wonder whether it's worth adding more complexity to that function. Maybe Missings.jl could provide a custom isless function like the one you showed and groupby could allow passing it.

May 22 '20 13:05 nalimilan

The actual usecase is missings last, because I'm using a descending sort. This would be pretty typical for reporting-type tasks - e.g., "show sales by region, ranked from largest to smallest, with any region missing data at the bottom of the report"

Most of the time could probably work around by filling missings with 0s

May 22 '20 14:05 chris-b1

Ah yes it's indeed unfortunate that missing values are sorted first when reversing order. I guess adding a keyword argument could be justified then. We could have missinglast::Bool=!rev. Though you wouldn't be able to set it when passing order as they would conflict.

May 23 '20 14:05 nalimilan

I think this feature is useful. How about an optional keyword in sort such as skipmissing=true?

Nov 25 '20 17:11 nivupaiR

It sure can be added as an opt-in, with the current behaviour being the default. We just need to wrork out the details of the design.

Nov 25 '20 18:11 bkamins

How did this play out?

May 15 '23 02:05 russmack

As you can see it is marked for 1.x release. We did not get additional requests for this feature during the last three years, so it did not get a high priority.

The current solution for this requirement is:

julia> sort(df, rev=true, lt=(x, y) -> ismissing(x) ? true : ismissing(y) ? false : isless(x, y))
4×1 DataFrame
 Row │ a
     │ Float64?
─────┼───────────
   1 │       9.0
   2 │       2.0
   3 │       1.0
   4 │ missing

Essentially what we would need to add is a special variant of isless which considers missing as the smallest not as the largest. The question is what should be the name of this function?

May 15 '23 06:05 bkamins

I think I might be able to implement this. Would we handle this via a keyword argument, or a new variant of isless?

Jun 23 '23 09:06 alonsoC1s

@alonsoC1s - I think adding a function would be better. Also, it probably makes sense to have it in Missings.jl. The reason is that then it could be used in any sorting/permutation function in the whole ecosystem.

The question is what the name should be. The isless_missingsmallest seems too lengthy.

@nalimilan @LilithHafner - do you have opinions?

Jun 25 '23 08:06 bkamins

It would be nice to have this compose nicely with other orderings (e.g. use lt=⊂ and missingsmallest=true). Perhaps missingsmallest could be a higher order function with

missingsmallest(f) = (x,y) -> ismissing(y) ? false : ismissing(x) ? true : f(x, y)

Then the name would be missingsmallest(isless).

If efficiency is important, we should make the missingsmallest option dispatchable, though. Using an opaque lt function rules out most optimizations. This could look like

struct MissingSmallest{O<:Ordering} <: Ordering
    o::O
end

lt(::MissingSmallest, ::Missing, ::Missing) = false
lt(::MissingSmallest, ::Any, ::Missing) = false
lt(::MissingSmallest, ::Missing, ::Any) = true
lt(ms::MissingSmallest, x::Any, y::Any) = lt(ms.o, x, y)

@static if isdefined(Base.Sort, :MissingOptimization) && isdefined(Base.Sort, :_sort!)
    function Base.Sort._sort!(v::AbstractVector, a::MissingOptimization, o::MissingSmallest, kw)
        # put missing at end
        Base.Sort._sort!(v, a.next, o.o, (kw...; hi=new_hi))
    end
    function Base.Sort._sort!(v::AbstractVector, a::MissingOptimization, o::ReverseOrdering{<:MissingSmallest}, kw)
        # put missing at beginning
        Base.Sort._sort!(v, a.next, ReverseOrdering(o.fwd.o), (kw...; lo=new_lo))
    end
end

If this is implemented as a new Ordering, that would require a keyword argument or users creating their own Orderings.

The users creating their own orderings approach can be implemented in Missings.jl

Jun 25 '23 17:06 LilithHafner

Thank you for your nice ideas 😄.

If efficiency is important, we should make the missingsmallest option dispatchable, though.

I think efficiency is important.

Why do you think your approach with missingsmallest(f) would not be efficient? Because it introduces extra branching?

An approach with Ordering subtype indeed makes sense as an alternative. My questions would be:

why do you think it would be faster?
I would be a bit afraid that using Ordering would be hard for users to learn (but maybe I am wrong here; it just seems to me that using the missingsmallest(f) wrapper is simpler to grasp for a typical user)

Jun 25 '23 17:06 bkamins

why do you think it would be faster?

With an opaque lt ordering (e.g. missingsmallest(f)), we can get a plenty fast comparison based sort, and I imagine a non-adaptive comparison based sorting algorithm would perform identically on lt=missingsmallest(isless) and lt=isless but the sneaky tricks to optimize all the special cases don't apply. EDIT: yes they do, see @bkamins's comment.

For example, sorting a Vector{Union{Missing, Int64}} in Forward order typically shifts the missing values to the end so that the remaining sorting is faster (either branch free or simply very well predicted branches, I haven't checked) and in this case we can also use countsort or radixsort to sort the remaining values which are all Int64. ~With missingsmallest(f) we'd be stuck with vanilla ScratchQuickSort or InsertionSort.~

I would be a bit afraid that using Ordering would be hard for users to learn (but maybe I am wrong here; it just seems to me that using the missingsmallest(f) wrapper is simpler to grasp for a typical user)

Yeah. Usability-wise, I think sort!(df, missingssmallest=true) >> sort!(df, lt=missingssmallest(isless) ≈> sort(df, ord=MissingSmallest(Forward)). However, if we don't add missingsmallest to base, then sort!(v::Vector, missingsmallest=true) will not be an option. I think it would be good to get an idea of weather or not folks want to support this option in Base before finalizing the design here.

Jun 25 '23 18:06 LilithHafner

For example, sorting a Vector{Union{Missing, Int64}} in Forward order typically shifts the missing values to the end so that the remaining sorting is faster

We could add special methods that would detect that missingsmallest(f) is passed that would do the same tricks that we already do but just put missing in another place.

I think it would be good to get an idea of weather or not folks want to support this option in Base before finalizing the design here.

I have pinged #internals on Slack. Let us see if someone has an opinion.

I think that the preference will be not to add this extra kwarg. The issue is that, in general, it could conflict with lt or by passed by user (that could treat missing in some different non-standard way). I think passing lt=missingssmallest(isless) is more composable.

Jun 25 '23 18:06 bkamins

See also previous discussion at https://github.com/JuliaData/Missings.jl/issues/142. There I had proposed lt=missingisless. lt=missingsmallest(isless) sounds a bit too complex for casual users, and "smallest" isn't really the vocabulary we use regarding sorting. Though maybe we could have both something like missingisless for convenience (possibly in Missings.jl) and something lower-level but more flexible and on which algorithms would dispatch.

Jun 25 '23 20:06 nalimilan

We could add special methods that would detect that missingsmallest(f) is passed that would do the same tricks that we already do but just put missing in another place.

Yep! Good idea, you're right. There should be no performance loss from using the lt approach.

lt=missingsmallest(isless) seems like the soundest approach internally (up to renaming), but I don't see a problem with an alias. I like that Reverse === ReverseOrdering(ForwardOrdering()).

Bikeshed: we could call the higher order function missingsfirst

Jun 25 '23 21:06 LilithHafner

Current feeling - missingsfirst seems more intuitive than missingsless under rev = true and equally intuitive under rev = false.

Jun 25 '23 22:06 jariji

welll... I was thinking that sort!(lt=missingsfirst(isless), rev=true) would put missing values at the end, which is the opposite of intuitive imo. We can't really have sort!(lt=X, rev=true) and sort!(lt=X, rev=false) put missing values in the same place because rev is applied by swapping the arguments to lt, so having the term first in X is probably a bad idea in retrospect. (unless someone can think of a way of implementing this that lets us have sort!(lt=missingsfirst(isless), rev=true) do the obvious thing of putting missing first and sorting everything else in reverse order)

Jun 25 '23 22:06 LilithHafner

I was thinking that sort!(lt=missingsfirst(isless), rev=true) would put missing values at the end

I agree. I feel we need lt to define the standard order and rev=true to reverse it entirely. As opposed to SQL I think in Julia we do not treat missing in a special way in general (except for selected packages, but this functionality should be general).

So the conclusion of what is to do would be to have in Missings.jl:

Define missingssmallest(fun)
Define const missingsless = missingssmallest(isless)
Add documentation and make a release
Add special implementations of sorting algorithms if they encounter lt=missingsless (this can be a separate PR from the PR implementing the functionality)

The only thing is what names should we use (and this is usually hard).

Are there better alternatives than missingssmallest and missingless (as commented missingsfirst is unfortunately not good as we are specifying order not location)?

Jun 26 '23 06:06 bkamins

missingsleast would be more suitable than missingssmallest. missingless is fine too, its only downside being that it could be interpreted as meaning "without missings".

Jun 26 '23 06:06 jariji

Note that my suggestion was missingisless, not missingsless. :-)

Something that worries me if we support both lt=missingisless and lt=missingsmallest(isless) is that the two names are quite similar and people who don't get the subtlety of isless vs "smallest" could confuse them. Maybe we could merge them? With some tricks it should be possible to write lt=missingsmallest as a shorthand for lt=missingsmallest(isless).

Instead of "smallest", we could also say "lowest", as it's more consistent with "lower than"/"greater than".

Jun 26 '23 08:06 nalimilan

Instead of "smallest", we could also say "lowest", as it's more consistent with "lower than"/"greater than".

"least" goes with "less".

Jun 26 '23 17:06 jariji

I feel like this feature could be well recieved in the wider ecosystem, it might be worth it to propose it upstream (Missings.jl or perhaps even Base itself)

Jun 28 '23 13:06 alonsoC1s

@alonsoC1s - yes. This should be added in Missings.jl. I just keep the discussion here to have all things in one place. I understand you offered to implement it. If this is the case please feel free to open a PR in Missings.jl adding what is agreed. We still have to settle on the names, but these can be easily updated when we have the PR opened (and it would help to move the things forward). Thank you!

Jun 28 '23 15:06 bkamins

@alonsoC1s - are you planning to make a PR fo Missings.jl with this? (if not I can make it to move things forward)

Jul 15 '23 07:07 bkamins

Sorry, it totally slipped my mind. I am planning on it, it will be done in a few days at most

Jul 15 '23 14:07 alonsoC1s

I just opened PR https://github.com/JuliaData/Missings.jl/pull/144 at Missings.jl as a draft PR to work out the naming. I followed the roadmap set by @bkamins (i.e implementing the partial order function solution).

My two cents on the naming: The missingsless function (isless but modified so missing is the smallest value) can be a bit hard to undestand and for some reason I found myself making typos even while writing simple tests and things like that. Perhaps something like missingslast would be intuitive?

Jul 19 '23 21:07 alonsoC1s

Thank you. So I close this issue, and we can move the discussion to Missings.jl

Jul 19 '23 21:07 bkamins

Let's continue the discussion on naming given that it has already been developed here. I don't really like missingsless as it could mean "missings less", i.e. without missing values, and it uses the plural while missingsmallest uses the singular. Why not use missingisless, which I find clearer?

My two cents on the naming: The missingsless function (isless but modified so missing is the smallest value) can be a bit hard to undestand and for some reason I found myself making typos even while writing simple tests and things like that. Perhaps something like missingslast would be intuitive?

I guess this criticism applies to missingisless as well? missingslast is problematic as with sort(x, lt=missingslast), they would end up first instead.

@jariji proposed missingsleast. Do you think that's better? I'm not convinced it would be easier to remember or type for users. Among the options there's also missinglowest.

Jul 24 '23 09:07 nalimilan

DataFrames.jl DataFrames.jl copied to clipboard

sort missing data placement

DataFrames.jl
DataFrames.jl copied to clipboard