julia icon indicating copy to clipboard operation
julia copied to clipboard

skipmissing on Dict: inconsistent results

Open aplavin opened this issue 2 years ago • 19 comments

I noticed a pretty weird and very hard to catch inconsistency:

julia> d = Dict(:x => 1, :y => 2, :z => 0)

julia> argmax(d)
:y

julia> argmax(skipmissing(d))
:z

No matter how one interprets dicts iteration and skipmissing semantics, this result is totally unexpected.

The return type is correct, so this may lead to very nasty bugs in user code.

Is that a bug in Julia? Something should probably throw an error here, if dicts + skipmissing are not supported...

Some discussion on slack: https://julialang.slack.com/archives/C67910KEH/p1674391331249859.

aplavin avatar Jan 22 '23 13:01 aplavin

The underlying issue, if there is one, is that skipmissing just iterates a Dict, which gives us pairs of key => value. argmax then maximizes that pair and since the first element of the pair is the key, this ends up maximizing the key (always - the key is semantically unique due to the Dict after all). "Fixing" this behavior is tricky, because it either means

  • changing the way dicts iterate (that's breaking, so we can't do that)
  • change how skipmissing iterates when wrapping a Dict (possible, but very ugly, since it means skipmissing needs to be specialized for everything that doesn't just produce values when iterating). This also means making a decision about what skipmissing on a Dict should mean - should it skip entries of the form missing => foobar?

In an ideal world, iterating over a dict would require one of keys, values or pairs (which I'm also not entirely sure would fix this particular thing, but would prevent argmax(d) from working), but since we don't live in that world, another option is to disallow skipmissing(dict) entirely and suggest the use of skipmissing(keys(dict)), skipmissing(values(dict)), skipmissing(pairs(dict)). Additionally, this can be worked around by being explicit with what argmax should check, using argmax(last, dict) or argmax(first, dict) (this is consistent with skipmissing).

Seelengrab avatar Jan 22 '23 14:01 Seelengrab

"Fixing" this behavior is tricky

Don't think it's really tricky: as I suggest in the first post, it would be great if something in argmax(skipmissing(d)) simply threw an error when passed a dict. The cleanest place to error seems skipmissing(::AbstractDict) = error(...), right?

aplavin avatar Jan 22 '23 15:01 aplavin

Yes, that's more or less what I suggested in the last paragraph, giving a (hopefully helpful) error message instead. That may result in people complaining though, since it is a breaking change to something that used to work (albeit while doing the "wrong" thing..). There were situations in the past where a new error was introduced for things that were likely a mistake, but depending on what other people think, changing this is maybe too breaking :shrug: Either way, the workarounds to get the desired behavior exist and are mentioned above.

Seelengrab avatar Jan 22 '23 15:01 Seelengrab

skipmissing(::AbstractDict) = error(...), right?

It may be better to deprecate instead of throwing an error.

petvana avatar Jan 22 '23 21:01 petvana

How do you deprecate, if not by throwing an error? Unless you mean, throw a warning instead for a while? I'd argue if you're calling skipmissing on a Dict then your code is already broken.

I wonder if a method with signature skipmissing(f, c) could be useful, e.g. skipmissing(last, d) (or if key(p::Pair)=first(p); value(p::Pair)=last(p), then skipmissing(value, d))? skipmissing(identity, c) could be considered a fallback for skipmissing(c).

uniment avatar Jan 22 '23 23:01 uniment

As an aside, argmax(f, c) has behavior which in this context is mildly counterintuitive; whereas argmax(d) returns a key, argmax(last, d) returns a Pair. I also don't see how it would solve the issue here.

My instinct was to try this, but [as we've seen many times] it doesn't work:

julia> findmax(last, (k=>v for (k,v) in d if !ismissing(v)))[2]
ERROR: MethodError: no method matching keys(::Base.Iterators.Filter{var"#8#10", Dict{Symbol, Int64}})

This works, but evaluation may be a little too eager for your tastes:

argmax(filter(!ismissing ∘ value, d))

This also feels like a problem for which transducers are probably appropriate, but this throws an error:

using Transducers
d |> Filter(!ismissing ∘ last) |> argmax

As an aside, I prefer Rich Hickey's convention of calling such transducers filtering rather than Filter, and mapping rather than map.

uniment avatar Jan 22 '23 23:01 uniment

I was thinking about the warning using @deprecate for a while. It generates only warning and the code still runs, as in https://github.com/JuliaLang/julia/blob/670190c3393e7a2193ad9e40cfa12ec643ceba75/base/deprecated.jl#L215

petvana avatar Jan 22 '23 23:01 petvana

Thanks, TIL 😇

uniment avatar Jan 22 '23 23:01 uniment

It may be better to deprecate instead of throwing an error.

Yes, a deprecation may be in order.

I also don't see how it would solve the issue here.

It solves the issue because its result is consistent with wrapping the dict in skipmissing. That it then returns a pair again is once again due to Dict iterating over pairs, which is exposed now instead of being silently hidden (& giving wrong results!) behind the iterating behavior of skipmissing over a Dict. I agree that this is in principle weird, but again, it's nothing that we can change now. Suggesting first(argmax(last, dict))) as a replacement seems ok to me, as a workaround to get the correct behavior.

Seelengrab avatar Jan 23 '23 08:01 Seelengrab

The intent is to exclude keys whose corresponding values are missing, no?

julia> d = Dict(:x => 1, :y => missing, :z => 2);

julia> first(argmax(last, d))
:y

uniment avatar Jan 23 '23 09:01 uniment

In this particular case, yes. Again, there is an inherent ambiguity due to how skipmissing, the iteration interface and Dict play with each other - after all, you can have missing => foobar in a Dict.

Seelengrab avatar Jan 23 '23 10:01 Seelengrab

For sure. I don't understand how it's a suitable workaround to achieve the objectives of the OP though; it seems like first(argmax(last, d)) here is just a longer way to say argmax(d).

uniment avatar Jan 23 '23 10:01 uniment

The map function already explicitly throws an error while attempting to use Dict without calling pairs. I find it very natural and it does not really affect the code much (one extra call to pairs or keys or values). However, this only shows that the current iterating behaviour of Dicts (while being useful in some cases) a potential foot gun for many other functions that do not expect Dict as an input but still work and silently produce wrong/unexpected answers.

It is possible of course to throw an error in skipmissing, map etc, but it only hides the issue and waits for the next potential incompatible function to be discovered. For example multi-argument version of map does not throw an error and sometimes produces weird results:

julia> d1 = Dict(:x => 100, :y => 1000)
Dict{Symbol, Int64} with 2 entries:
  :y => 1000
  :x => 100

julia> map(isless, d1) # looks like we are safe here
ERROR: map is not defined on dictionaries

julia> d2 = Dict(:z => -100, :t => -1000)
Dict{Symbol, Int64} with 2 entries:
  :z => -100
  :t => -1000

julia> map(isless, d1, d2) # 3-argument version of map works with Dicts...???? no error
2-element Vector{Bool}:    # The result also looks quite weird, definitely not something I would expect
 1                        
 0

E.g. another example with filter might also be considered as unintuitive:

julia> d = Dict(:x => missing, :y => 2);

julia> filter(!ismissing, d)
Dict{Symbol, Union{Missing, Int64}} with 2 entries:
  :y => 2
  :x => missing

It has been noted, that changing the way Dict iterates is breaking and is not going to happen, but can it at least be something to consider for the Julia 2.0?

It's also interesting to note, that for the original example, NamedTuples, while being semantically equivalent in this case, are behaving different from Dicts:

julia> n = (y = 2, z = 0, x = 1)
(y = 2, z = 0, x = 1)

julia> argmax(n)
:y

julia> argmax(skipmissing(n))
:y

bvdmitri avatar Jan 23 '23 20:01 bvdmitri

Follow-up. Some head-scratching, experimentation, and troubleshooting eventually found the workaround for a Dict to be:

first(argmax(last, Iterators.filter(!ismissing ∘ last, d)))

For comparison, the solution using Dictionary from Dictionaries.jl runs about 5x faster and took zero troubleshooting:

argmax(skipmissing(d))

Maybe best to deprecate iterate(::Dict).

there is an inherent ambiguity ... you can have missing => foobar in a Dict

On second thought, considering the purpose of skipmissing and contexts in which it's likely to be used, this doesn't seem like a meaningful ambiguity.

uniment avatar Jan 26 '23 10:01 uniment

Maybe best to deprecate iterate(::Dict).

😕 Iterating dicts is extremely common so this doesn't feel completely thought through.

KristofferC avatar Jan 26 '23 12:01 KristofferC

@KristofferC That is understandable and perhaps indeed too breaking for Julia 1.x. But is it a big issue to force people to write:

for (key, value) in pairs(dict)
    ...
end

instead of

for (key, value) in dict
    ...
end

in Julia 2.x?

It simply 7 extra characters that solve all associated ambiguity issues with iterating over Dict. In fact the first version seems more readable. There is no "right" iterator for the dict, unless you explicitly use pairs, keys or values.

bvdmitri avatar Jan 26 '23 16:01 bvdmitri

That's why pairs exists, yes (and I think ultimately where the language is headed), but we nonetheless can't change the current behavior in any 1.x.

Seelengrab avatar Jan 26 '23 16:01 Seelengrab

Python does that (since it otherwise contradicted their in predicate, which only checks keys), but in my experience, it is pretty bad not to pick a useful default for iterate.

vtjnash avatar Jan 26 '23 18:01 vtjnash

best to deprecate iterate(::Dict).

Sorry, I should've qualified such a controversial statement with a broader roadmap. Deprecating iterate(d::Dict) in favor of iterate(pairs(d)) would open up a path for the iteration interface to eventually return values, which is more consistent with how NamedTuples and other indexable objects iterate.

Granted, this is more meaningful if Dict is ordered, but I've begun to think that's preferable for Julia's target userbases anyway.

uniment avatar Jan 26 '23 19:01 uniment