julia
julia copied to clipboard
skipmissing on Dict: inconsistent results
I noticed a pretty weird and very hard to catch inconsistency:
julia> d = Dict(:x => 1, :y => 2, :z => 0)
julia> argmax(d)
:y
julia> argmax(skipmissing(d))
:z
No matter how one interprets dicts iteration and skipmissing semantics, this result is totally unexpected.
The return type is correct, so this may lead to very nasty bugs in user code.
Is that a bug in Julia? Something should probably throw an error here, if dicts + skipmissing are not supported...
Some discussion on slack: https://julialang.slack.com/archives/C67910KEH/p1674391331249859.
The underlying issue, if there is one, is that skipmissing just iterates a Dict, which gives us pairs of key => value. argmax then maximizes that pair and since the first element of the pair is the key, this ends up maximizing the key (always - the key is semantically unique due to the Dict after all). "Fixing" this behavior is tricky, because it either means
- changing the way dicts iterate (that's breaking, so we can't do that)
- change how
skipmissingiterates when wrapping aDict(possible, but very ugly, since it meansskipmissingneeds to be specialized for everything that doesn't just produce values when iterating). This also means making a decision about whatskipmissingon aDictshould mean - should it skip entries of the formmissing => foobar?
In an ideal world, iterating over a dict would require one of keys, values or pairs (which I'm also not entirely sure would fix this particular thing, but would prevent argmax(d) from working), but since we don't live in that world, another option is to disallow skipmissing(dict) entirely and suggest the use of skipmissing(keys(dict)), skipmissing(values(dict)), skipmissing(pairs(dict)). Additionally, this can be worked around by being explicit with what argmax should check, using argmax(last, dict) or argmax(first, dict) (this is consistent with skipmissing).
"Fixing" this behavior is tricky
Don't think it's really tricky: as I suggest in the first post, it would be great if something in argmax(skipmissing(d)) simply threw an error when passed a dict. The cleanest place to error seems skipmissing(::AbstractDict) = error(...), right?
Yes, that's more or less what I suggested in the last paragraph, giving a (hopefully helpful) error message instead. That may result in people complaining though, since it is a breaking change to something that used to work (albeit while doing the "wrong" thing..). There were situations in the past where a new error was introduced for things that were likely a mistake, but depending on what other people think, changing this is maybe too breaking :shrug: Either way, the workarounds to get the desired behavior exist and are mentioned above.
skipmissing(::AbstractDict) = error(...), right?
It may be better to deprecate instead of throwing an error.
How do you deprecate, if not by throwing an error? Unless you mean, throw a warning instead for a while? I'd argue if you're calling skipmissing on a Dict then your code is already broken.
I wonder if a method with signature skipmissing(f, c) could be useful, e.g. skipmissing(last, d) (or if key(p::Pair)=first(p); value(p::Pair)=last(p), then skipmissing(value, d))? skipmissing(identity, c) could be considered a fallback for skipmissing(c).
As an aside, argmax(f, c) has behavior which in this context is mildly counterintuitive; whereas argmax(d) returns a key, argmax(last, d) returns a Pair. I also don't see how it would solve the issue here.
My instinct was to try this, but [as we've seen many times] it doesn't work:
julia> findmax(last, (k=>v for (k,v) in d if !ismissing(v)))[2]
ERROR: MethodError: no method matching keys(::Base.Iterators.Filter{var"#8#10", Dict{Symbol, Int64}})
This works, but evaluation may be a little too eager for your tastes:
argmax(filter(!ismissing ∘ value, d))
This also feels like a problem for which transducers are probably appropriate, but this throws an error:
using Transducers
d |> Filter(!ismissing ∘ last) |> argmax
As an aside, I prefer Rich Hickey's convention of calling such transducers filtering rather than Filter, and mapping rather than map.
I was thinking about the warning using @deprecate for a while. It generates only warning and the code still runs, as in
https://github.com/JuliaLang/julia/blob/670190c3393e7a2193ad9e40cfa12ec643ceba75/base/deprecated.jl#L215
Thanks, TIL 😇
It may be better to deprecate instead of throwing an error.
Yes, a deprecation may be in order.
I also don't see how it would solve the issue here.
It solves the issue because its result is consistent with wrapping the dict in skipmissing. That it then returns a pair again is once again due to Dict iterating over pairs, which is exposed now instead of being silently hidden (& giving wrong results!) behind the iterating behavior of skipmissing over a Dict. I agree that this is in principle weird, but again, it's nothing that we can change now. Suggesting first(argmax(last, dict))) as a replacement seems ok to me, as a workaround to get the correct behavior.
The intent is to exclude keys whose corresponding values are missing, no?
julia> d = Dict(:x => 1, :y => missing, :z => 2);
julia> first(argmax(last, d))
:y
In this particular case, yes. Again, there is an inherent ambiguity due to how skipmissing, the iteration interface and Dict play with each other - after all, you can have missing => foobar in a Dict.
For sure. I don't understand how it's a suitable workaround to achieve the objectives of the OP though; it seems like first(argmax(last, d)) here is just a longer way to say argmax(d).
The map function already explicitly throws an error while attempting to use Dict without calling pairs. I find it very natural and it does not really affect the code much (one extra call to pairs or keys or values). However, this only shows that the current iterating behaviour of Dicts (while being useful in some cases) a potential foot gun for many other functions that do not expect Dict as an input but still work and silently produce wrong/unexpected answers.
It is possible of course to throw an error in skipmissing, map etc, but it only hides the issue and waits for the next potential incompatible function to be discovered. For example multi-argument version of map does not throw an error and sometimes produces weird results:
julia> d1 = Dict(:x => 100, :y => 1000)
Dict{Symbol, Int64} with 2 entries:
:y => 1000
:x => 100
julia> map(isless, d1) # looks like we are safe here
ERROR: map is not defined on dictionaries
julia> d2 = Dict(:z => -100, :t => -1000)
Dict{Symbol, Int64} with 2 entries:
:z => -100
:t => -1000
julia> map(isless, d1, d2) # 3-argument version of map works with Dicts...???? no error
2-element Vector{Bool}: # The result also looks quite weird, definitely not something I would expect
1
0
E.g. another example with filter might also be considered as unintuitive:
julia> d = Dict(:x => missing, :y => 2);
julia> filter(!ismissing, d)
Dict{Symbol, Union{Missing, Int64}} with 2 entries:
:y => 2
:x => missing
It has been noted, that changing the way Dict iterates is breaking and is not going to happen, but can it at least be something to consider for the Julia 2.0?
It's also interesting to note, that for the original example, NamedTuples, while being semantically equivalent in this case, are behaving different from Dicts:
julia> n = (y = 2, z = 0, x = 1)
(y = 2, z = 0, x = 1)
julia> argmax(n)
:y
julia> argmax(skipmissing(n))
:y
Follow-up. Some head-scratching, experimentation, and troubleshooting eventually found the workaround for a Dict to be:
first(argmax(last, Iterators.filter(!ismissing ∘ last, d)))
For comparison, the solution using Dictionary from Dictionaries.jl runs about 5x faster and took zero troubleshooting:
argmax(skipmissing(d))
Maybe best to deprecate iterate(::Dict).
there is an inherent ambiguity ... you can have
missing => foobarin aDict
On second thought, considering the purpose of skipmissing and contexts in which it's likely to be used, this doesn't seem like a meaningful ambiguity.
Maybe best to deprecate iterate(::Dict).
😕 Iterating dicts is extremely common so this doesn't feel completely thought through.
@KristofferC
That is understandable and perhaps indeed too breaking for Julia 1.x.
But is it a big issue to force people to write:
for (key, value) in pairs(dict)
...
end
instead of
for (key, value) in dict
...
end
in Julia 2.x?
It simply 7 extra characters that solve all associated ambiguity issues with iterating over Dict. In fact the first version seems more readable. There is no "right" iterator for the dict, unless you explicitly use pairs, keys or values.
That's why pairs exists, yes (and I think ultimately where the language is headed), but we nonetheless can't change the current behavior in any 1.x.
Python does that (since it otherwise contradicted their in predicate, which only checks keys), but in my experience, it is pretty bad not to pick a useful default for iterate.
best to deprecate
iterate(::Dict).
Sorry, I should've qualified such a controversial statement with a broader roadmap. Deprecating iterate(d::Dict) in favor of iterate(pairs(d)) would open up a path for the iteration interface to eventually return values, which is more consistent with how NamedTuples and other indexable objects iterate.
Granted, this is more meaningful if Dict is ordered, but I've begun to think that's preferable for Julia's target userbases anyway.