DataStructures.jl
DataStructures.jl copied to clipboard
[Tracking] MultiDict and SortedMultiDict API updates before v1.0
I wanted to create this meta-issue to track all the API changes that we want to do for MultiDict
, which are currently scattered around various other threads:
- PR https://github.com/JuliaCollections/DataStructures.jl/pull/656 (MuttsDict constructors operate only on elements/pairs, not on the aggregated underlying view)
- https://github.com/JuliaCollections/DataStructures.jl/issues/655 (Make
iterate
iterate over the pairs, not over the aggregated view)- WIP fix in https://github.com/JuliaCollections/DataStructures.jl/pull/676
- https://github.com/JuliaCollections/DataStructures.jl/issues/654 (Add
delete!(d, k, v)
to delete by pair) - Exposition: https://github.com/JuliaCollections/DataStructures.jl/issues/522#issuecomment-675171633
And to use this thread to discuss other changes we might make.
Current suggested changes:
- [ ] Make MultiDict API consistently operate over pairs, just like a normal Dict, except that multiple entries are allowed per key.
- WIP fix in #676
- [ ] Use a
Set{T}
instead ofVector{T}
for the multi-values ([?] -- see discussion below)- WIP fix in #692
Open Questions:
- [ ] Should a
MultiDict
be<: AbstractDict
? - [ ] Should we include
setindex!(md::MultiDict, k, v)
? See https://github.com/JuliaCollections/DataStructures.jl/issues/666#issuecomment-716058978, below.
Since you are trying to clean this up before the 1.0 release, you may also want to compare the public API for MultiDict to that of SortedMultiDict to look for incompatibilities or possible improvements to either.
What do you think about using a Dict{K,Set{V}}
instead of a Dict{K,Vector{V}}
for storing the values?
Properties that I like:
- This would make lookup by pair algorithmically optimal (prevents a linear scan)
- For membership tests:
(2=>3) in MultiDict(2=>3)
- For delete by pair:
delete!(md, 2, 3)
- For membership tests:
- It fixes weird ordering problems with equality that we currently have:
-
julia> MultiDict(1=>1, 1=>10) == MultiDict(1=>10, 1=>1) false
-
Properties that are maybe less good?:
- This means you can't easily store duplicates of identical pairs, which you currently can:
-
julia> MultiDict(1=>1, 1=>1)[1] 2-element Array{Int64,1}: 1 1
- But I don't know if this is something that we actually care about? For me, when I wanted to use MultiDicts, it's because of the ability to store multiple values per key, but I personally don't need the ability to store multiple identical values per key... Do others want this ability?
- If we want to retain this ability, we could use a
Dict{K, Set{Tuple{V, Int}}}
to maintain a count for duplicate tuples, but then I don't really know what we should return if you indexk
-- would we return an iterator that duplicates the elements by their count?
-
After some quick checking, it looks like both Python's MultiDict and Golang's MultiMap support exact duplicates: https://pypi.org/project/multidict/
>>> d = multidict.CIMultiDict([("key", "one"), ("key2", "two"), ("key", 3)])
>>> d.items()
_ItemsView('key': 'one', 'key2': 'two', 'key': 3)
https://github.com/jwangsadinata/go-multimap
George: [Washington Bush Bush]
So I think we should probably stick with this here as well.
However, for algorithmic reasons, I'm now thinking we should indeed then implement the backing Dict as a Dict{K, Set{Tuple{V, Int}}}
.
This isn't exactly a breaking API change, except that if we do change to this, we may want to reconsider what we return from getindex()
: returning maybe an iterator instead of a Vector (which users can of course always collect()
).
Since you are trying to clean this up before the 1.0 release, you may also want to compare the public API for MultiDict to that of SortedMultiDict to look for incompatibilities or possible improvements to either.
Oh, thanks, agreed! I hadn't actually even noticed that one. Yes, i think you're right.
Oh, actually, to update what Golang does: they provide both a "setmultimap" package and a "slicemultimap" package, one of which backs their multimap with a Set, and the other with an array. Only the array-backed one supports duplicates:
- https://github.com/jwangsadinata/go-multimap/blob/master/setmultimap/setmultimap.go
- https://github.com/jwangsadinata/go-multimap/blob/master/slicemultimap/slicemultimap.go
It seems that a Set{Tuple{V,Int}}
would get us the best of both worlds, but with slightly more bookkeeping overhead. I can see why they elected to offer both, but 🤷 i don't necessarily think we need to.
@oxinabox here's another question:
- [ ] Should a
MultiDict
be<: AbstractDict
?
i can kind of go either way on this. I think I'm leaning towards "no," since they're different than a Dictionary in a few ways. But if I read the description here, I feel like it more less holds?:
help?> AbstractDict
search: AbstractDict AbstractDisplay
AbstractDict{K, V}
Supertype for dictionary-like types with keys of type K and values of type V. Dict, IdDict and other types are subtypes of this. An AbstractDict{K, V}
should be an iterator of Pair{K, V}.
After thinking more about this, and comparing implementations in other languages, I've come around to thinking it should be <: AbstractDict
. I believe the MultiDict
is a type of dictionary; it has almost exactly the same interface, and behavior. The only weird thing is that it holds multiple elements per key, so getindex returns a set instead of a single element.
push!, pop!, and insert! are all basically the same. One weird thing is delete!()
, where delete!(MultiDict(1=>2, 1=>3), 1)
deletes more than one element, while delete!(MultiDict(1=>2, 1=>3), 1, 2)
deletes only one, which is something to watch out for.
The main difference is that right now we're not supporting setindex!(md, k, v)
, I think maybe because the semantics are a bit confusing? It seems maybe weird that if you write md[1] = 2
, then you can turn around and call md[1]
and you'll get back something that's not 2 (you'll get back a set containing at least 2). Instead you have to call insert!(md, 1, 2)
. So I do think it makes sense to not include this function, to call attention to this behavior, but the question is whether this is "still a dictionary". And I think the answer is yes.
I'll also added a checkbox question for whether or not to include setindex!()
here:
- [ ] Should we include
setindex!(md::MultiDict, k, v)
? See above.
As a bit of an outsider to this conversion, but as a person who has used this library before, I think there's actually a legitimate case for the existence of 3 different types related to the current MultiDict
. See me perhaps as less of an experienced developer and more of a naive end-user.
Imagine we have input data, perhaps from a csv, in the form of 2 columns, keys
and values
:
Case 1 - Set
We want to know all unique values
that occur for each specific key
. A MultiDict backed by a Set
works well for this.
Case 2 - MultiSet or Vector
We want to know all values
that occur for each specific key
, including duplicates. Both MultiSet
s and Vector
s are good here, depending on specifics.
Case 2a - MultiSet specific
After accomplishing the previous, we want to check if a value
is present for a key
. A MultiSet
-backed MultiDict make sense here.
Case 2b - Vector specific
If we don't need to ever check if a single value
is present for a key
, and we instead expect to iterate across all values
for a key
, we might instead see better performance from a Vector
-backed MultiDict.
Case 3 - Vector
We want to know all unique values
that occur for each specific key
while keeping them in the original order that they were inserted in. A Vector
-backed MultiDict is the logical choice here.
Proposal
Create abstract type AbstractMultiDict <: AbstractDict end
.
Some methods, like insert!(md, k, v)
can be shared by AbstractMultiDict
.
Begin with 3 different MultiDicts that are respectively:
-
struct VectorDict <: AbstractMultiDict
-
struct SetDict <: AbstractMultiDict
-
struct MultiSetDict <: AbstractMultiDict
It is conceivable that end users may want to make their own MultiDicts in the future using different backing data structures such as CuArrays
or whatever. It may be possible to address this with a more generalized struct MultiDict{ktype, vtype, btype} <: AbstractMultiDict
.
Additional opinion on setindex!
I am of the opinion that it wouldn't even be weird for setindex!(md::AbstractMultiDict, k, v)
to replace the entirety of its contents for k
with whatever v
it's given, coerced into the proper type for md
in question.
I'm willing to implement this proposal in a pull request if anyone else thinks it's worth exploring.
After a bit of experimentation and thinking about this overnight, I've come to this personal conclusion:
Scope of the MultiDict's Purpose
The main purpose of a MultiDict is to abstract over certain very common code patterns, such as:
md::Dict{K,C{V}}
for (key, value) in data
if key in md
insert!(md[key], value)
else
md[key] = C([value])
end
end
In this case, it is useful to handle this common, but ugly, chunk of code with a more convenient function, such as insert!(md::MultiDict{K, V}, key::K, value::V)
.
Out of Scope
However, a good Julian library should not make assumptions about why the user is using the MultiDict past this common interface.
There are many use cases for these extremely common chunks of code that a MultiDict can provide a convenient interface over, but these different use cases benefit from different internal containers.
Some use cases favor the MultiSet
; some favor the Vector
; some favor the Set
. Furthermore, an end-user might want a MultiDict using a Stack
, Queue
, or DiBitVector
from this library, or they might want something from a different library such as a CuArray
.
It's worth noting, however, that all of these possible backends, assuming proper behavior, support some sort of insert!
, pop!
, and in
functions. The performance and exact mechanisms of these function methods is irrelevant to the core MultiDict implementation. If there are some common cases where optimizations are possible, multiple dispatch can be utilized.
Proposal
I believe that the MultiDict
should leverage Multiple Dispatch to allow for maximal flexibility in the MultiDict
and not assume what backing collection type the end user wants.
I.e. Instead of struct MultiDict{K,V} d::Dict{K,Vector{V}} end
or struct MultiDict{K,V} d::Dict{K,MultiSet{V}} end
, I propose:
struct MultiDict{K,C}
d::Dict{K,C}
end
It has also come to my attention that C
, the container type, might not even be a parametric type, in a really bizarre use-case.
The parametric nature of C
is therefore not important to the function of MultiDict
.