DataStructures.jl icon indicating copy to clipboard operation
DataStructures.jl copied to clipboard

[Tracking] MultiDict and SortedMultiDict API updates before v1.0

Open NHDaly opened this issue 4 years ago • 8 comments

I wanted to create this meta-issue to track all the API changes that we want to do for MultiDict, which are currently scattered around various other threads:

  • PR https://github.com/JuliaCollections/DataStructures.jl/pull/656 (MuttsDict constructors operate only on elements/pairs, not on the aggregated underlying view)
  • https://github.com/JuliaCollections/DataStructures.jl/issues/655 (Make iterate iterate over the pairs, not over the aggregated view)
    • WIP fix in https://github.com/JuliaCollections/DataStructures.jl/pull/676
  • https://github.com/JuliaCollections/DataStructures.jl/issues/654 (Add delete!(d, k, v) to delete by pair)
  • Exposition: https://github.com/JuliaCollections/DataStructures.jl/issues/522#issuecomment-675171633

And to use this thread to discuss other changes we might make.


Current suggested changes:

  • [ ] Make MultiDict API consistently operate over pairs, just like a normal Dict, except that multiple entries are allowed per key.
    • WIP fix in #676
  • [ ] Use a Set{T} instead of Vector{T} for the multi-values ([?] -- see discussion below)
    • WIP fix in #692

Open Questions:

  • [ ] Should a MultiDict be <: AbstractDict?
  • [ ] Should we include setindex!(md::MultiDict, k, v)? See https://github.com/JuliaCollections/DataStructures.jl/issues/666#issuecomment-716058978, below.

NHDaly avatar Aug 28 '20 18:08 NHDaly

Since you are trying to clean this up before the 1.0 release, you may also want to compare the public API for MultiDict to that of SortedMultiDict to look for incompatibilities or possible improvements to either.

StephenVavasis avatar Aug 28 '20 18:08 StephenVavasis

What do you think about using a Dict{K,Set{V}} instead of a Dict{K,Vector{V}} for storing the values?

Properties that I like:

  • This would make lookup by pair algorithmically optimal (prevents a linear scan)
    • For membership tests: (2=>3) in MultiDict(2=>3)
    • For delete by pair: delete!(md, 2, 3)
  • It fixes weird ordering problems with equality that we currently have:
    •  julia> MultiDict(1=>1, 1=>10) == MultiDict(1=>10, 1=>1)
       false
      

Properties that are maybe less good?:

  • This means you can't easily store duplicates of identical pairs, which you currently can:
    •  julia> MultiDict(1=>1, 1=>1)[1]
       2-element Array{Int64,1}:
        1
        1
      
    • But I don't know if this is something that we actually care about? For me, when I wanted to use MultiDicts, it's because of the ability to store multiple values per key, but I personally don't need the ability to store multiple identical values per key... Do others want this ability?
    • If we want to retain this ability, we could use a Dict{K, Set{Tuple{V, Int}}} to maintain a count for duplicate tuples, but then I don't really know what we should return if you index k -- would we return an iterator that duplicates the elements by their count?

After some quick checking, it looks like both Python's MultiDict and Golang's MultiMap support exact duplicates: https://pypi.org/project/multidict/

>>> d = multidict.CIMultiDict([("key", "one"), ("key2", "two"), ("key", 3)])
>>> d.items()
_ItemsView('key': 'one', 'key2': 'two', 'key': 3)

https://github.com/jwangsadinata/go-multimap

George: [Washington Bush Bush]

So I think we should probably stick with this here as well. However, for algorithmic reasons, I'm now thinking we should indeed then implement the backing Dict as a Dict{K, Set{Tuple{V, Int}}}.

This isn't exactly a breaking API change, except that if we do change to this, we may want to reconsider what we return from getindex(): returning maybe an iterator instead of a Vector (which users can of course always collect()).

NHDaly avatar Aug 28 '20 18:08 NHDaly

Since you are trying to clean this up before the 1.0 release, you may also want to compare the public API for MultiDict to that of SortedMultiDict to look for incompatibilities or possible improvements to either.

Oh, thanks, agreed! I hadn't actually even noticed that one. Yes, i think you're right.

NHDaly avatar Aug 28 '20 18:08 NHDaly

Oh, actually, to update what Golang does: they provide both a "setmultimap" package and a "slicemultimap" package, one of which backs their multimap with a Set, and the other with an array. Only the array-backed one supports duplicates:

  • https://github.com/jwangsadinata/go-multimap/blob/master/setmultimap/setmultimap.go
  • https://github.com/jwangsadinata/go-multimap/blob/master/slicemultimap/slicemultimap.go

It seems that a Set{Tuple{V,Int}} would get us the best of both worlds, but with slightly more bookkeeping overhead. I can see why they elected to offer both, but 🤷 i don't necessarily think we need to.

NHDaly avatar Aug 28 '20 18:08 NHDaly

@oxinabox here's another question:

  • [ ] Should a MultiDict be <: AbstractDict?

i can kind of go either way on this. I think I'm leaning towards "no," since they're different than a Dictionary in a few ways. But if I read the description here, I feel like it more less holds?:

help?> AbstractDict
search: AbstractDict AbstractDisplay

  AbstractDict{K, V}

  Supertype for dictionary-like types with keys of type K and values of type V. Dict, IdDict and other types are subtypes of this. An AbstractDict{K, V}
  should be an iterator of Pair{K, V}.

NHDaly avatar Sep 24 '20 03:09 NHDaly

After thinking more about this, and comparing implementations in other languages, I've come around to thinking it should be <: AbstractDict. I believe the MultiDict is a type of dictionary; it has almost exactly the same interface, and behavior. The only weird thing is that it holds multiple elements per key, so getindex returns a set instead of a single element.

push!, pop!, and insert! are all basically the same. One weird thing is delete!(), where delete!(MultiDict(1=>2, 1=>3), 1) deletes more than one element, while delete!(MultiDict(1=>2, 1=>3), 1, 2) deletes only one, which is something to watch out for.

The main difference is that right now we're not supporting setindex!(md, k, v), I think maybe because the semantics are a bit confusing? It seems maybe weird that if you write md[1] = 2, then you can turn around and call md[1] and you'll get back something that's not 2 (you'll get back a set containing at least 2). Instead you have to call insert!(md, 1, 2). So I do think it makes sense to not include this function, to call attention to this behavior, but the question is whether this is "still a dictionary". And I think the answer is yes.

I'll also added a checkbox question for whether or not to include setindex!() here:

  • [ ] Should we include setindex!(md::MultiDict, k, v)? See above.

NHDaly avatar Oct 24 '20 21:10 NHDaly

As a bit of an outsider to this conversion, but as a person who has used this library before, I think there's actually a legitimate case for the existence of 3 different types related to the current MultiDict. See me perhaps as less of an experienced developer and more of a naive end-user.

Imagine we have input data, perhaps from a csv, in the form of 2 columns, keys and values:

Case 1 - Set

We want to know all unique values that occur for each specific key. A MultiDict backed by a Set works well for this.

Case 2 - MultiSet or Vector

We want to know all values that occur for each specific key, including duplicates. Both MultiSets and Vectors are good here, depending on specifics.

Case 2a - MultiSet specific

After accomplishing the previous, we want to check if a value is present for a key. A MultiSet-backed MultiDict make sense here.

Case 2b - Vector specific

If we don't need to ever check if a single value is present for a key, and we instead expect to iterate across all values for a key, we might instead see better performance from a Vector-backed MultiDict.

Case 3 - Vector

We want to know all unique values that occur for each specific key while keeping them in the original order that they were inserted in. A Vector-backed MultiDict is the logical choice here.

Proposal

Create abstract type AbstractMultiDict <: AbstractDict end.

Some methods, like insert!(md, k, v) can be shared by AbstractMultiDict.

Begin with 3 different MultiDicts that are respectively:

  • struct VectorDict <: AbstractMultiDict
  • struct SetDict <: AbstractMultiDict
  • struct MultiSetDict <: AbstractMultiDict

It is conceivable that end users may want to make their own MultiDicts in the future using different backing data structures such as CuArrays or whatever. It may be possible to address this with a more generalized struct MultiDict{ktype, vtype, btype} <: AbstractMultiDict.

Additional opinion on setindex!

I am of the opinion that it wouldn't even be weird for setindex!(md::AbstractMultiDict, k, v) to replace the entirety of its contents for k with whatever v it's given, coerced into the proper type for md in question.

I'm willing to implement this proposal in a pull request if anyone else thinks it's worth exploring.

quildtide avatar Nov 05 '20 03:11 quildtide

After a bit of experimentation and thinking about this overnight, I've come to this personal conclusion:

Scope of the MultiDict's Purpose

The main purpose of a MultiDict is to abstract over certain very common code patterns, such as:

md::Dict{K,C{V}}

for (key, value) in data
    if key in md
        insert!(md[key], value)
    else
        md[key] = C([value])
    end
end

In this case, it is useful to handle this common, but ugly, chunk of code with a more convenient function, such as insert!(md::MultiDict{K, V}, key::K, value::V).

Out of Scope

However, a good Julian library should not make assumptions about why the user is using the MultiDict past this common interface.

There are many use cases for these extremely common chunks of code that a MultiDict can provide a convenient interface over, but these different use cases benefit from different internal containers.

Some use cases favor the MultiSet; some favor the Vector; some favor the Set. Furthermore, an end-user might want a MultiDict using a Stack, Queue, or DiBitVector from this library, or they might want something from a different library such as a CuArray.

It's worth noting, however, that all of these possible backends, assuming proper behavior, support some sort of insert!, pop!, and in functions. The performance and exact mechanisms of these function methods is irrelevant to the core MultiDict implementation. If there are some common cases where optimizations are possible, multiple dispatch can be utilized.

Proposal

I believe that the MultiDict should leverage Multiple Dispatch to allow for maximal flexibility in the MultiDict and not assume what backing collection type the end user wants.

I.e. Instead of struct MultiDict{K,V} d::Dict{K,Vector{V}} end or struct MultiDict{K,V} d::Dict{K,MultiSet{V}} end, I propose:

struct MultiDict{K,C} 
   d::Dict{K,C} 
end

It has also come to my attention that C, the container type, might not even be a parametric type, in a really bizarre use-case.

The parametric nature of C is therefore not important to the function of MultiDict.

quildtide avatar Nov 05 '20 19:11 quildtide