DataStructures.jl
DataStructures.jl copied to clipboard
Add `MultiSet{T}()`, built on `Accumulator{T,Int}()`.
A MultiSet{T}
is very much like an Accumulator{T,Int}
, but it
exposes a single-value-based interface. The fact that it is storing
counts under the hood is completely hidden from the interface.
For example, iterating the MultiSet returns each item in the MultiSet, repeated per its count, as if the multiset actually contained duplicate items:
julia> show(collect(MultiSet([1,1,1,2,2])))
[2, 2, 1, 1, 1]
whereas the accumulator presents its count-based representation to the user:
julia> show(collect(MultiSet([1,1,1,2,2]).data))
[2 => 2, 1 => 3]
For more info on MultiSets, see: https://en.wikipedia.org/wiki/Multiset
I've opened this PR as part of the MultiDict
rewrite. In order to get the iteration semantics correct for the MultiDict, I essentially had to implement iteration of a MultiSet, and the rest of the functions were simple interface additions to make this real. So I figured it's nice enough to put this together since it goes together with the MultiDict, i think! :)
Should we remove some of the multiset operators from the accumulator?
(Not reviewed yet)
I'm not sure. :) I know you said that Accumulators "basically are" multisets, which is true, and i'm happy to keep it that way if people want to use them for that purpose? But I think as a user-facing API the MultiSet makes more sense since it only exposes the elements, which is why I opened this PR.
BUT I guess i'm not sure if that means we should delete the multiset parts of the Accumulator. I'm supportive of deleting them, yeah, if you think it's cleaner to just present this as the main "multi-set" datastructure we support? Deleting code always makes everyone happy, I think! :)
But to be clear, i don't want to like step on other peoples' toes if that other stuff is useful too!
Sorry for slooow reply on all things.
BUT I guess i'm not sure if that means we should delete the multiset parts of the Accumulator. I'm supportive of deleting them, yeah, if you think it's cleaner to just present this as the main "multi-set" datastructure we support? Deleting code always makes everyone happy, I think! :)
Yes, I think if we did this we should remove all multiset behavour from Accumulator. Question is do we want to do this?
So main difference is that MultiSet
would not be a AbstractDict
yes?
Would it still allow fractional membership?
I guess not, though i don't know how useful that is.
Sorry for slooow reply on all things.
Hehe no worries. sorry for my slow coding here in the first place.
What does fractional membership mean? I don't see it in the docs: https://juliacollections.github.io/DataStructures.jl/latest/accumulators/
Ah, i guess it's referring to this from the docs?:
If the count type is not an integer but a more general real number, then this is a form of fuzzy multiset.
Mmm, then yes those are not supported. You could still use an Accumulator for that, i think.
So main difference is that MultiSet would not be a AbstractDict yes?
Yeah. The MultiSet looks and behaves like an AbstractSet. It has insert/delete, membership checks (in
), and the set functions setdiff/union/intersect. And iteration is elementwise, instead of via the dictionary interface (10,10,20
instead of 10=>2, 20=>1
).
Now that i've read that aloud, it probably should also have some kind of membership count? Something that would be a more efficient version of count(==(20), MultiSet([10,10,20]))
that wouldn't have to iterate every element.
Although it looks like the Go MultiSet doesn't have that, so maybe it's fine:
https://godoc.org/github.com/soniakeys/multiset
Python's MultiSet manages this by providing both a Set-like and a Dict-like interface, so that you can get the count via ms[key]
, but still iterate all the elements without their counts like I did, here. That seems a bit awkward, IMO, but maybe we could do something similar. If we did do something similar, I would probably opt to have ms[key]
return [key]*count
? Then users could always do length(ms[key])
to get the counts? 🤷 I'm not sure though. I see why GoLang decided to just leave this off. 😅
As a potential user, a way to get the count for a given item would be useful to me. Offering only [key]*count
would be inefficient. I think multiset[element]
returning the count would be fine, but a dedicated function (for example getcount(multiset, element)
) may be a good alternative, too.
A potential problem for all choices is what to do if an element is missing: Raising an error or returning zero/an empty vector? It is probably good to offer both.
I am also interested in random sampling from a multiset (issue #706). For that purpose, it would be useful to get the total size of the set more quickly than sum(values(multiset.data))
. It can be achieved in constant time by adding a .count
member to the struct, at the cost of updating it whenever the data changes.
Finally, I noticed that the following TODO is fixed by using reset!
: https://github.com/JuliaCollections/DataStructures.jl/pull/691/files#diff-a2e2216e3bd035013d330caf218a04b6760a379d6f0dd6df02be22bee2a131c6R72-R73