DataStructures.jl icon indicating copy to clipboard operation
DataStructures.jl copied to clipboard

(Feature suggestion) SortedMultiSet

Open plut opened this issue 3 years ago • 1 comments

I think that this package is missing a SortedMultiSet type. As an example, let's say we want to collect a list of persons, sorted by age, as stored in an ages dictionary:

ages = Dict(:bob => 1, :sam => 2, :joe => 1)

and perform efficient insertions/deletions and searching on this list. My first reflex is using a SortedSet for this; however: SortedSet(keys(ages), Base.Order.By(x->ages[x])) merges the keys :bob: and :joe. (*)

There are two ways to work around this limitation right now:

  1. SortedSet((age, person) for (person, age) in ages);
  2. SortedMultiDict(age => person for (person, age) in ages).

However, both of them are somewhat memory-inefficient (they store information equivalent to a full copy of the ages database, as well as quite cumbersome to use (neither implementation allows a simple delete!(set, :bob) call). The second inconvenient may be fixed by writing a simple wrapper over either type (possibly the SortedMultiDict, which is somewhat more efficient), but the first one cannot be fixed except by writing a new structure.

(Also, keep in mind that the example I gave here is of course much simplified. In particular, the comparison values are numbers, and thus quite compact in memory. There could probably be some use cases where the comparison keys are big data structures, hence the fact that we don't want to copy all of them).

(*) in my view this is simply a bad design choice in the definition of SortedSet, because only equal values should be merged, i.e. a set should be allowed to be sorted according to a preorder, but it is likely much too late to change this.

plut avatar Dec 17 '21 12:12 plut

If I understand correctly, you are looking for a data structure with two fields that is searchable on either field. Your problem could be solved with a SortedMultiDict that maps age to name, and then a plain Dict or SortedDict that maps name to age, as well as an API that keeps the SortedMultiDict and the Dict consistent. (Actually, the Dict would need to map a name to an (age,semitoken) tuple so that the corresponding record in the SortedMultiDict could be found quickly.) The issue with putting a structure like this in the DataStructures package is that there are many variants of this set-up, and it's not clear to me how to accommodate all the variants while keeping the amount of code and documentation in the package manageable. Do you know of a canonical example of a doubly-indexed data structure in another programming language?

StephenVavasis avatar Dec 17 '21 14:12 StephenVavasis