
Sparse set

Open ejmarchant opened this issue 2 years ago • 11 comments

Description

Adds a sparse set data structure as suggested in issue #24

Detailed Design

A sparse set is a data structure which represents a set of non-negative integers. Operations such as member, add-member and delete-member are O(1), but this comes at the cost of increased storage. See: An efficient representation for sparse sets (Briggs, Torczon, 1993).
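To illustrate the technique from the paper, here is a minimal sketch (not the proposed API; names are illustrative): `sparse` maps each key to a position in `dense`, and membership is validated by a round-trip check, so `sparse` never needs to be zeroed for correctness.

```swift
struct SparseSetSketch {
  private var sparse: [Int]      // indexed by key; holds positions into `dense`
  private var dense: [Int] = []  // the members, in insertion order

  init(universeSize: Int) {
    // The sparse buffer's contents don't matter for correctness; only
    // the round-trip check below does. This is why O(1) initialization
    // is possible when the buffer can be left uninitialized.
    sparse = Array(repeating: 0, count: universeSize)
  }

  func contains(_ key: Int) -> Bool {
    let slot = sparse[key]
    return slot < dense.count && dense[slot] == key
  }

  mutating func insert(_ key: Int) {
    guard !contains(key) else { return }
    sparse[key] = dense.count
    dense.append(key)
  }
}
```

Both `contains` and `insert` touch a constant number of array slots, which is where the O(1) bounds come from.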

Existing implementations:

There seem to be two implementation approaches:

  1. SparseSet<T> where T is an (unsigned) integer type. A set-like interface where just integers are stored.
  2. SparseSet<Key, Value> where Key is an (unsigned) integer type. A dictionary-like interface which can also hold associated values.

This is a dictionary-like implementation that closely follows the interface of OrderedDictionary. The main public interface (see the source for symbol documentation):

public struct SparseSet<Key, Value> where Key : FixedWidthInteger, Key.Stride == Int {
  public var keys: ContiguousArray<Key> { get }
  public var values: Values { get set }

  public var universeSize: Int { get }
  public var isEmpty: Bool { get }
  public var count: Int { get }
  public func index(forKey key: Key) -> Int?
  public subscript(offset offset: Int) -> Element { get }

  public mutating func updateValue(_ value: Value, forKey key: Key) -> Value?
  public mutating func modifyValue<R>(forKey key: Key, default defaultValue: @autoclosure () -> Value, _ body: (inout Value) throws -> R) rethrows -> R
  public mutating func removeValue(forKey key: Key) -> Value?

  public subscript(key: Key) -> Value? { get set }
  public subscript(key: Key, default defaultValue: @autoclosure () -> Value) -> Value { get set }

  public mutating func merge<S>(_ keysAndValues: S, uniquingKeysWith combine: (Value, Value) throws -> Value) rethrows where S : Sequence, S.Element == (key: Key, value: Value)
  ...

  public func filter(_ isIncluded: (Element) throws -> Bool) rethrows -> SparseSet<Key, Value>
  public func mapValues<T>(_ transform: (Value) throws -> T) rethrows -> SparseSet<Key, T>
  public func compactMapValues<T>(_ transform: (Value) throws -> T?) rethrows -> SparseSet<Key, T>
}

The keys and values views, and the multiple subscripts for keys and indices, mirror how OrderedDictionary is implemented.

Compared to OrderedDictionary:

Pros:

  • Better performance for deletions, lookups, insertions (see below).

Cons:

  • More storage required. The storage requirements are O(u) where u is the largest integer key we will be using, plus O(n) where n is the number of elements in the set. (Roughly u * MemoryLayout<Key>.stride for the sparse storage plus n * (MemoryLayout<Key>.stride + MemoryLayout<Value>.stride) for the dense storage.) Typically key types would be: UInt8, UInt16, UInt32.
  • Less choice of key types and key values.
  • Although a sparse set is ordered and can be sorted, deletions will not preserve the ordering of elements.
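The last point is a consequence of how the dense storage is compacted. A focused sketch of the usual approach (illustrative, not the proposed API): the gap left by the removed element is filled with the last element, so one O(1) swap replaces an O(n) shift.

```swift
// Remove the member stored at `slot` in `dense`, keeping `sparse`
// consistent. `dense` holds the members; `sparse` maps key -> slot.
func swapRemove(at slot: Int, dense: inout [Int], sparse: inout [Int]) {
  let last = dense.removeLast()
  if slot < dense.count {
    dense[slot] = last   // the final element moves into the gap...
    sparse[last] = slot  // ...so insertion order is not preserved
  }
}
```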

Testing

Tests are included for most code paths.

Performance

Benchmarks are included (SparseSetBenchmarks), but I haven't updated Library.json to include any. Here are some samples I ran:

Lookups: 04a subscript lookups

Insertions: 04c subscript insert-append

Deletions: 04d subscript remove

Checklist

  • [x] I've read the Contribution Guidelines
  • [x] My contributions are licensed under the Swift license.
  • [x] I've followed the coding style of the rest of the project.
  • [ ] I've added tests covering all new code paths my change adds to the project (if appropriate).
  • [x] I've added benchmarks covering new functionality (if appropriate).
  • [x] I've verified that my change does not break any existing tests or introduce unexplained benchmark regressions.
  • [ ] I've updated the documentation if necessary.

ejmarchant avatar Aug 13 '21 18:08 ejmarchant

Adding a separate set implementation (akin to OrderedSet instead of OrderedDictionary) would be fairly straightforward, but we'd need a naming scheme. E.g. we could use SparseSet<T> for the set-like implementation and SparseDictionary<Key, Value> for the dictionary-like one, but I'm not sure about SparseDictionary as a name, since I can only find SparseSet in the literature and in existing implementations. Also, LLVM has a version of a sparse set where multiple values can be stored per key, which it calls SparseMultiSet; we'd need to keep that in mind for naming if we want to implement something similar later.

ejmarchant avatar Aug 13 '21 18:08 ejmarchant

@swift-ci test

lorentey avatar Aug 18 '21 01:08 lorentey

(Note: I have a weird personal hangup about the "sparse" qualifier -- it misleads me into thinking the representation might be space efficient, like is usually the case with, say, sparse matrices. But sparse sets are the opposite of space efficient! 😝)

You're not the only one, I can say; I was actually surprised that it was not! :-)

Google's hash implementation "sparsehash" was indeed space efficient (and fast), while "densehash" was faster (and space inefficient).

hassila avatar Aug 18 '21 05:08 hassila

Hello, I'm the original author of LLVM's SparseMultiSet and I've always wanted this functionality presented in Swift, but there's some design challenges I'm sure you're facing. I haven't thought about this in years and some of my knowledge might be out of date, but hopefully I can help.

LLVM's implementation allows specification of a functor to map keys to the natural numbers, which IIRC was actually a template parameter. Have you thought about how to provide this kind of customization?

LLVM's implementation allows specification of the bit-width of the sparse array members, using striding to access elements from the dense vector. E.g. UInt8 can still refer to more than 256 dense vector members by having strided access checks (keys are stored alongside values in the dense vector). While "technically linear", this is super useful at limiting the size impact when you know that common instances will be smallish. And, realistically, n / 2^16 behaves more like a constant factor than a linear factor in practice.
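A sketch of the strided-lookup idea described above (an assumption modeled on LLVM's SparseSet, not actual swift-collections code): a UInt8 sparse entry can only store a dense index modulo 256, so a lookup scans the candidate slots at that residue and checks the key stored alongside each dense element.

```swift
// `sparseEntry` is the truncated dense index stored in the sparse array;
// `denseKeys` holds the key of each dense element. Returns the dense
// index of `key`, or nil if `key` is not a member.
func findDenseIndex(sparseEntry: UInt8, key: Int, denseKeys: [Int]) -> Int? {
  var i = Int(sparseEntry)
  while i < denseKeys.count {
    if denseKeys[i] == key { return i }
    i += 256  // stride by the sparse entry's representable range
  }
  return nil
}
```

The scan is "technically linear" in n / 256, but as noted, for smallish instances it behaves like a constant factor in practice.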

For SparseMultiSet, I went ahead and made the dense vector a kind of custom allocator, since it naturally tended towards building up an in-line free list. But I haven't thought about how that would play out for Swift, which is pickier about uninitialized-yet-stored values. What were your thoughts here?

What protocols should we be conforming to? Collection or SetAlgebra? Something more fundamental?

IIRC, deletions violating insertion order was a minor (and useful) optimization in LLVM's implementation. But there are many potential ways of performing deletions. Have you thought about preserving insertion order?

milseman avatar Aug 18 '21 17:08 milseman

Hello, I'm the original author of LLVM's SparseMultiSet and I've always wanted this functionality presented in Swift, but there's some design challenges I'm sure you're facing. I haven't thought about this in years and some of my knowledge might be out of date, but hopefully I can help.

Thanks for the insights, lots to think about for sure. My C++ knowledge is fairly basic so I don't understand too much of the details of the LLVM implementations of SparseSet and SparseMultiSet - please correct me if I've misunderstood your points.

LLVM's implementation allows specification of a functor to map keys to the natural numbers, which IIRC was actually a template parameter. Have you thought about how to provide this kind of customization?

We could constrain the key type (for a dictionary-like implementation) or the element (for a set-like implementation) with protocols. E.g.

protocol IntConvertible: Equatable {
    var intValue: Int { get }
    init(_ value: Int)
}

protocol IndexIdentifiable {
    associatedtype IndexType: IntConvertible
    var indexId: IndexType { get }
}

struct SparseSet<Element: IndexIdentifiable> {}

struct SparseDictionary<Key: IndexIdentifiable, Value> {}

or use the existing Identifiable protocol if possible as @lorentey mentioned above. Default implementations can be provided for UInt32 etc.

LLVM's implementation allows specification of the bit-width of the sparse array members, using striding to access elements from the dense vector. E.g. UInt8 can still refer to more than 256 dense vector members by having strided access checks (keys are stored alongside values in the dense vector). While "technically linear", this is super useful at limiting the size impact when you know that common instances will be smallish. And, realistically, n / 2^16 behaves more like a constant factor than a linear factor in practice.

This sounds like a good thing to add.

For SparseMultiSet, I went ahead and made the dense vector a kind of custom allocator, since it naturally tended towards building up an in-line free list. But I haven't thought about how that would play out for Swift, which is pickier about uninitialized-yet-stored values. What were you thoughts here?

Sorry, I don't have enough knowledge in this area to give any insights.

What protocols should we be conforming to? Collection or SetAlgebra? Something more fundamental?

The same as OrderedSet and OrderedDictionary I imagine:

SparseSet<Element>:

  • Conforms to RandomAccessCollection.
  • Partial conformance to MutableCollection and RangeReplaceableCollection since we can't insert or replace arbitrary elements as that could cause duplicates.
  • Partial conformance to SetAlgebra since we can't (sensibly) satisfy all the axioms (e.g. A ⊆ B ⇒ A ∪ B = B).
  • Provides a view of the elements, unordered, conforming to SetAlgebra as in OrderedSet.

SparseDictionary<Key, Value>:

  • Conforms to Sequence. We'd like to conform to RandomAccessCollection but Collection requires an index-based subscript which would conflict with our key-based subscript. Instead we provide a view (elements) - see below.
  • Partial conformance to MutableCollection and RangeReplaceableCollection.
  • Provides a view of the key-value pairs, elements, conforming to RandomAccessCollection.
  • Provides a view of the keys, keys, conforming to RandomAccessCollection.
  • Provides a view of the values, values, conforming to RandomAccessCollection and MutableCollection.

IIRC, deletions violating insertion order was a minor (and useful) optimization in LLVM's implementation. But there are many potential ways of performing deletions. Have you thought about preserving insertion order?

I was under the impression that deletion by filling the gap with the final element is rather fundamental for iteration performance - as opposed to being a minor optimization. What other deletion techniques were considered in LLVM for SparseSet?

Could you explain a bit about how iterations and deletions are handled in LLVM's SparseMultiSet? It looks like the dense storage slots contain value data as well as previous and next indices to implement a doubly linked list (or a doubly linked list for each key?). Doing a similar thing for SparseSet could preserve ordering upon deletion, but would impact iteration performance somewhat, I guess.

I notice that entt has a deletion policy for its sparse set implementation. Maybe someone familiar with C++ could have a look at that.

I was thinking we should refine a sparse set implementation before looking into sparse multi set. Is this a good strategy or are there details of sparse multi set that you think we should be considering as they might impact our implementation of sparse set?

ejmarchant avatar Sep 03 '21 19:09 ejmarchant

protocol IntConvertible: Equatable {
    var intValue: Int { get }
    init(_ value: Int)
}

protocol IndexIdentifiable {
    associatedtype IndexType: IntConvertible
    var indexId: IndexType { get }
}

Adding a protocol is a Big Deal -- we can do it if it makes sense to do so, but there must be a really, really good reason for it.

In this case, it seems to me we might be able to get away with simply requiring Element to be something like Identifiable where ID: FixedWidthInteger. (In practice this will effectively require Element to always be a custom wrapper type; however, having to create such wrappers doesn't seem that big of a usability obstacle -- or if it is, then it applies equally to our existing collections.) If we do use Identifiable, then we need to document the expected nature of the identifiers -- which could be as simple as requiring each element to have a unique & stable id within the collection. (Or we could go further and require elements to be Equatable, with a == b iff a.id == b.id.)
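A sketch of the suggested constraint (a hypothetical declaration, not the proposed API): elements carry their own integer id, and users wrap raw values in a small type to supply it.

```swift
// The collection constrains its elements via the standard Identifiable
// protocol, requiring integer ids.
struct IdentifiableSparseSet<Element: Identifiable> where Element.ID: FixedWidthInteger {
  // storage elided; each element's `id` must be unique and stable
  // within the collection, as discussed above
}

// In practice, Element would typically be a custom wrapper type:
struct Register: Identifiable {
  let id: UInt16
}
```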

lorentey avatar Sep 04 '21 01:09 lorentey

Ah, I thought that might be the case - we can probably use Identifiable where ID: FixedWidthInteger like you say.

It does seem like a bigger usability obstacle compared to the other existing collections though. For example UInt16 doesn't conform to Identifiable so would not be usable as an element in SparseSet without wrapping, but does conform to Hashable so can be used as-is in OrderedSet. It would be unfortunate to have the situation:

let a: OrderedSet<UInt16> = [1, 2, 3, 4] // OK
let b: SparseSet<UInt16> = [1, 2, 3, 4] // Invalid

ejmarchant avatar Sep 05 '21 00:09 ejmarchant

It seems to me SparseSet probably wouldn't be the best choice for storing small integers though -- a fixed-size BitSet would be generally faster* & use far less memory.

  • (Apart from SparseSet's O(1) initialization in case we can leave the sparse buffer uninitialized, which I suspect not to be the case in general.)

lorentey avatar Sep 05 '21 06:09 lorentey

Yes, other data structures are probably more suitable than SparseSet for small universe sizes in general. There is some discussion in Briggs & Torczon of the larger actual cost of the O(1) operations in SparseSet versus a bitset for example. Even so it seems that SparseSet can still be competitive for smallish universe sizes depending on the frequency of forall and clear operations as well as the density of the set.

(I wasn't suggesting that SparseSet would be a good choice for small universes in my previous comment - I reworded and removed the mention of UInt8 to make that a bit clearer hopefully.)

ejmarchant avatar Sep 06 '21 01:09 ejmarchant

We could constrain the key type (for a dictionary-like implementation) or the element (for a set-like implementation) with protocols.

More specifically, we're interested in types that map to the natural numbers with a known max universe size. We can probably support universe sizes determined at runtime without much issue. I'm not sure what advantages a statically known universe size would give us, except perhaps allowing us to choose among different representations, e.g. for small universes.

It might be the case that one type could conform in multiple ways, depending on application/domain. IIRC the C++ template parameter is just the mapping function. But, if we make people define wrapper types if they want the non-default mapping for a type, that's not too crazy either.

or use the existing Identifiable protocol

Probably not, or at least it doesn't seem sufficient for me. If the Identifiable conformance isn't a (compact) mapping to the natural numbers, it's advantageous to map the ids to the nats compactly via another function. If the mapping itself is sparse, then that's hugely wasteful, and UInt.max cannot be a universe size if we're allocating the sparse array.

Even so it seems that SparseSet can still be competitive for smallish universe sizes depending on the frequency of forall and clear operations as well as the density of the set.

Yup, those are very useful properties. Clear is just endIndex = startIndex and iteration is as fast/efficient as possible. Key-based access (assuming mapping to the nats and any theoretical striding have low coefficients) should still be faster than a hash table.

It might be interesting to store the keys separately from the values in the dense vector, to increase value density. But I've never explored those kinds of ideas (perhaps @lorentey knows the tradeoffs).

milseman avatar Sep 14 '21 00:09 milseman

Could you explain a bit about how iterations and deletions are handled in LLVM's SparseMultiSet?

LLVM's SparseSet is a pretty specialized data structure, and SparseMultiSet even more so. A key lookup gives you an iterator which can be used to traverse and update the values for that key (but not values for other keys). Values for all keys are stored flat in the dense vector, but they have a prev/next index to represent the chaining. IIRC it's circular backwards but terminated forwards, because getting to the back from the front is very common (push/pop, etc). Removing a value from that list effectively updates the prev->next and next->prev to skip the value.

The value becomes a tombstone and the new head of the free list (which is singly linked). I recall there being good reasons for not swapping with the last element, but I don't recall what they all were precisely. Off the top of my head, it is considerably more complex to swap with the last element, since that element might be part of a different key's chain, meaning you're unlinking and relinking from multiple chains. This would cause an invalidation of iterators derived from unrelated keys (even a seemingly read-only use of an iterator), which would be surprising and is probably the main reason for the free list.
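The per-slot layout described above might be sketched like this (field names and widths are assumptions, paraphrasing the description of LLVM's SparseMultiSet rather than its actual code):

```swift
// Each dense slot stores the key and value plus prev/next indices that
// chain together all values sharing a key.
struct DenseSlot<Value> {
  var key: UInt32
  var value: Value
  var prev: UInt32  // circular backwards: the head's prev points at the tail
  var next: UInt32  // terminated forwards (sentinel at the tail)
}
```

Unlinking a slot rewrites its neighbors' prev/next, and the slot itself becomes a tombstone heading the singly linked free list, so other keys' chains (and iterators into them) are never disturbed.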

SMS is effectively a replacement for vector<vector<Stuff>> (actually, it was literally a replacement for that in the first place). Each key's value chain is used like a list/vector that will be pushed/popped from. It provides a "distinct" growable data structure per key, so it makes sense to support operating over each "distinct" data structure independently without spooky action at a distance.

I think there's also strong perf reasons in practice. Common programming patterns might have lots of little erase/inserts. Since new tombstones are the new head of the free list, we're reusing the same slots with minimal bookkeeping and without touching more cache lines. Even ignoring that, skipping the extra bookkeeping that SMS requires to move an entry in the dense vector adds up.

Iteration in a multi-set is iteration over the values for a given key. While we might suffer from reduced density due to the presence of tombstones, normal iteration never directly encounters one. If you wanted to iterate over all values, regardless of keys, it's still very fast to skip over tombstones (a common strategy in LLVM data structures) and probably still the better choice for performance vs doing extra bookkeeping on every removal.

SMS could totally support an explicit garbage-collect operation that would in one pass remove all the tombstones and update the relevant bookkeeping for any swapped values (probably more efficiently all at once than amortized, too). This would improve iteration over all stored values and improve density, depending on how many tombstones there were. I don't believe this functionality was ever needed in practice.

milseman avatar Sep 14 '21 22:09 milseman