swift-collections icon indicating copy to clipboard operation
swift-collections copied to clipboard

Counted Set

Open Saklad5 opened this issue 2 years ago • 4 comments

Description

A new implicit data structure has been added: an unsigned multiset. This has been discussed extensively on the Swift Forums, most notably here and here.

Detailed Design

Since this is an implicit data structure, I have introduced conformance to RawRepresentable, where the RawValue is Dictionary<Element, UInt>. I have used UInt instead of Int to make it clearer that this is not a signed multiset, which may be added in the future.

A potential point of controversy: I have implemented CountedSet.underestimatedCount as the number of unique elements, since that is indeed guaranteed to be less than or equal to its cardinality.

CountedSet implements both Collection and SetAlgebra, among other protocols.

This is not quite ready for release yet, in my estimation, but the contribution guidelines encourage early pull requests, and the core functionality has been implemented.

Work needs to be done on the Codable implementation: in particular, it should be possible to decode a flat list of objects into a CountedSet, as one of the intended uses of the structure is to act as an unordered Array.

Representations for Mirror, String, and debugging also need to be customized.

Additional API may be warranted as well, since a lot of functionality (like the count of a given element) is currently accessible only by reading the underlying dictionary directly.

/// An unordered, counted multiset.
@frozen
public struct CountedSet<Element: Hashable>: RawRepresentable {
  // Allows internal setter to be referenced from inlined code
  @usableFromInline
  internal var _storage = RawValue()

  @inlinable @inline(__always)
  public var rawValue: [Element: UInt] { _storage }

  @inlinable
  public var isEmpty: Bool { rawValue.isEmpty }

  /// Creates an empty counted set with preallocated space for at least the
  /// specified number of unique elements.
  ///
  /// - Parameter minimumCapacity: The minimum number of elements that the
  ///   newly created counted set should be able to store without reallocating
  ///   its storage buffer.
  @inlinable
  public init(minimumCapacity: Int) {
    self._storage = .init(minimumCapacity: minimumCapacity)
  }

  @inlinable
  public init?(rawValue: [Element: UInt]) {
    guard rawValue.values.allSatisfy({ $0 > .zero }) else { return nil }
    _storage = rawValue
  }

Documentation

I’ve taken care to document all new API at the symbol-level, particularly where it may be confusing to users. However, some of it may benefit from additional usage examples and guidance.

I have not updated the Documentation folder guides yet.

Testing

There is complete test coverage for the Sequence and SetAlgebra implementations, including dedicated tests for each axiom mandated by SetAlgebra. Coverage for other elements is more spotty, but does exist.

Performance

I have not implemented performance benchmarks yet.

Source Impact

This is a purely additive change, so it does not break any existing API.

Checklist

  • [x] I've read the Contribution Guidelines
  • [x] My contributions are licensed under the Swift license.
  • [x] I've followed the coding style of the rest of the project.
  • [ ] I've added tests covering all new code paths my change adds to the project (to the extent possible).
  • [ ] I've added benchmarks covering new functionality (if appropriate).
  • [x] I've verified that my change does not break any existing tests or introduce unexpected benchmark regressions.
  • [ ] I've updated the documentation (if appropriate).

Saklad5 avatar Dec 04 '21 18:12 Saklad5

Thank you for the detailed feedback, I really appreciate it!

Saklad5 avatar Dec 09 '21 03:12 Saklad5

A proper multiset should provide a direct operation to query the multiplicity of a particular member. I.e., we need a generalization of contains that returns an integer instead of a boolean value.

I’ve come up with a few different candidates for how this should be expressed, and I’d like feedback on which to use.

A. CountedMultiset.multiplicity(of:) B. CountedMultiset.count(of:) C. CountedMultiset.subscript(_:)

I’m favoring A, personally.

Saklad5 avatar Dec 12 '21 20:12 Saklad5

It's nice to use the same term consistently where possible. Since this type is being called a counted multiset, it makes sense to me that we should be able to ask about the count of something.

xwu avatar Dec 12 '21 22:12 xwu

It's nice to use the same term consistently where possible. Since this type is being called a counted multiset, it makes sense to me that we should be able to ask about the count of something.

That’s true, but Swift usually uses count to mean cardinality, and that’s not what this would be. I’m concerned it could be easily confused with CountedMultiset.count, the sum of all multiplicities and the cardinality of the collection.

You raise an excellent point about the name, though: I was using CountedSet mainly because Foundation’s (terrible) implementation does. @lorentey said it should be named CountedMultiset instead, but now that I think about it that seems redundant at best and confusing at worst.

I don’t think there is any precedent for the name “counted multiset", and even “counted set" doesn’t seem to show up outside the context of Foundation and derivations upon it. The most common term in academia and other programming languages, by far, is “multiset”, followed by “bag”.

@lorentey: Instead of CountedSet or CountedMultiset, how about simply Multiset? That seems a lot clearer, and lends itself naturally to proposed variations like SortedMultiset or even SignedMultiset.

Saklad5 avatar Dec 12 '21 22:12 Saklad5