`BitArray`/`BitSet` need a way to export the underlying bitmaps in some sensible format
We need a good way to extract the underlying bitmaps out of a `BitArray`/`BitSet` value in a reasonably efficient (and reasonably elegant) way. We also need a way to then convert that data back into a `BitArray`/`BitSet` value.
`BitArray` and `BitSet` internally store their contents in what is effectively an array of `UInt64`, making it subject to endianness issues. This makes it unlikely we'd want them to expose their storage directly -- at least, not without overhauling these types to use a simpler representation. (Which is not out of the question, either.)
If we want to preserve the current representation, one possible move is to offer conversions to/from raw buffer pointers (or, preferably, `RawSpan`/`OutputRawSpan`).
```swift
extension BitArray { // And BitSet
  init(bitPattern: UnsafeRawBufferPointer)
  func exportBitPattern(initializing target: UnsafeMutableRawBufferPointer)

  init(bitPattern: RawSpan)
  func exportBitPattern(to target: inout OutputRawSpan)
}
```
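To illustrate the intent, here is a hypothetical round trip through the pointer-based variants above. None of this API exists yet; the buffer-size calculation in particular is a guess, since whether the export is byte- or word-granular is one of the open questions:

```swift
import Collections

// Hypothetical usage of the proposed export/import pair. Assumes a
// byte-granular export; the actual granularity (bytes vs. storage words)
// is still to be decided.
let bits: BitArray = [true, false, true, true]
let byteCount = (bits.count + 7) / 8
let copy = withUnsafeTemporaryAllocation(byteCount: byteCount, alignment: 8) { buffer in
  bits.exportBitPattern(initializing: buffer)
  return BitArray(bitPattern: UnsafeRawBufferPointer(buffer))
}
// `copy` should compare equal to `bits`.
```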
(Names are tricky, as usual.)
These new APIs would need to decide whether to use a little- or big-endian representation for the byte buffers they operate on; alternatively, endianness could be a user-configurable input argument.
The most elegant way of doing this would be to change the underlying representation of these types, and simply add a `var bitPattern: RawSpan` property that directly exposes their storage with O(1) complexity. (That too would need to make a decision on endianness; and in this case it cannot be user-configurable.)
Re endianness, there is also the option (with pros and cons) of copying the approach used in the standard library integer types:
There, an `Int` simultaneously represents both an integer and its bit pattern in native endianness, and the two properties `littleEndian` and `bigEndian` are both of type `Self`: on a little-endian system, a given integer `i` and `i.littleEndian` are the same, whereas `i.bigEndian` gives you a different value.
Applied to `BitArray`, one might imagine `bigEndian` and `littleEndian` operations that give you different bit arrays, with the underlying representation being accessible via a `var bitPattern: RawSpan` property. Such a design allows you to avoid making any change to the underlying representation of the type itself, at the cost of elevating the bit width of the backing storage element type into the user-visible API contract.
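As a declaration-only sketch (hypothetical, not a proposal), the integer-style pattern might look like this:

```swift
extension BitArray {
  /// The same bit array, with storage words laid out in little-endian byte
  /// order. On a little-endian machine this would be `self`; elsewhere it
  /// would byte-swap each storage word, copying the contents.
  var littleEndian: BitArray { get }

  /// The same bit array, with storage words in big-endian byte order.
  var bigEndian: BitArray { get }

  /// A direct, O(1) view of the backing storage.
  var bitPattern: RawSpan { get }
}
```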
That's a nice way to reuse our existing pattern! The drawback would be the cost of having to copy self to produce the result if byte reversal is needed; but then the storage would indeed be directly exposable. (Provided we also overhaul the internal representations.)
I think it would make sense to stick to a specific endianness choice here, no matter the preferences of the target arch. That would give us consistent/dependable performance across all platforms.
The usual ordering of bits within bytes is the up order, where the lowest-order bit is considered to be the first. This suggests a practical preference for "little-endian" byte order for multi-byte bitmaps -- I expect we'd want the first bit in the bitmap to be at the "first"/lowest bit of the first byte, and then continue with an orderly progression. The dual alternative would be to place the bitmap's logical first bit at the "last" (highest) bit of the last byte, which seems much messier, especially for bitmaps with odd sizes.
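The up-ordered, "little-endian" scheme described above can be sketched with a small packing function. (The name `packBits` and the byte-granular layout are illustrative assumptions, not proposed API.)

```swift
// Sketch: pack logical bits into bytes using "little-endian" byte order
// with up-ordered bits -- logical bit i lands in byte i / 8, at bit
// position i % 8 (the lowest-order bit of a byte comes first).
func packBits(_ bits: [Bool]) -> [UInt8] {
  var bytes = [UInt8](repeating: 0, count: (bits.count + 7) / 8)
  for (i, bit) in bits.enumerated() where bit {
    bytes[i / 8] |= 1 << (i % 8)
  }
  return bytes
}

// The logical first bit occupies the lowest-order bit of the first byte:
// packBits([true, false, false, false, false, false, false, false, true])
// == [0b0000_0001, 0b0000_0001]
```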
Then again, consistency is not an objective measure. BMP files are common; their 1-bit variant puts the logical first bit into the highest-order bit of the first byte; so they are using "little-endian" byte order, but combining it with down-ordering for the bits inside each byte. So the representation of a bitmap involves more than a single design choice, and the axes seem to be independently configurable. (Even ignoring all the compression schemes.)
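For contrast, the BMP-style combination looks like this in the same illustrative style (again a sketch, not proposed API):

```swift
// Sketch: BMP-style packing -- bytes still progress forward
// ("little-endian" byte order), but within each byte the logical first
// bit occupies the highest-order position (down-ordering).
func packBitsDownOrder(_ bits: [Bool]) -> [UInt8] {
  var bytes = [UInt8](repeating: 0, count: (bits.count + 7) / 8)
  for (i, bit) in bits.enumerated() where bit {
    bytes[i / 8] |= 1 << (7 - i % 8)
  }
  return bytes
}

// packBitsDownOrder([true]) == [0b1000_0000]
```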
I'll need to revisit competing bitarray/bitset implementations to figure out if there is a clear preexisting convention for data structures like this. I expect there isn't; if so, I think we are free to go with whatever seems most convenient.
IIRC, the STL's `std::bitset` (which is actually more like a fixed-size ~~bitarray~~ bitvector, keeping to the tradition of C++ misnaming its constructs) provides no way to access its contents as a sequence of bytes.