swift-collections SegmentedLine

I had a need for a data structure recently and couldn't find any good package implementations. I think it might be a nice addition to this package.

Essentially, the problem was that I needed to generate some Unicode data tables. This required a structure that allowed me to model the entire 21-bit space of Unicode Code Points, and essentially "paint" regions with data I parsed from the Unicode files. The second part was that I had to optimise the data - for example, of the dozen or so Bidi_Class values, I only needed data to group scalars in to 4 categories. Certain code points would be disallowed by earlier stages of the process and could be skipped, and I wanted to combine data sets from different files in order to reduce lookups. At the same time, Unicode scalars are sparse; most of that 21-bit space is unassigned.

Anyway, I had a brief look at the various tree candidates and nothing really suited me. I don't really care so much about lookup time, since the emphasis was on creating something that could support flexible editing operations. I eventually made a simple sort of number-line, which, well, paints regions of a comparable space with a value:

var line = SegmentedLine<Int, String?>(bounds: 0..<100, value: nil)

// After setting values <5 to "small" and values >10 to "large",
// the gap is left with its previous value, "medium".

line.set(0..<20,  to: "medium")
line.set(0..<5,   to: "small")
line.set(10..<60, to: "large")
print(line)
// | [0..<5]: "small" | [5..<10]: "medium" | [10..<60]: "large" | [60..<100]: nil |

let string = "Bob is feeling great"

// Create a SegmentedLine for the collection's contents.
// Start by setting a font attribute over the entire string.

var tags = SegmentedLine(
  bounds: string.startIndex..<string.endIndex,
  value: [Font.custom("Comic Sans")] as [Any]
)

// Set each word to a different color.
// Use 'modify' to append the attribute, but only for the region
// we're modifying.

for word: Substring in string.split(separator: " ") {
  tags.modify(word.startIndex..<word.endIndex) { attributes in
    attributes.append(Color.random())
  }
}

// Check the result.
// - ✅ Every segment still contains the font attribute.
// - ✅ Each word also contains its individual color attribute.

for (range, attributes) in tags.segments {
  print(#""\#(string[range])""#, "-", attributes)
}

// "Bob"     [Font.custom("Comic Sans"), Color.orange]
// " "       [Font.custom("Comic Sans")]
// "is"      [Font.custom("Comic Sans"), Color.green]
// " "       [Font.custom("Comic Sans")]
// "feeling" [Font.custom("Comic Sans"), Color.pink]
// " "       [Font.custom("Comic Sans")]
// "great"   [Font.custom("Comic Sans"), Color.yellow]

// ℹ️ Imagine we have a complex SegmentedLine with lots of small segments
//    capturing granular details, and we'd like to simplify it.

enum ComplexData {
   case categoryA, categoryB, categoryC // ...
}
let complexLine: SegmentedLine<Int, ComplexData> = // ...
print(complexLine)
// | [0..<2]: categoryA | [2..<4]: categoryB | [4..<12]: categoryC | ...

// 1️⃣ Perhaps we can map these to a smaller number of states.

enum SimplifiedData {
   case valid, invalid
}
var simplifiedLine = complexLine.mapValues { complex in
  SimplifiedData(validating: complex)
}
print(simplifiedLine)
// | [0..<2]: valid | [2..<4]: valid | [4..<12]: valid | ...

// 2️⃣ Notice that we have lots of segments for boundaries which
//    which are no longer important. 'combineSegments' can clean them up.

simplifiedLine.combineSegments()
print(simplifiedLine)
// | [0..<2000]: valid | [2000..<2024]: invalid | [2024..<2056]: valid | ...

Implementation: https://github.com/karwa/swift-url/blob/df08ccec114350c4bb845d28c5d5850c20521cca/Sources/UnicodeDataStructures/Shared/SegmentedLine.swift

It has been super helpful to have this thing. In particular, to do something like reduce the Bidi_Class values, there are 2 dead-simple operations to map the elements, and then to perform a kind of restartable left-fold to gather elements in to larger regions. Like this. I actually generate an indexed static data table straight from the SegmentedLine. It's pretty nifty, I think.

It's nothing particularly groundbreaking, but it has been surprisingly effective and, as I said, I couldn't find any good package solution to solve this. I think it's an interesting design space to think of operations that this kind of structure could make easier.

May 21 '22 17:05 karwa

Cc @Azoy

May 23 '22 21:05 lorentey

BTW, I've updated the implementation as I continue to iterate on it (there were some overly-zealous precondition checks, etc). I do also have a full test suite for the version I use, which I can contribute if it's considered a fit for this package.

Also added another example - a DIY AttributedString:

/// A DIY AttributedString from a `{ Collection, RangeTable<Collection.Index, Tag?> }` pair.
///
struct MyAttributedString<Tag> {

  var str: String {
    // Reset all tags any time the string changes (because our indexes become invalid).
    // Obviously not ideal and in a real AttributedString we wouldn't provide direct, mutable access to this.
    didSet { tags = .init(bounds: str.startIndex..<str.endIndex, initialValue: nil) }
  }

  var tags: RangeTable<String.Index, Tag?>

  init() {
    str = ""
    tags = .init(bounds: str.startIndex..<str.endIndex, initialValue: nil)
  }

  var regions: [(Substring, Tag?)] {
    tags.spans.map { range, value in (str[range], value) }
  }
}

var s = MyAttributedString<Int>()
s.str = "hello, world!"
s.tags.set(s.str.startIndex..<s.str.dropFirst(5).startIndex, to: 42)
s.tags.set(s.str.dropLast(5).endIndex..<s.str.endIndex, to: -1)

let expected: [(Substring, Int?)] = [
  ("hello", 42),
  (", w", nil),
  ("orld!", -1),
]
assert(s.regions.elementsEqual(expected, by: { $0.0 == $1.0 && $0.1 == $1.1 }))
s.regions.forEach { print($0) }

Dealing with mutations in such a type relies a lot on how the collection's indexes behave, but it's still useful as a generic utility even if the collection is immutable (you can still tag its data!).

Other interesting uses are timelines - tagging regions of time with "appointments" or other time-centric data, and traversing and manipulating that timeline using its natural coordinates.

May 24 '22 11:05 karwa

@lorentey I added a modify operation and some better examples to the original post, along with a link to the finished implementation (now called SegmentedLine). There are tests in the package, too, of course.

Jun 09 '22 16:06 karwa