Unify pre-rollback subscribers
Right now, when the firehose collector (#30) starts up, it preloads a bit of the rollback window, but subscribers whose cursors are before that preloaded range each read their events individually, on demand. This works ok, but it duplicates work across those subscribers until one of them hits the preload window and merges in its events.
We could unify these historical cursors by making them aware of each other's windows and merging them as soon as they overlap. It would take some delicate work to synchronize right, and it's not really a problem right now, but it would be nice to keep from pegging CPU for the first hour or two after startup.
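To make the overlap idea concrete, here's a rough sketch of the merge check on its own, with none of the synchronization. `Window`, its fields, and `merge_overlapping` are all hypothetical names, not anything in the current code; it assumes each historical window covers a contiguous range of seqs.

```python
from dataclasses import dataclass, field


@dataclass
class Window:
    """Hypothetical historical window covering seqs in [start, end)."""
    start: int
    end: int
    subscribers: set = field(default_factory=set)


def merge_overlapping(windows):
    """Return a new list where windows that overlap or touch are merged.

    Input windows are not modified.
    """
    merged = []
    for win in sorted(windows, key=lambda w: w.start):
        last = merged[-1] if merged else None
        if last and win.start <= last.end:
            # win starts inside (or right at the end of) the previous window,
            # so one reader can serve both sets of subscribers from here on
            last.end = max(last.end, win.end)
            last.subscribers |= win.subscribers
        else:
            merged.append(Window(win.start, win.end, set(win.subscribers)))
    return merged
```

The hard part is doing this while readers are actively advancing `end`, which this sketch ignores entirely.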
This is a bit more acute than I realized. For misconfigured subscribers that always start with the same old cursor, if they don't get through the full rollback window in our 1h request deadline, we bounce and restart them, so their windows don't get merged in until the rollback window organically advances far enough, which will take 6-12h. So CPU will stay pegged at 100% until then. 😕
In the meantime, I added an occasional 10ms sleep to pre-rollback subscribers to try to keep them from starving the firehose consumer thread, 1dc0e0b6f5342eaf168a7d48d8db29afc09086dc. Seems to be working.
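For anyone reading along without the commit open, the idea is roughly the following. This is an illustrative sketch, not the code from that commit: `read_with_yield` and `SLEEP_EVERY` are made-up names, and the every-100-events interval is an assumption. Only the 10ms figure comes from above.

```python
import time

SLEEP_EVERY = 100   # assumed interval: events read between sleeps
SLEEP_SECS = 0.01   # 10ms


def read_with_yield(events, sleep=time.sleep):
    """Yield events, pausing briefly every SLEEP_EVERY events.

    time.sleep releases the GIL, so other threads get a chance to run
    during each pause.
    """
    for i, event in enumerate(events, start=1):
        yield event
        if i % SLEEP_EVERY == 0:
            sleep(SLEEP_SECS)
```

The `sleep` parameter is only there so the pause can be swapped out in tests.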
Damn, it wasn't enough. Even with the sleeps, our consumer thread falls behind once we hit 9 subscribers. Guess I need to prioritize this after all.
Design notes:
- Dedicated collector threads for each window. Start one when we start a new window.
- No preload?
- Use a thread-safe consumer-oriented structure for each window, eg Queue.
- Not sure how much to unify window collector and head collector, since head collector needs to eg wait for skipped seqs, log, etc but window collectors don't.
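Rough sketch of the thread-per-window idea from the notes above, using the standard library's `queue.Queue`. Everything here is hypothetical (`collect`, `start_window_collector`, `drain`, the `DONE` sentinel, the `maxsize` default), and `read_events` stands in for whatever actually reads a window's events.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a window's events


def collect(read_events, out):
    """Runs in the window's dedicated thread: read events and enqueue them."""
    for event in read_events():
        out.put(event)  # blocks while the queue is full, which gives backpressure
    out.put(DONE)


def start_window_collector(read_events, maxsize=1000):
    """Start a collector thread for a new window; returns (queue, thread)."""
    out = queue.Queue(maxsize=maxsize)
    thread = threading.Thread(target=collect, args=(read_events, out), daemon=True)
    thread.start()
    return out, thread


def drain(q):
    """Consumer side: yield events from the queue until the sentinel arrives."""
    while (event := q.get()) is not DONE:
        yield event
```

One caveat: a single `Queue` hands each event to exactly one consumer, so if multiple subscribers share a window, this would need one queue per subscriber or some other fan-out structure.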
Fwiw right now, with just one pre-rollback subscriber, it takes us ~1h to load a full ~45k (50k - 4k preload) rollback window.
Maybe deprioritizing? We're holding pretty steady at ~7 relays connected these days, and the last few restarts, we've loaded the rollback window pretty easily, within an hr or so.
So weird. We always manage to fill the rollback window quickly when the hub restarts on its own, but when we deploy a new version, it consistently takes many hours. I don't understand the difference.