
Unify pre-rollback subscribers

Open snarfed opened this issue 7 months ago • 7 comments

Right now, when the firehose collector (#30) starts up, it preloads a bit of the rollback window, but subscribers whose cursors are before that preload each read their events individually, on demand. This works ok, but duplicates work across those subscribers until one of them hits the preload window and merges in its events.

We could unify these historical cursors by making them aware of each other's windows and merging them as soon as they overlap. It would take some delicate work to synchronize right, and it's not really a problem right now, but it would be nice to keep from pegging CPU for the first hour or two after startup.
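A rough sketch of the "merge as soon as they overlap" idea, assuming each historical cursor's window can be described as a half-open `(start_seq, end_seq)` range. `merge_windows` and the tuple representation are hypothetical, just for illustration, and this ignores the synchronization part entirely:

```python
def merge_windows(windows):
    """Merge overlapping or touching (start_seq, end_seq) ranges.

    Hypothetical helper: each tuple is a half-open range of sequence
    numbers that one pre-rollback subscriber has loaded or is loading.
    """
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # This window overlaps (or touches) the previous one, so extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Subscribers whose windows collapse into the same merged range could then share one set of events instead of each reading them separately.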

snarfed avatar May 14 '25 01:05 snarfed

This is a bit more acute than I realized. Misconfigured subscribers always start with the same old cursor, and if they don't get through the full rollback window within our 1h request deadline, we bounce and restart them. That means their windows don't get merged in until the rollback window organically advances far enough, which will take 6-12h. So CPU will stay pegged at 100% until then. 😕

snarfed avatar May 14 '25 03:05 snarfed

In the meantime, I added an occasional 10ms sleep to pre-rollback subscribers to try to keep them from starving the firehose consumer thread, 1dc0e0b6f5342eaf168a7d48d8db29afc09086dc. Seems to be working.
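For illustration only, the general shape of an occasional sleep like this might look like the sketch below. This is not the code from that commit; `throttled`, `every`, and `pause` are made-up names and values. The point is just that `time.sleep` releases the GIL, which gives other threads a chance to run:

```python
import time


def throttled(events, every=100, pause=0.01, sleep=time.sleep):
    """Yield events, pausing briefly after every `every` events.

    Hypothetical sketch: `every` is an assumption, and `pause` defaults
    to 10ms. `sleep` is injectable so the behavior can be tested.
    """
    for n, event in enumerate(events, start=1):
        yield event
        if n % every == 0:
            # Sleeping releases the GIL so other threads can make progress.
            sleep(pause)
```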

snarfed avatar May 14 '25 17:05 snarfed

Damn, it wasn't enough. Even with the sleeps, our consumer thread falls behind once we hit 9 subscribers. Guess I need to prioritize this after all.

snarfed avatar May 14 '25 20:05 snarfed

Design notes:

  • Dedicated collector threads for each window. Start one when we start a new window.
  • No preload?
  • Use a thread-safe consumer-oriented structure for each window, eg Queue.
  • Not sure how much to unify window collector and head collector, since head collector needs to eg wait for skipped seqs, log, etc but window collectors don't.
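A minimal sketch of the first and third notes together, assuming a dedicated thread per window that pushes into a `queue.Queue`. `fetch_events`, `start_window_collector`, and `drain` are hypothetical names, and the queue size is arbitrary:

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a window


def collect_window(fetch_events, start_seq, end_seq, out):
    """Load one window's events and push them onto its queue.

    fetch_events is a hypothetical callable that yields the events
    with start_seq <= seq < end_seq.
    """
    for event in fetch_events(start_seq, end_seq):
        out.put(event)  # blocks if the queue is full
    out.put(DONE)


def start_window_collector(fetch_events, start_seq, end_seq, maxsize=1000):
    """Start a dedicated collector thread for a new window."""
    out = queue.Queue(maxsize=maxsize)
    thread = threading.Thread(
        target=collect_window,
        args=(fetch_events, start_seq, end_seq, out),
        daemon=True,
    )
    thread.start()
    return out, thread


def drain(q):
    """Read events from a window's queue until the sentinel arrives."""
    events = []
    while (item := q.get()) is not DONE:
        events.append(item)
    return events
```

One caveat with this shape: a `Queue` hands each item to exactly one consumer, so several subscribers sharing a window would need some kind of fan-out on top, eg one queue per subscriber or a shared list plus a condition variable.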

snarfed avatar May 14 '25 21:05 snarfed

Fwiw right now, with just one pre-rollback subscriber, it takes us ~1h to load a full ~45k (50k - 4k preload) rollback window.

Image

snarfed avatar May 19 '25 22:05 snarfed

Maybe deprioritizing? We're holding pretty steady at ~7 relays connected these days, and the last few restarts, we've loaded the rollback window pretty easily, within an hr or so.

snarfed avatar Jul 28 '25 00:07 snarfed

So weird. We always manage to fill the rollback window quickly when the hub restarts on its own, but when we deploy a new version, it consistently takes many hours. I don't understand the difference.

Image

snarfed avatar Aug 05 '25 18:08 snarfed