
Improvements to event observer management

Open · obycode opened this issue 2 months ago · 3 comments

I am relaying this suggestion from a user:

Currently, sending an event to an observer is a blocking action, meaning that a slow or unresponsive observer can block the stacks-node. This is intentional, to ensure that the observer receives every event, but it could be handled without blocking the node's core functionality. The proposal is to make more use of the event_observers.sqlite DB: the node would write the event to the DB, then signal to another thread that a new event is ready to be sent. That thread may be blocked by the observer, but the node can continue with its normal activities as soon as the event is written to the DB.
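
Roughly, the flow could look like the sketch below. The helper names (`write_event_to_db`, `next_pending_event_from_db`, `send_to_observer`) are placeholders for illustration, not existing stacks-core APIs:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Shared wake-up signal: a count of events written to the DB but not yet delivered.
type PendingSignal = Arc<(Mutex<u64>, Condvar)>;

// Hypothetical helpers. The real code would persist to event_observers.sqlite
// and POST the payload to the observer's registered endpoint.
fn write_event_to_db(payload: &str) {
    println!("persisted event: {payload}");
}
fn next_pending_event_from_db() -> Option<String> {
    None // placeholder: read the oldest undelivered row
}
fn send_to_observer(payload: &str) {
    // This is the call that can block for a long time if the observer is slow.
    println!("delivered event: {payload}");
}

fn spawn_delivery_thread(signal: PendingSignal) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        // Sleep until the node signals that a new event has been written.
        let (lock, cvar) = &*signal;
        let mut pending = lock.lock().unwrap();
        while *pending == 0 {
            pending = cvar.wait(pending).unwrap();
        }
        *pending -= 1;
        drop(pending); // release the lock before the (possibly slow) HTTP call

        if let Some(payload) = next_pending_event_from_db() {
            send_to_observer(&payload);
        }
    })
}

fn main() {
    let signal: PendingSignal = Arc::new((Mutex::new(0), Condvar::new()));
    let _worker = spawn_delivery_thread(signal.clone());

    // Node side: write the event, bump the counter, and move on immediately.
    write_event_to_db("{\"block\": 123}");
    let (lock, cvar) = &*signal;
    let mut pending = lock.lock().unwrap();
    *pending += 1;
    drop(pending);
    cvar.notify_one();

    // ...the node continues its normal work here, never blocked by the observer.
    thread::sleep(Duration::from_millis(100));
}
```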

obycode · Sep 30 '25 16:09

One caveat is that the event observer could fall behind the node, and cause the event_observers.sqlite DB to grow unbounded. How do you imagine that we'd address this?

jcnelson · Dec 10 '25 15:12

> One caveat is that the event observer could fall behind the node, and cause the event_observers.sqlite DB to grow unbounded. How do you imagine that we'd address this?

A couple of thoughts, in no particular order:

  • The status quo is that the node just stops working altogether. Isn't that worse?
  • If this is really a practical concern, we could put an upper limit on the number of pending events before we start blocking again (or even crashing? I'm not yet familiar enough with everything to have an intuition on whether that would be appropriate). See the rough sketch after this list.
  • My assumption (which may be totally wrong, please correct me!) is that event DB growth would be O(chainstate growth) with a reasonable constant, and thus acceptable in practice
  • We could make the process_pending_payloads call blocking. That way, at least on node restart, the owner gets the message "fix your observer or unregister it now"
  • A common pattern for webhooks (which these ultimately are) is that after a certain number of failures, the caller stops sending them. This isn't 100% the same thing because the same entity probably controls receiver and sender, but it might still be an option to consider.
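
As a very rough sketch of the upper-limit idea from the second bullet, assuming a hypothetical count_pending_payloads helper and a made-up cap (neither exists today):

```rust
/// Hypothetical cap on the number of undelivered payloads per observer.
const MAX_PENDING_PAYLOADS: u64 = 10_000;

/// Placeholder for counting undelivered rows in event_observers.sqlite.
fn count_pending_payloads() -> u64 {
    0
}

/// Returns true if the node should fall back to the old blocking behavior
/// (or log loudly) because the observer has fallen too far behind.
fn observer_is_too_far_behind() -> bool {
    count_pending_payloads() >= MAX_PENDING_PAYLOADS
}

fn main() {
    if observer_is_too_far_behind() {
        eprintln!(
            "WARN: more than {MAX_PENDING_PAYLOADS} pending event payloads; \
             blocking until the observer catches up"
        );
        // block / stall here, as in the current behavior
    } else {
        // enqueue the payload and continue without blocking
    }
}
```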

benjamin-stacks · Dec 11 '25 13:12

I'd be happy with just an error log if the database grows too large. If that's not noticeable enough, we could stall the node at some point, but I'm not sure if that's necessary.
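
For the error-log variant, something as simple as a file-size check could be enough; the path and threshold below are made up for illustration:

```rust
use std::fs;
use std::path::Path;

/// Hypothetical threshold; a real value would come from node config.
const MAX_EVENT_DB_BYTES: u64 = 1_073_741_824; // 1 GiB

/// Log an error if the event observer DB has grown past the threshold.
fn check_event_db_size(db_path: &Path) {
    if let Ok(meta) = fs::metadata(db_path) {
        if meta.len() > MAX_EVENT_DB_BYTES {
            eprintln!(
                "ERROR: event observer DB at {} is {} bytes (limit {}); \
                 an observer is likely stuck or unreachable",
                db_path.display(),
                meta.len(),
                MAX_EVENT_DB_BYTES
            );
        }
    }
}

fn main() {
    check_event_db_size(Path::new("event_observers.sqlite"));
}
```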

brice-stacks · Dec 12 '25 01:12

Okay, here's my current thinking, although every time I write something down, I realize another edge case 😬

  • make EventObserver into just a dumb struct without a lot of functionality (I think that used to be the case once?) -- just a copy of the observer config
  • EventDispatcher is where all the magic happens
  • EventDispatcher::new spins up a single extra thread and stores a mechanism to wake up that subthread
  • (event dispatchers get cloned in a couple of places, but there is only a single subthread)
  • (want to avoid message passing because an unbounded channel to a blocked subthread would leak memory)
  • (there should probably be a reference counting-based mechanism that tears down the subthread when the final EventDispatcher is dropped -- probably only relevant in tests)
  • (TODO: figure out how to handle multiple event dispatchers that would be talking to the same DB -- some sort of unique identifier on DB records?)
  • The subthread grabs a payload from the DB (oldest first), makes the request, and deletes it on success (see the sketch after this list).
  • disable_retry is added as a DB column so the subthread can honor it
  • (could add a timestamp to the DB; the subthread could figure out how long a message has been pending and issue a warning if that delay gets long)
  • If there's no DB, HTTP requests continue on the main thread (I think this only happens in tests?)
  • process_pending_payloads on startup is still blocking
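
Here's a rough sketch of what the subthread's loop could look like, using rusqlite directly. The table name, columns (including the proposed disable_retry and created_at timestamp columns), and the deliver helper are assumptions for illustration, not the actual event_observers.sqlite schema:

```rust
use std::{thread, time::Duration};

use rusqlite::{params, Connection, OptionalExtension};

/// Hypothetical delivery function: POST the payload to the observer's endpoint
/// and return true on success. This is where the subthread may block for a
/// long time, without holding up the rest of the node.
fn deliver(url: &str, payload: &str) -> bool {
    println!("POST {url}: {payload}");
    true
}

fn now_secs() -> i64 {
    use std::time::{SystemTime, UNIX_EPOCH};
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs() as i64)
        .unwrap_or(0)
}

fn run_delivery_loop(conn: &Connection) -> rusqlite::Result<()> {
    loop {
        // Oldest pending payload first, so events stay in order.
        let row: Option<(i64, String, String, bool, i64)> = conn
            .query_row(
                "SELECT id, url, payload, disable_retry, created_at
                 FROM pending_payloads ORDER BY id ASC LIMIT 1",
                [],
                |r| Ok((r.get(0)?, r.get(1)?, r.get(2)?, r.get(3)?, r.get(4)?)),
            )
            .optional()?;

        let Some((id, url, payload, disable_retry, created_at)) = row else {
            // Nothing pending; in the real dispatcher we'd wait on the wake-up
            // mechanism instead of polling.
            thread::sleep(Duration::from_millis(250));
            continue;
        };

        // Warn if this payload has been stuck for a long time.
        let age_secs = now_secs().saturating_sub(created_at);
        if age_secs > 300 {
            eprintln!("WARN: payload {id} has been pending for {age_secs}s");
        }

        let delivered = deliver(&url, &payload);
        if delivered || disable_retry {
            // Delete on success, or when retries are disabled for this observer.
            conn.execute("DELETE FROM pending_payloads WHERE id = ?1", params![id])?;
        } else {
            // Failed and retries are enabled: back off and try the same row again.
            thread::sleep(Duration::from_secs(1));
        }
    }
}

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open("event_observers.sqlite")?;
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pending_payloads (
             id INTEGER PRIMARY KEY AUTOINCREMENT,
             url TEXT NOT NULL,
             payload TEXT NOT NULL,
             disable_retry INTEGER NOT NULL DEFAULT 0,
             created_at INTEGER NOT NULL
         )",
        [],
    )?;
    run_delivery_loop(&conn)
}
```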

With this, a single broken event subscriber will still block delivery of all other events, because there's just a single thread. I think that makes sense, because there seems to be an expectation that events are delivered in order. It might also make it more obvious to the operator that something's wrong.

Let me know if you have any thoughts; I'll start with trying to implement something on Monday.

benjamin-stacks · Dec 12 '25 15:12