Improvements to event observer management
I am relaying this suggestion from a user:
Currently, sending an event to an observer is a blocking action, meaning that a slow or unresponsive observer can block the stacks-node. This is intended behavior, ensuring that the observer receives all events, but it could be handled without blocking the core functionality of the node. The proposal is to make more use of the `event_observers.sqlite` DB: the node could write the event to the DB, then signal to another thread that the new event is ready to be sent. That thread may be blocked by the observer, but the node could continue with its normal activities after writing the event into the DB.
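To make the proposal concrete, here is a minimal sketch of the write-then-signal pattern, using an in-memory queue as a stand-in for the `event_observers.sqlite` table and a `Condvar` as the wake-up mechanism. All names (`PendingEvents`, `enqueue`, `dequeue`) are illustrative, not the actual stacks-node API:

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Hypothetical sketch: the Mutex<VecDeque> stands in for the
// event_observers.sqlite table.
struct PendingEvents {
    queue: Mutex<VecDeque<String>>,
    wake: Condvar,
}

impl PendingEvents {
    fn new() -> Arc<Self> {
        Arc::new(PendingEvents {
            queue: Mutex::new(VecDeque::new()),
            wake: Condvar::new(),
        })
    }

    /// Called by the node: persist the event, then signal the sender
    /// thread. Returns immediately, so the node is never blocked by a
    /// slow observer.
    fn enqueue(&self, payload: String) {
        self.queue.lock().unwrap().push_back(payload);
        self.wake.notify_one();
    }

    /// Called by the sender subthread: wait for work, then take the
    /// oldest pending event.
    fn dequeue(&self) -> String {
        let mut q = self.queue.lock().unwrap();
        while q.is_empty() {
            q = self.wake.wait(q).unwrap();
        }
        q.pop_front().unwrap()
    }
}

fn main() {
    let pending = PendingEvents::new();
    let sender = Arc::clone(&pending);
    let handle = thread::spawn(move || {
        // Simulate a slow observer: delivery takes a while, but the
        // node thread below is unaffected.
        let event = sender.dequeue();
        thread::sleep(Duration::from_millis(50));
        event
    });
    pending.enqueue("block_event_1".to_string());
    let delivered = handle.join().unwrap();
    assert_eq!(delivered, "block_event_1");
}
```

In the real design the queue would be the SQLite table, so pending events also survive a node restart.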
One caveat is that the event observer could fall behind the node, and cause the `event_observers.sqlite` DB to grow unbounded. How do you imagine that we'd address this?
Couple of thoughts, in random order:
- The status quo is that the node just stops working altogether. Isn't that worse?
- If this is really a practical concern, we could put an upper limit on the number of pending events before we start blocking again (or even crashing? I'm not yet familiar enough with everything to have an intuition on whether that would be appropriate).
- My assumption (which may be totally wrong, please correct me!) is that event DB growth would be O(chainstate growth) with a reasonable constant, and thus acceptable in practice
- We could make the `process_pending_payloads` call blocking. That way, at least on node restart, the owner gets the message "fix your observer or unregister it now"
- A common pattern for webhooks (which these ultimately are) is that after a certain number of failures, the caller stops sending them. This isn't 100% the same thing because the same entity probably controls receiver and sender, but it might still be an option to consider.
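The "upper limit on pending events" idea above could be as simple as refusing new rows once the backlog crosses a cap, leaving the caller to decide whether to block, warn, or crash. A minimal sketch, with an in-memory queue standing in for the DB table and all names (`BoundedPending`, `try_enqueue`, `MAX_PENDING`) being illustrative:

```rust
use std::collections::VecDeque;

// Hypothetical cap; in practice this would be configurable.
const MAX_PENDING: usize = 3;

struct BoundedPending {
    queue: VecDeque<String>,
}

impl BoundedPending {
    fn new() -> Self {
        BoundedPending { queue: VecDeque::new() }
    }

    /// Try to persist an event; refuse once the observer has fallen
    /// too far behind, so the DB cannot grow unbounded.
    fn try_enqueue(&mut self, payload: String) -> Result<(), String> {
        if self.queue.len() >= MAX_PENDING {
            return Err(format!(
                "observer is {} events behind; fix or unregister it",
                self.queue.len()
            ));
        }
        self.queue.push_back(payload);
        Ok(())
    }
}

fn main() {
    let mut pending = BoundedPending::new();
    for i in 0..MAX_PENDING {
        assert!(pending.try_enqueue(format!("event_{i}")).is_ok());
    }
    // The next enqueue trips the limit.
    assert!(pending.try_enqueue("event_overflow".to_string()).is_err());
}
```

Whether the `Err` should translate into blocking, an error log, or a crash is exactly the open question in the thread.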
I'd be happy with just an error log if the database grows too large. If that's not noticeable enough, we could stall the node at some point, but I'm not sure if that's necessary.
Okay, here's my current thinking. Although every time I write something down, I realize another edge case 😬
- make `EventObserver` into just a dumb struct without a lot of functionality (I think that used to be the case once?) -- just a copy of the observer config
- `EventDispatcher` is where all the magic happens
- `EventDispatcher::new` spins up a single extra thread and stores a mechanism to wake up that subthread
- (event dispatchers get cloned in a couple of places, but there is only a single subthread)
- (want to avoid message passing because an unbounded channel to a blocked subthread would leak memory)
- (there should probably be a reference counting-based mechanism that tears down the subthread when the final `EventDispatcher` is dropped -- probably only relevant in tests)
- (TODO: figure out how to handle multiple event dispatchers that would be talking to the same DB -- some sort of unique identifier on DB records?)
- The subthread grabs a payload from the DB (oldest first), makes the request, deletes on success.
- `disable_retry` is added as a DB column so the subthread can honor it
- (could add a time stamp to the DB; the subthread could figure out how long a message has been pending and issue a warning if that delay gets long)
- If there's no DB, HTTP requests continue on the main thread (I think this only happens in tests?)
- `process_pending_payloads` on startup is still blocking
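The subthread's delivery loop described above (oldest first, delete on success, honor `disable_retry`) could be sketched as follows, with an in-memory queue standing in for the DB and a closure faking the HTTP request; `PendingPayload` and `drain_pending` are hypothetical names, not the actual stacks-node API:

```rust
use std::collections::VecDeque;

// Hypothetical row shape for the pending-events table.
struct PendingPayload {
    body: String,
    disable_retry: bool, // mirrored as a DB column in the proposal
}

/// Returns the payloads that were actually delivered, in order.
fn drain_pending(
    queue: &mut VecDeque<PendingPayload>,
    send: impl Fn(&str) -> bool, // stand-in for the HTTP POST
) -> Vec<String> {
    let mut delivered = Vec::new();
    while let Some(payload) = queue.front() {
        if send(&payload.body) {
            // Success: "delete the row" and move on.
            delivered.push(queue.pop_front().unwrap().body);
        } else if payload.disable_retry {
            // Failure, but retries are disabled: drop the row.
            queue.pop_front();
        } else {
            // Failure with retries enabled: stop and try again later,
            // which preserves in-order delivery.
            break;
        }
    }
    delivered
}

fn main() {
    let mut queue = VecDeque::from([
        PendingPayload { body: "a".to_string(), disable_retry: false },
        PendingPayload { body: "b".to_string(), disable_retry: true },
        PendingPayload { body: "c".to_string(), disable_retry: false },
    ]);
    // Observer accepts everything except "b".
    let delivered = drain_pending(&mut queue, |body| body != "b");
    assert_eq!(delivered, vec!["a", "c"]);
    assert!(queue.is_empty());
}
```

Note how the `break` on a retryable failure is what gives the "single broken subscriber blocks everything behind it" behavior discussed below.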
With this, a single broken event subscriber will still block delivery of all other events, because there's just a single sender thread. I think that makes sense, because there seems to be an expectation that events are delivered in order. It might also make it more obvious to the operator that something's wrong.
Let me know if you have any thoughts; I'll start with trying to implement something on Monday.