Introduce a `DatabaseLifecycleTracker`, which tracks database lifecycles
Description of Changes
Fixes #2630.
Perhaps it should be called `DatabaseLifecycleManager`?
This new object is responsible for tracking the lifecycle of a database, and for cleaning up after the database exits. In particular, it:
- Unregisters the `Host` from the containing `HostController`. This was previously handled by an ad-hoc on-panic callback closure.
- Aborts the database memory usage metrics reporter task. This was previously handled by a `Drop` method on `Host`.
- Disconnects all connected WebSocket clients. Previously, this didn't happen at all, as per issue #2630.
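For context, a rough sketch of the shape such a tracker could take is shown below; the type, field, and method names here are illustrative assumptions, not the actual implementation in this PR:

```rust
use std::sync::{Arc, Mutex};

// Cleanup actions registered while the database is live.
type CleanupFn = Box<dyn FnOnce() + Send + 'static>;

/// Illustrative sketch only: collects cleanup actions and runs each of them
/// exactly once when the database exits, whether normally or via panic.
struct DatabaseLifecycleTracker {
    cleanup: Mutex<Vec<CleanupFn>>,
}

impl DatabaseLifecycleTracker {
    fn new() -> Arc<Self> {
        Arc::new(Self { cleanup: Mutex::new(Vec::new()) })
    }

    /// Register an action, e.g. unregister from the `HostController`,
    /// abort the metrics reporter task, or disconnect WebSocket clients.
    fn on_exit(&self, f: CleanupFn) {
        self.cleanup.lock().unwrap().push(f);
    }

    /// Run all registered actions. Draining the `Vec` guarantees each action
    /// runs at most once, even if this is called more than once.
    fn exited(&self) {
        let actions = std::mem::take(&mut *self.cleanup.lock().unwrap());
        for f in actions {
            f();
        }
    }
}
```

The point of this shape is that the panic path and the normal shutdown path can share the same teardown logic: actions accumulate while the database is live and are drained once on exit.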
I've also added some commentary to the WebSocket actor loop.
Follow-up commits will add tests once I've consulted with the team about how best to test this change.
API and ABI breaking changes
N/A
Expected complexity level and risk
At least 3. Database lifecycle management is complicated, and it's possible I've either missed cleaning up some resource or cleaned one up too eagerly. There's also a lock around the `DatabaseLifecycleTracker` which is accessed from several locations and at one point held across an `.await`, which could potentially introduce deadlocks.
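As a hypothetical illustration of that deadlock concern (sketched with `tokio::sync::Mutex` and placeholder functions, not the PR's actual locking code), the risk arises when the awaited work can itself end up waiting on the same lock:

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

async fn some_async_cleanup() {
    // Placeholder for, e.g., sending a close frame to a WebSocket client.
}

/// Hazard: the lock is still held across the `.await`. If the awaited work
/// (directly or through another task) tries to take the same lock, the two
/// can end up waiting on each other.
async fn risky(tracker: Arc<Mutex<Vec<String>>>) {
    let mut guard = tracker.lock().await;
    guard.push("disconnecting clients".into());
    some_async_cleanup().await;
    // `guard` is only dropped here.
}

/// Safer pattern: take what you need out of the lock, drop the guard,
/// and only then await.
async fn safer(tracker: Arc<Mutex<Vec<String>>>) {
    let pending = {
        let mut guard = tracker.lock().await;
        std::mem::take(&mut *guard)
    }; // guard dropped here, before any await point
    for _item in pending {
        some_async_cleanup().await;
    }
}
```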
Testing
None yet. I will need to discuss with @kim and @jsdt what the best way to test this is.
I manually caused an error using the following diff:
```diff
modified crates/durability/src/imp/local.rs
@@ -195,6 +195,14 @@ impl<T: Encode + Send + Sync + 'static> PersisterTask<T> {
         self.queue_depth.fetch_sub(1, Relaxed);
         trace!("received txdata");
+        let time = std::time::SystemTime::now()
+            .duration_since(std::time::SystemTime::UNIX_EPOCH)
+            .unwrap()
+            .as_micros();
+        if time & 0xff == 0xff {
+            panic!("Random panic!");
+        }
+
         // If we are writing one commit per tx, trying to buffer is
         // fairly pointless. Immediately flush instead.
         //
```
I then ran quickstart-chat with two clients: the normal one, and one that sends messages as fast as possible in a loop.
With both public and private databases, on both master and this PR, both clients were disconnected immediately (to human perception) when the panic triggered. I will attempt to put the same occasionally-panicking code in other places to see whether I can reproduce @kim's issue on master, and from there whether this patch fixes the issue.
I'm observing (Rust SDK) clients not being disconnected promptly
Ok, so it looks like reducer calls are accepted indefinitely, even if the connection is closed. This is probably by design, but perhaps we can error when the queue becomes too full.
In any case, it's a client issue.
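If a bounded queue turns out to be the right fix, one way to surface back-pressure as an error is sketched below using a `tokio` bounded channel; the names are placeholders and this is not the SDK's actual internals:

```rust
use tokio::sync::mpsc;

/// Placeholder for whatever the SDK actually queues per reducer call.
struct ReducerCall;

#[derive(Debug)]
enum CallError {
    QueueFull,
    Disconnected,
}

/// Sketch: instead of buffering calls indefinitely, report a full queue
/// (or a closed connection) to the caller as an error.
fn enqueue_call(tx: &mpsc::Sender<ReducerCall>, call: ReducerCall) -> Result<(), CallError> {
    match tx.try_send(call) {
        Ok(()) => Ok(()),
        Err(mpsc::error::TrySendError::Full(_)) => Err(CallError::QueueFull),
        Err(mpsc::error::TrySendError::Closed(_)) => Err(CallError::Disconnected),
    }
}
```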
> I'm observing (Rust SDK) clients not being disconnected promptly
>
> Ok, so it looks like reducer calls are accepted indefinitely, even if the connection is closed. This is probably by design, but perhaps we can error when the queue becomes too full.
That does not sound like it's by design.
Did we abandon this? @gefjon