SpacetimeDB

Introduce a `DatabaseLifecycleTracker`, which tracks database lifecycles

Open · gefjon opened this pull request 7 months ago · 3 comments

Description of Changes

Fixes #2630.

Perhaps it should be called `DatabaseLifecycleManager`?

This new object is responsible for tracking the lifecycle of a database, and for cleaning up after the database exits (see the sketch after the list below). In particular, it:

  • Unregisters the `Host` from the containing `HostController`. This was previously handled by an ad-hoc on-panic callback closure.

  • Aborts the database memory-usage metrics reporter task. This was previously handled by a `Drop` impl on `Host`.

  • Disconnects all connected WebSocket clients. Previously, this didn't happen at all, per issue #2630.
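
For a sense of the shape involved, here is a minimal sketch of such a tracker. The field names, the `u64` client keys, and the use of Tokio primitives are assumptions for illustration, not the actual SpacetimeDB types:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use tokio::task::JoinHandle;

/// Callback that unregisters the host from its controller.
type UnregisterFn = Box<dyn FnOnce() + Send>;

struct LifecycleState {
    /// Invoked once when the database exits (replaces the old on-panic closure).
    unregister: Option<UnregisterFn>,
    /// Metrics reporter task to abort on exit (replaces the old `Drop` impl).
    metrics_task: Option<JoinHandle<()>>,
    /// One shutdown signal per connected WebSocket client.
    client_disconnects: HashMap<u64, tokio::sync::oneshot::Sender<()>>,
}

#[derive(Clone)]
pub struct DatabaseLifecycleTracker {
    state: Arc<Mutex<LifecycleState>>,
}

impl DatabaseLifecycleTracker {
    /// Run all cleanup exactly once, whether the database exited cleanly or panicked.
    pub fn database_exited(&self) {
        let mut state = self.state.lock().unwrap();
        if let Some(unregister) = state.unregister.take() {
            unregister();
        }
        if let Some(task) = state.metrics_task.take() {
            task.abort();
        }
        for (_, tx) in state.client_disconnects.drain() {
            // The client actor may already be gone; ignore send errors.
            let _ = tx.send(());
        }
    }
}
```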

I've also added some commentary to the WebSocket actor loop.

Follow-up commits will add tests once I've consulted with the team about how best to test this change.

API and ABI breaking changes

N/A

Expected complexity level and risk

At least 3. Database lifecycle management is complicated, and it's possible I've either missed cleaning up some resource or cleaned one up too eagerly. There's also a lock around the `DatabaseLifecycleTracker` which is accessed from several places and at one point held across an `.await`, which could potentially introduce deadlocks.
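
For context, here is a minimal illustration of the lock-across-`.await` hazard in generic Tokio code; the names and types are made up and this is not the tracker itself:

```rust
use std::sync::Arc;
use tokio::sync::{Mutex, Notify};

// Deadlock-prone: the lock is held across the `.await`, so any other task
// that needs `shared` is blocked until this task is polled past the await.
async fn risky(shared: Arc<Mutex<Vec<u64>>>, notify: Arc<Notify>) {
    let mut guard = shared.lock().await;
    guard.push(1);
    notify.notified().await; // if the notifier also needs `shared`: deadlock
    guard.push(2);
}

// Safer: scope the guard so it is dropped before awaiting.
async fn safer(shared: Arc<Mutex<Vec<u64>>>, notify: Arc<Notify>) {
    {
        let mut guard = shared.lock().await;
        guard.push(1);
    } // guard dropped here
    notify.notified().await;
    shared.lock().await.push(2);
}
```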

Testing

None yet. I will need to discuss the best way to test this with @kim and @jsdt.


gefjon · Apr 21 '25 15:04

I manually caused an error using the following diff:

modified   crates/durability/src/imp/local.rs
@@ -195,6 +195,14 @@ impl<T: Encode + Send + Sync + 'static> PersisterTask<T> {
             self.queue_depth.fetch_sub(1, Relaxed);
             trace!("received txdata");
 
+            let time = std::time::SystemTime::now()
+                .duration_since(std::time::SystemTime::UNIX_EPOCH)
+                .unwrap()
+                .as_micros();
+            if time & 0xff == 0xff {
+                panic!("Random panic!");
+            }
+
             // If we are writing one commit per tx, trying to buffer is
             // fairly pointless. Immediately flush instead.
             //

I then ran `quickstart-chat` with two clients: the normal one, and one that sends messages as fast as possible in a loop.

In both public and private, on both master and this PR, both clients were disconnected immediately (to human perception) when the panic triggered. I will try putting the same occasionally-panicking code in other places to see whether I can reproduce @kim's issue on master, and from there whether this patch fixes it.

gefjon · Apr 22 '25 16:04

> I'm observing (Rust SDK) clients not being disconnected promptly

Ok, so it looks like reducer calls are accepted indefinitely, even if the connection is closed. This is probably by design, but perhaps we can error when the queue becomes too full.

In any case, it's a client issue.
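
As a sketch of the "error when the queue becomes too full" idea, the snippet below uses a plain bounded Tokio channel; the `ReducerCall` type, error enum, and function are hypothetical, not the SDK's actual call path:

```rust
use tokio::sync::mpsc;

/// A pending reducer call, queued for sending over the WebSocket.
#[derive(Debug)]
struct ReducerCall {
    reducer: String,
    args: Vec<u8>,
}

#[derive(Debug)]
enum EnqueueError {
    /// The send queue is full; the caller can back off or fail the call.
    QueueFull(ReducerCall),
    /// The connection is closed; no more calls will ever be sent.
    Disconnected(ReducerCall),
}

fn enqueue_reducer_call(
    tx: &mpsc::Sender<ReducerCall>,
    call: ReducerCall,
) -> Result<(), EnqueueError> {
    // `try_send` fails immediately instead of buffering without bound,
    // so a closed or backed-up connection surfaces as an error.
    tx.try_send(call).map_err(|e| match e {
        mpsc::error::TrySendError::Full(call) => EnqueueError::QueueFull(call),
        mpsc::error::TrySendError::Closed(call) => EnqueueError::Disconnected(call),
    })
}
```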

kim · Apr 24 '25 16:04

CLA assistant check
All committers have signed the CLA.

CLAassistant · May 03 '25 18:05

> > I'm observing (Rust SDK) clients not being disconnected promptly
>
> Ok, so it looks like reducer calls are accepted indefinitely, even if the connection is closed. This is probably by design, but perhaps we can error when the queue becomes too full.

That does not sound like it's by design.

gefjon · May 05 '25 16:05

Did we abandon this? @gefjon

kim · May 28 '25 08:05