Fix: DOs take 2 mins to shut down after using a TCP socket

Open ds300 opened this issue 1 month ago • 11 comments

Hello 👋🏼

Normally Durable Objects will shut down after 10 seconds of inactivity. I noticed that if I used the TCP sockets API that number jumped up to 2 minutes, both locally and in production.

Root Cause

When a socket is created, setupSocket() creates a watchForDisconnectTask that waits on connection.whenWriteDisconnected(). This task detects unexpected disconnections (network failures, or the remote peer dropping while idle). However, the whenWriteDisconnected() promise doesn't resolve until the TCP connection fully terminates at the OS level, which can take up to 2 minutes due to TCP TIME_WAIT.

When close() was called (or when the remote closed and EOF was detected), the socket resolved its closed promise but left this background task running. The pending task kept the IoContext alive, which in turn kept an actor reference alive, so the DO could not become "inactive" and start its 10-second eviction timer until the task finished.
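To make the chain of references concrete, here is a minimal toy model in TypeScript. It is not workerd code: ContextModel, SocketModel, and every method on them are hypothetical stand-ins, and the real logic lives in workerd's C++ socket implementation. It only illustrates how one pending background task can stop the eviction timer from starting even after the closed promise has resolved.

```typescript
// Toy model of the pre-fix behaviour. All names are hypothetical.
class ContextModel {
  private pendingTasks = new Set<string>();

  addTask(name: string): void {
    this.pendingTasks.add(name);
  }

  finishTask(name: string): void {
    this.pendingTasks.delete(name);
  }

  // The eviction timer may only start once nothing pins the context.
  get canStartEvictionTimer(): boolean {
    return this.pendingTasks.size === 0;
  }
}

class SocketModel {
  closedResolved = false;
  private ctx: ContextModel;

  constructor(ctx: ContextModel) {
    this.ctx = ctx;
    // Stands in for the watchForDisconnectTask created by setupSocket().
    this.ctx.addTask("watchForDisconnect");
  }

  // Pre-fix close(): resolves `closed` but leaves the watcher running.
  closeBeforeFix(): void {
    this.closedResolved = true;
  }

  // Stands in for the OS-level teardown finally completing.
  osLevelDisconnect(): void {
    this.ctx.finishTask("watchForDisconnect");
  }
}

const ctx = new ContextModel();
const socket = new SocketModel(ctx);

socket.closeBeforeFix();
const idleAfterClose = ctx.canStartEvictionTimer; // watcher still pending

socket.osLevelDisconnect();
const idleAfterTeardown = ctx.canStartEvictionTimer; // watcher finished
```

In this model the context only becomes idle after osLevelDisconnect(), which mirrors the delay described above.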

The Fix

Cancel watchForDisconnectTask when the socket is closed, either:

  • Explicitly via close()
  • Automatically via maybeCloseWriteSide() when remote closes and allowHalfOpen is false

The task was already designed to handle cancellation gracefully (there's a kj::defer that fulfills the disconnect promise with a "cancelled" flag, and the downstream handler ignores cancelled notifications). This is the same cleanup path that runs when the Socket is garbage collected.
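The cancellation path can be sketched with another toy model in TypeScript. Again, this is an illustration under stated assumptions rather than the actual patch: WatcherTask, SocketModel, DisconnectNotice, and simulateDrop() are hypothetical names, and the cancel() method only imitates the effect of the kj::defer described above (settling the disconnect notification with a "cancelled" flag that the downstream handler ignores).

```typescript
// Toy model of the fix. All names are hypothetical, not workerd APIs.
type DisconnectNotice = { cancelled: boolean };

class WatcherTask {
  private done = false;
  private onSettle: (notice: DisconnectNotice) => void;

  constructor(onSettle: (notice: DisconnectNotice) => void) {
    this.onSettle = onSettle;
  }

  get pending(): boolean {
    return !this.done;
  }

  // A real disconnect was observed.
  disconnect(): void {
    this.settle({ cancelled: false });
  }

  // Cancellation still settles the notification, flagged as cancelled.
  cancel(): void {
    this.settle({ cancelled: true });
  }

  private settle(notice: DisconnectNotice): void {
    if (this.done) return; // settle at most once
    this.done = true;
    this.onSettle(notice);
  }
}

class SocketModel {
  readonly events: string[] = [];
  private watcher: WatcherTask;
  private allowHalfOpen: boolean;

  constructor(allowHalfOpen: boolean = false) {
    this.allowHalfOpen = allowHalfOpen;
    this.watcher = new WatcherTask((notice) => {
      // Downstream handler: cancelled notifications are ignored.
      if (!notice.cancelled) this.events.push("unexpected-disconnect");
    });
  }

  // While the watcher is pending, it pins the (imaginary) IoContext.
  get keepsContextAlive(): boolean {
    return this.watcher.pending;
  }

  // Explicit close: cancel the watcher.
  close(): void {
    this.events.push("closed");
    this.watcher.cancel();
  }

  // Remote EOF: cancel the watcher only when half-open is not allowed.
  onReadEof(): void {
    this.events.push("eof");
    if (!this.allowHalfOpen) this.watcher.cancel();
  }

  // Hypothetical hook simulating an unexpected drop on an open socket.
  simulateDrop(): void {
    this.watcher.disconnect();
  }
}

const explicitClose = new SocketModel();
explicitClose.close();

const remoteClose = new SocketModel();
remoteClose.onReadEof();

const halfOpen = new SocketModel(true);
halfOpen.onReadEof();

const dropped = new SocketModel();
dropped.simulateDrop();
```

In the model, both close paths release the watcher without reporting a disconnect, a half-open socket keeps its watcher, and a socket that was never closed still reports an unexpected drop.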

What This Doesn't Affect

  • Remote close detection still works (handled via read stream EOF)
  • The closed promise behavior is unchanged
  • TCP still goes through its proper shutdown sequence; we're just not waiting for OS-level confirmation
  • Sockets that aren't explicitly closed still have the disconnect watcher active for detecting unexpected drops

Testing

  • Manually verified: DO with socket now evicts in ~10 seconds after close() instead of ~2 minutes
  • Manually verified: Remote close (killing the peer while reading) also allows normal eviction timing
  • Existing socket tests should pass (no API behavior changes)

ds300, Dec 05 '25 16:12