
fix(cluster): use lazy connections to prevent unresponsive cluster on node power-off

Open tennisleng opened this issue 3 weeks ago • 0 comments

Description

This PR fixes issue #1001, where the cluster becomes unresponsive when a node is abruptly powered off (e.g., by cutting the power), even though gracefully killing the process is handled correctly.

Root Cause Analysis

The gRPC client was using connect().await, which blocks until a connection is established (a minimal sketch follows the list below). When a node is abruptly powered off:

  • TCP connections are never properly closed (no FIN/RST packets are sent)
  • connect().await blocks until the OS-level TCP timeout expires (which can take 2+ minutes)
  • All cluster operations hang while waiting for the unreachable node
  • The Console Web UI becomes unresponsive
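
For illustration, here is a minimal sketch of the old eager pattern, assuming a tonic-based client like node_service_time_out_client; the function name and the 60s timeout are illustrative, not the exact code in this repo:

```rust
use std::time::Duration;
use tonic::transport::{Channel, Endpoint, Error};

// Hypothetical eager-connect helper mirroring the previous behaviour.
async fn eager_connect(addr: String) -> Result<Channel, Error> {
    let endpoint = Endpoint::from_shared(addr)?
        .timeout(Duration::from_secs(60));
    // connect() resolves only after the TCP/HTTP2 handshake succeeds.
    // If the peer was powered off without sending FIN/RST, this await
    // can hang until the OS-level TCP timeout fires, stalling every
    // caller that needs the channel.
    endpoint.connect().await
}
```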

In contrast, when a process is gracefully killed:

  • TCP FIN packets are sent properly
  • Connections are cleanly closed
  • Other nodes quickly detect the failure and continue operating

Changes

Core Fix

Changed from connect().await to connect_lazy() in node_service_time_out_client (a sketch follows the list below):

  • connect_lazy() returns immediately without establishing a connection
  • Connection is established lazily on first request
  • Tonic's lazy channel handles automatic reconnection when nodes come back online
  • Requests to unreachable nodes fail quickly with timeout errors instead of blocking indefinitely
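
A hedged sketch of the lazy variant; names are illustrative, and the 5-second connect_timeout is an added assumption rather than something this PR sets (see crates/protos/src/lib.rs for the real change):

```rust
use std::time::Duration;
use tonic::transport::{Channel, Endpoint, Error};

// Illustrative lazy-connect helper; the actual node_service_time_out_client
// wiring may differ.
fn lazy_channel(addr: String) -> Result<Channel, Error> {
    let endpoint = Endpoint::from_shared(addr)?
        // Per-request timeout so calls to an unreachable node fail fast
        // instead of hanging behind a blocked connect.
        .timeout(Duration::from_secs(30))
        // Assumption: also bound the handshake itself.
        .connect_timeout(Duration::from_secs(5));
    // connect_lazy() returns a Channel immediately; the connection is
    // established on the first RPC and re-established automatically
    // when the node comes back online.
    Ok(endpoint.connect_lazy())
}
```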

Additional Improvements

  • Reduced default request timeout from 60s to 30s for faster failure detection
  • Added clear_connection() and clear_all_connections() helper functions to allow manual clearing of potentially stale connections from the cache (see the sketch below)
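
Purely as an illustration of the helper shape, assuming the cache is a global map from node address to Channel; the actual implementation in crates/common/src/globals.rs may differ:

```rust
use std::collections::HashMap;
use std::sync::{LazyLock, RwLock};
use tonic::transport::Channel;

// Assumed global cache of per-node channels (illustrative only).
static CONNECTION_CACHE: LazyLock<RwLock<HashMap<String, Channel>>> =
    LazyLock::new(|| RwLock::new(HashMap::new()));

/// Drop the cached channel for one node so the next request builds a
/// fresh (lazy) connection instead of reusing a possibly stale one.
pub fn clear_connection(addr: &str) {
    CONNECTION_CACHE.write().unwrap().remove(addr);
}

/// Drop every cached channel, e.g. after a cluster-wide topology change.
pub fn clear_all_connections() {
    CONNECTION_CACHE.write().unwrap().clear();
}
```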

Files Changed

  • crates/protos/src/lib.rs: Changed connection strategy from eager to lazy
  • crates/common/src/globals.rs: Added helper functions for connection cache management

Testing

To verify this fix:

  1. Deploy a 4-node rustfs cluster
  2. Verify the cluster is healthy and file uploads work
  3. Abruptly power off one node
  4. Verify the application can still upload files (errors may appear for that specific node, but the cluster remains responsive)
  5. Verify the Console Web UI remains responsive
  6. Power the node back on and verify it rejoins the cluster

Fixes: #1001

tennisleng · Dec 07 '25 23:12