
fix(cluster): use lazy connections to prevent unresponsive cluster on node power-off

Open tennisleng opened this issue 3 weeks ago • 0 comments

Description

This PR fixes issue #1001, where the cluster becomes unresponsive when a node is abruptly powered off (e.g., by cutting the power), even though gracefully killing the process is handled correctly.

Root Cause Analysis

The gRPC client was using connect().await, which blocks until a connection is established (a minimal sketch follows the list below). When a node is abruptly powered off:

  • TCP connections are never properly closed (no FIN/RST packets are sent)
  • connect().await blocks until the OS-level TCP timeout expires (which can take 2+ minutes)
  • All cluster operations hang while waiting for the unreachable node
  • The Console Web UI becomes unresponsive
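
For illustration, here is a minimal sketch of the old eager pattern, assuming a tonic-based client like node_service_time_out_client; the function name and the 60s timeout are illustrative, not the exact code in this repo:

```rust
use std::time::Duration;
use tonic::transport::{Channel, Endpoint, Error};

// Hypothetical eager-connect helper mirroring the previous behaviour.
async fn eager_connect(addr: String) -> Result<Channel, Error> {
    let endpoint = Endpoint::from_shared(addr)?
        .timeout(Duration::from_secs(60));
    // connect() resolves only after the TCP/HTTP2 handshake succeeds.
    // If the peer was powered off without sending FIN/RST, this await
    // can hang until the OS-level TCP timeout fires, stalling every
    // caller that needs the channel.
    endpoint.connect().await
}
```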

In contrast, when a process is gracefully killed:

  • TCP FIN packets are sent properly
  • Connections are cleanly closed
  • Other nodes quickly detect the failure and continue operating

Changes

Core Fix

Changed from connect().await to connect_lazy() in node_service_time_out_client (a sketch follows the list below):

  • connect_lazy() returns immediately without establishing a connection
  • Connection is established lazily on first request
  • Tonic's lazy channel handles automatic reconnection when nodes come back online
  • Requests to unreachable nodes fail quickly with timeout errors instead of blocking indefinitely
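
A hedged sketch of the lazy variant; names are illustrative, and the 5-second connect_timeout is an added assumption rather than something this PR sets (see crates/protos/src/lib.rs for the real change):

```rust
use std::time::Duration;
use tonic::transport::{Channel, Endpoint, Error};

// Illustrative lazy-connect helper; the actual node_service_time_out_client
// wiring may differ.
fn lazy_channel(addr: String) -> Result<Channel, Error> {
    let endpoint = Endpoint::from_shared(addr)?
        // Per-request timeout so calls to an unreachable node fail fast
        // instead of hanging behind a blocked connect.
        .timeout(Duration::from_secs(30))
        // Assumption: also bound the handshake itself.
        .connect_timeout(Duration::from_secs(5));
    // connect_lazy() returns a Channel immediately; the connection is
    // established on the first RPC and re-established automatically
    // when the node comes back online.
    Ok(endpoint.connect_lazy())
}
```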

Additional Improvements

  • Reduced default request timeout from 60s to 30s for faster failure detection
  • Added clear_connection() and clear_all_connections() helper functions to allow manual clearing of potentially stale connections from the cache (see the sketch below)
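
Purely as an illustration of the helper shape, assuming the cache is a global map from node address to Channel; the actual implementation in crates/common/src/globals.rs may differ:

```rust
use std::collections::HashMap;
use std::sync::{LazyLock, RwLock};
use tonic::transport::Channel;

// Assumed global cache of per-node channels (illustrative only).
static CONNECTION_CACHE: LazyLock<RwLock<HashMap<String, Channel>>> =
    LazyLock::new(|| RwLock::new(HashMap::new()));

/// Drop the cached channel for one node so the next request builds a
/// fresh (lazy) connection instead of reusing a possibly stale one.
pub fn clear_connection(addr: &str) {
    CONNECTION_CACHE.write().unwrap().remove(addr);
}

/// Drop every cached channel, e.g. after a cluster-wide topology change.
pub fn clear_all_connections() {
    CONNECTION_CACHE.write().unwrap().clear();
}
```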

Files Changed

  • crates/protos/src/lib.rs: Changed connection strategy from eager to lazy
  • crates/common/src/globals.rs: Added helper functions for connection cache management

Testing

To verify this fix:

  1. Deploy a 4-node rustfs cluster
  2. Verify the cluster is healthy and file uploads work
  3. Abruptly power off one node
  4. Verify the application can still upload files (errors may appear for that specific node, but the cluster remains responsive)
  5. Verify the Console Web UI remains responsive
  6. Power the node back on and verify it rejoins the cluster

Fixes: #1001

tennisleng · Dec 07 '25 23:12