fix(cluster): use lazy connections to prevent unresponsive cluster on node power-off
Description
This PR fixes issue #1001 where the cluster becomes unresponsive when a node is abruptly powered off (e.g., by cutting the power), while gracefully killing the process works correctly.
Root Cause Analysis
The gRPC client used connect().await, which blocks until a connection is established. When a node is abruptly powered off:
- TCP connections don't properly close (no 'goodbye' messages sent)
- connect().await blocks waiting for OS TCP timeouts (can be 2+ minutes)
- All cluster operations hang while waiting for the unreachable node
- The Console Web UI becomes unresponsive
In contrast, when a process is gracefully killed:
- TCP FIN packets are sent properly
- Connections are cleanly closed
- Other nodes quickly detect the failure and continue operating
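The difference between the two failure modes can be reproduced with plain TCP from the standard library. The sketch below is illustrative only and is not code from this PR: probe, Probe, and closed_loopback_addr are hypothetical names. It assumes that a host whose process has exited answers new connection attempts with a reset, so the attempt fails at once, whereas a powered-off host answers nothing and the caller waits until its own deadline (or the OS timeout if no deadline is set).

```rust
use std::io::ErrorKind;
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::time::{Duration, Instant};

/// Outcome of a bounded connection attempt (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Probe {
    Connected,
    /// The peer's OS replied with a reset: the host is up, nothing is listening.
    Refused,
    /// Nothing replied before the deadline: the host may be powered off.
    TimedOut,
    /// Any other I/O error, kept as text.
    Other(String),
}

/// Tries to connect to `addr`, waiting at most `deadline`.
/// Returns the outcome and how long the attempt took.
pub fn probe(addr: SocketAddr, deadline: Duration) -> (Probe, Duration) {
    let start = Instant::now();
    let outcome = match TcpStream::connect_timeout(&addr, deadline) {
        Ok(_) => Probe::Connected,
        Err(e) => match e.kind() {
            ErrorKind::ConnectionRefused => Probe::Refused,
            ErrorKind::TimedOut => Probe::TimedOut,
            _ => Probe::Other(e.to_string()),
        },
    };
    (outcome, start.elapsed())
}

/// Binds an ephemeral loopback port and releases it right away,
/// yielding an address where no process is listening.
pub fn closed_loopback_addr() -> std::io::Result<SocketAddr> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    listener.local_addr()
    // `listener` is dropped here, closing the port.
}
```

On a loopback port with no listener, probe typically returns Probe::Refused almost immediately. Against a powered-off node it would be expected to return Probe::TimedOut only after the full deadline, which is the kind of wait that connect().await inherited from the OS.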
Changes
Core Fix
Changed from connect().await to connect_lazy() in node_service_time_out_client:
- connect_lazy() returns immediately without establishing a connection
- Connection is established lazily on first request
- Tonic's lazy channel handles automatic reconnection when nodes come back online
- Requests to unreachable nodes fail quickly with timeout errors instead of blocking indefinitely
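The behaviour described above can be modelled without tonic. The following is a minimal std-only sketch of the lazy pattern, not tonic's implementation and not the code in node_service_time_out_client: LazyChannel, Conn, and the dial closure are hypothetical names. It shows construction that never dials, a first request that triggers the dial, a failed dial that surfaces as an error for that request only, and a retry on the next request.

```rust
use std::time::Duration;

/// Hypothetical stand-in for a live connection; not a tonic type.
#[derive(Debug)]
pub struct Conn {
    pub addr: String,
}

/// A channel that remembers where to connect but dials only when needed.
pub struct LazyChannel<F>
where
    F: FnMut(&str, Duration) -> Result<Conn, String>,
{
    addr: String,
    timeout: Duration,
    dial: F,
    conn: Option<Conn>,
}

impl<F> LazyChannel<F>
where
    F: FnMut(&str, Duration) -> Result<Conn, String>,
{
    /// Returns immediately; `dial` is not called here.
    pub fn new(addr: &str, timeout: Duration, dial: F) -> Self {
        Self {
            addr: addr.to_string(),
            timeout,
            dial,
            conn: None,
        }
    }

    /// True once a dial has succeeded.
    pub fn is_connected(&self) -> bool {
        self.conn.is_some()
    }

    /// Dials if there is no connection yet, then serves the request.
    pub fn request(&mut self, payload: &str) -> Result<String, String> {
        if self.conn.is_none() {
            // A failed dial is reported for this request only;
            // `conn` stays empty, so the next request dials again.
            let conn = (self.dial)(&self.addr, self.timeout)?;
            self.conn = Some(conn);
        }
        let conn = self.conn.as_ref().expect("connection was just established");
        Ok(format!("{} handled {}", conn.addr, payload))
    }
}
```

In this model the cost of an unreachable node is paid by the request that needs it, bounded by the timeout handed to dial, rather than by whoever constructs the client.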
Additional Improvements
- Reduced default request timeout from 60s to 30s for faster failure detection
- Added clear_connection() and clear_all_connections() helper functions to allow manual clearing of potentially stale connections from the cache
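A connection cache with these two helpers might look like the sketch below. Only the names clear_connection() and clear_all_connections() come from this PR; the cache layout, the CachedChannel type, the return values, and the cache_connection and cached_count helpers are assumptions made for illustration, and the real helpers in crates/common/src/globals.rs may be async and differ in signature.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

/// Hypothetical stand-in for a cached gRPC channel.
#[derive(Clone, Debug, PartialEq)]
pub struct CachedChannel {
    pub addr: String,
}

/// Process-wide cache keyed by node address (assumed layout).
fn conn_cache() -> &'static Mutex<HashMap<String, CachedChannel>> {
    static CACHE: OnceLock<Mutex<HashMap<String, CachedChannel>>> = OnceLock::new();
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Stores a channel for `addr`, replacing any previous entry (illustrative helper).
pub fn cache_connection(addr: &str) {
    let channel = CachedChannel { addr: addr.to_string() };
    conn_cache().lock().unwrap().insert(addr.to_string(), channel);
}

/// Drops the cached channel for one node.
/// Returns true if an entry was present.
pub fn clear_connection(addr: &str) -> bool {
    conn_cache().lock().unwrap().remove(addr).is_some()
}

/// Drops every cached channel and returns how many were removed.
pub fn clear_all_connections() -> usize {
    let mut cache = conn_cache().lock().unwrap();
    let removed = cache.len();
    cache.clear();
    removed
}

/// Number of channels currently cached (illustrative helper).
pub fn cached_count() -> usize {
    conn_cache().lock().unwrap().len()
}
```

A caller that sees repeated timeouts for a node could call clear_connection with that node's address so that the next request builds a fresh channel instead of reusing a possibly stale one.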
Files Changed
- crates/protos/src/lib.rs: Changed connection strategy from eager to lazy
- crates/common/src/globals.rs: Added helper functions for connection cache management
Testing
This fix should be tested as follows:
- Deploy a 4-node rustfs cluster
- Verify the cluster is healthy and file uploads work
- Abruptly power off one node
- Verify the application can still upload files (may see errors for that specific node, but cluster remains responsive)
- Verify the Console Web UI remains responsive
- Power the node back on and verify it rejoins the cluster
Fixes: #1001