Add Disk Timeout and Health Check Functionality
Summary
This PR introduces comprehensive disk timeout and health check functionality to improve system reliability and fault tolerance. The implementation adds active monitoring, configurable timeouts, and automatic fault detection for disk operations.
Type of Change
- [ ] New Feature
- [x] Bug Fix
- [ ] Documentation
- [x] Performance Improvement
- [ ] Test/CI
- [x] Refactor
- [ ] Other:
Related Issues
#1001
Key Features
Disk Health Monitoring
- Active Disk Monitoring: Continuous health checks with configurable intervals (default: 15 seconds)
- Health Status Tracking: Atomic tracking of disk health states (OK/FAULTY)
- Automatic Recovery: Faulty disks can be marked as healthy upon successful operations
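The health-state tracking described above could look roughly like the following. This is a minimal sketch, not the code in this PR: the type `HealthTracker`, its fields, and its method names are hypothetical, and timestamps are passed in as plain milliseconds to keep the example deterministic.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};
use std::time::Duration;

const DISK_OK: u32 = 0;
const DISK_FAULTY: u32 = 1;

/// Hypothetical tracker; names are illustrative, not taken from the PR.
pub struct HealthTracker {
    status: AtomicU32,
    /// Milliseconds timestamp of the last successful operation (0 = never).
    last_success_ms: AtomicU64,
}

impl HealthTracker {
    pub fn new() -> Self {
        Self {
            status: AtomicU32::new(DISK_OK),
            last_success_ms: AtomicU64::new(0),
        }
    }

    pub fn is_faulty(&self) -> bool {
        self.status.load(Ordering::Acquire) == DISK_FAULTY
    }

    /// Returns true only for the caller that performed the OK -> FAULTY
    /// transition, so a fault is reported once even under contention.
    pub fn mark_faulty(&self) -> bool {
        self.status
            .compare_exchange(DISK_OK, DISK_FAULTY, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    /// A successful operation records its time and marks the disk healthy again.
    pub fn record_success(&self, now_ms: u64) {
        self.last_success_ms.store(now_ms, Ordering::Release);
        self.status.store(DISK_OK, Ordering::Release);
    }

    /// True when a success was recorded within `window`, in which case an
    /// active health check can be skipped.
    pub fn should_skip_check(&self, now_ms: u64, window: Duration) -> bool {
        let last = self.last_success_ms.load(Ordering::Acquire);
        last != 0 && now_ms.saturating_sub(last) < window.as_millis() as u64
    }
}
```

Using `compare_exchange` for the fault transition means only one task logs or reacts to a newly faulty disk, while any later success restores the OK state without taking a lock.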
Timeout Management
- Configurable Timeouts: Environment variable `RUSTFS_DRIVE_MAX_TIMEOUT_DURATION` for maximum timeout duration
- Operation-Level Timeouts: Individual timeouts for read, write, and health check operations
- Graceful Degradation: Timeout failures are logged and handled gracefully without crashing
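As an illustration of the timeout handling listed above, here is a minimal sketch using only the standard library. Several things are assumptions rather than facts from this PR: the variable is parsed as whole seconds, the fallback default is chosen by the caller, and the helpers `parse_timeout`, `max_timeout_from_env`, and `run_with_timeout` are hypothetical names. An async codebase would more likely wrap futures with its runtime's timeout facility; plain threads are used here only to keep the example dependency-free.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Parse a raw value as whole seconds (an assumed unit), falling back to
/// `default` when the value is missing, malformed, or zero.
fn parse_timeout(raw: Option<&str>, default: Duration) -> Duration {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|secs| *secs > 0)
        .map(Duration::from_secs)
        .unwrap_or(default)
}

/// Read the maximum timeout from the environment.
fn max_timeout_from_env(default: Duration) -> Duration {
    let raw = std::env::var("RUSTFS_DRIVE_MAX_TIMEOUT_DURATION").ok();
    parse_timeout(raw.as_deref(), default)
}

#[derive(Debug, PartialEq)]
enum OpError {
    /// The operation did not finish within the limit.
    Timeout,
    /// The worker ended without producing a result (for example, it panicked).
    WorkerFailed,
}

/// Run a blocking operation on a worker thread and give up after `limit`.
/// The worker is not killed on timeout; its late result is simply dropped.
fn run_with_timeout<T, F>(limit: Duration, op: F) -> Result<T, OpError>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore that.
        let _ = tx.send(op());
    });
    rx.recv_timeout(limit).map_err(|e| match e {
        RecvTimeoutError::Timeout => OpError::Timeout,
        RecvTimeoutError::Disconnected => OpError::WorkerFailed,
    })
}
```

Returning an error value instead of panicking is what lets the caller log the timeout and mark the disk faulty rather than crash.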
Network Operations
- TCP Connection Timeouts: 5-second timeout for TCP connectivity checks
- Remote Disk Operations: Timeout protection for peer-to-peer disk operations
- S3 Client Timeouts: Timeout handling for S3-compatible peer client operations
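A TCP connectivity check with a 5-second limit can be sketched with the standard library's `TcpStream::connect_timeout`. The function name `peer_reachable` and the constant name are hypothetical, and the PR's own check may be built on an async runtime instead.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// The 5-second limit mentioned in the list above.
const TCP_CHECK_TIMEOUT: Duration = Duration::from_secs(5);

/// Returns true if a TCP connection to `addr` is established within `timeout`.
/// The stream is dropped immediately; only reachability is tested.
fn peer_reachable(addr: &SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(addr, timeout).is_ok()
}
```

A caller would invoke `peer_reachable(&addr, TCP_CHECK_TIMEOUT)` before attempting a remote disk operation, treating `false` as a reason to skip or fail fast.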
Technical Implementation
New Components
- `DiskHealthTracker`: Thread-safe disk health state management
- `LocalDiskWrapper`: Wrapper for local disk with health monitoring
- Enhanced timeout utilities in remote disk and peer client modules
Configuration
- `RUSTFS_DRIVE_ACTIVE_MONITORING`: Enable/disable active monitoring (default: enabled)
- `RUSTFS_DRIVE_MAX_TIMEOUT_DURATION`: Maximum allowed timeout duration
- Health check intervals: 15 seconds with 5-second skip window for recent successes
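For illustration, reading the monitoring switch could be sketched as below. The accepted spellings (`1`/`true`/`on`/`yes` and their opposites) are an assumption, since this description does not list the values the variable accepts, and `parse_switch` and `active_monitoring_enabled` are hypothetical helper names. Only the default of "enabled" comes from the list above.

```rust
/// Interpret an optional raw value as an on/off switch.
/// Unset or unrecognized values keep the default.
fn parse_switch(raw: Option<&str>, default: bool) -> bool {
    match raw.map(|s| s.trim().to_ascii_lowercase()).as_deref() {
        Some("1") | Some("true") | Some("on") | Some("yes") => true,
        Some("0") | Some("false") | Some("off") | Some("no") => false,
        _ => default,
    }
}

/// Active monitoring is enabled unless the variable explicitly turns it off.
fn active_monitoring_enabled() -> bool {
    let raw = std::env::var("RUSTFS_DRIVE_ACTIVE_MONITORING").ok();
    parse_switch(raw.as_deref(), true)
}
```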
Files Modified
- Added: `crates/ecstore/src/disk/disk_store.rs` (769 lines)
- Modified: Remote disk client, peer S3 client, and disk management modules
- Updated: Configuration handling and timeout utilities
Benefits
- Improved Reliability: Automatic detection and handling of faulty disks
- Better Performance: Prevents hanging operations with configurable timeouts
- Enhanced Observability: Detailed logging for timeout and health check events
- Fault Tolerance: Graceful handling of network and disk failures
- Configurable Behavior: Environment-based configuration for different deployment scenarios
Testing
The implementation includes comprehensive error handling and logging. Health checks run continuously in background tasks with proper cancellation support for clean shutdowns.
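The background health check loop with cancellation can be sketched as follows. This is an assumption-laden illustration: `Monitor` and `spawn_monitor` are hypothetical names, and a plain thread with a channel stands in for whatever async task and cancellation mechanism the PR actually uses. The channel's `recv_timeout` serves as both the interval timer and the stop signal.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Handle to a running background check loop (hypothetical type).
struct Monitor {
    stop: Sender<()>,
    handle: JoinHandle<()>,
}

/// Run `check` every `interval` until a stop message arrives or the
/// sender is dropped.
fn spawn_monitor<F>(interval: Duration, mut check: F) -> Monitor
where
    F: FnMut() + Send + 'static,
{
    let (stop, stop_rx) = mpsc::channel::<()>();
    let handle = thread::spawn(move || loop {
        match stop_rx.recv_timeout(interval) {
            // No stop signal within the interval: time for the next check.
            Err(RecvTimeoutError::Timeout) => check(),
            // Explicit stop, or the owner went away: exit cleanly.
            Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
        }
    });
    Monitor { stop, handle }
}

impl Monitor {
    /// Signal the loop to stop and wait for it to finish.
    fn shutdown(self) {
        let _ = self.stop.send(());
        let _ = self.handle.join();
    }
}
```

Because the loop waits on the stop channel rather than sleeping, `shutdown` returns as soon as any in-flight check completes instead of waiting out the full interval.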
Breaking Changes
None. All new functionality is additive and backward-compatible.
Related Issues
Addresses disk reliability concerns and timeout handling requirements in distributed storage systems.