Add Disk Timeout and Health Check Functionality

Open • weisd opened this issue 1 week ago • 1 comment

Summary

This PR adds disk timeout handling and health checks to improve system reliability and fault tolerance. It introduces active health monitoring, configurable timeouts, and automatic detection of faulty disks.

Type of Change

[ ] New Feature
[x] Bug Fix
[ ] Documentation
[x] Performance Improvement
[ ] Test/CI
[x] Refactor
[ ] Other:

Related Issues

#1001

Key Features

Disk Health Monitoring

Active Disk Monitoring: Continuous health checks with configurable intervals (default: 15 seconds)
Health Status Tracking: Atomic tracking of disk health states (OK/FAULTY)
Automatic Recovery: Faulty disks can be marked as healthy upon successful operations
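
The state tracking and recovery described above could be sketched as follows. This is a minimal illustration, not the PR's code: the type name `HealthState`, the numeric constants, and the method names are assumptions made for the example. The PR only states that health is tracked atomically as OK/FAULTY and that a successful operation can clear the faulty state.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical encodings; the PR names the states but not their values.
const DISK_HEALTH_OK: u32 = 0;
const DISK_HEALTH_FAULTY: u32 = 1;

/// Illustrative tracker (assumed name), safe to share across threads.
#[derive(Debug, Default)]
pub struct HealthState {
    status: AtomicU32,
}

impl HealthState {
    pub fn new() -> Self {
        Self { status: AtomicU32::new(DISK_HEALTH_OK) }
    }

    pub fn is_faulty(&self) -> bool {
        self.status.load(Ordering::Acquire) == DISK_HEALTH_FAULTY
    }

    /// Returns true only for the caller that performed the OK -> FAULTY
    /// transition, so follow-up work (e.g. logging) happens once.
    pub fn mark_faulty(&self) -> bool {
        self.status
            .compare_exchange(
                DISK_HEALTH_OK,
                DISK_HEALTH_FAULTY,
                Ordering::AcqRel,
                Ordering::Acquire,
            )
            .is_ok()
    }

    /// Records a successful operation, which clears the faulty state.
    pub fn record_success(&self) {
        self.status.store(DISK_HEALTH_OK, Ordering::Release);
    }
}
```

Using a compare-and-exchange for the faulty transition means that concurrent failures on the same disk report the state change exactly once.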

Timeout Management

Configurable Timeouts: The RUSTFS_DRIVE_MAX_TIMEOUT_DURATION environment variable sets the maximum timeout duration
Operation-Level Timeouts: Individual timeouts for read, write, and health check operations
Graceful Degradation: Timeout failures are logged and handled gracefully without crashing
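
One way to picture an operation-level timeout that fails gracefully instead of hanging is the std-only sketch below. The PR does not show how its timeouts are applied, and an async codebase would more likely use its runtime's timeout facility; the names `run_with_timeout` and `OpError` are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq)]
pub enum OpError {
    /// The operation did not finish before the deadline.
    Timeout,
    /// The worker ended without producing a value (e.g. it panicked).
    Aborted,
}

/// Runs `op` on a worker thread and waits at most `timeout` for its result.
/// Note: on timeout the worker is not killed; it keeps running in the
/// background and its result is discarded.
pub fn run_with_timeout<T, F>(timeout: Duration, op: F) -> Result<T, OpError>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore that error.
        let _ = tx.send(op());
    });
    match rx.recv_timeout(timeout) {
        Ok(value) => Ok(value),
        Err(mpsc::RecvTimeoutError::Timeout) => Err(OpError::Timeout),
        Err(mpsc::RecvTimeoutError::Disconnected) => Err(OpError::Aborted),
    }
}
```

A caller would match on `Err(OpError::Timeout)`, log the event, and return an error to its own caller rather than panicking.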

Network Operations

TCP Connection Timeouts: 5-second timeout for TCP connectivity checks
Remote Disk Operations: Timeout protection for peer-to-peer disk operations
S3 Client Timeouts: Timeout handling for S3-compatible peer client operations
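
A TCP connectivity check with the 5-second timeout mentioned above might look like this. The helper name `check_tcp_reachable` is an assumption; only `TcpStream::connect_timeout` from the standard library is used, and the PR's real check may be asynchronous.

```rust
use std::io;
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// The PR states a 5-second timeout for TCP connectivity checks.
const TCP_CHECK_TIMEOUT: Duration = Duration::from_secs(5);

/// Hypothetical helper: returns Ok(()) if a TCP connection to `addr`
/// can be established within the timeout, and the I/O error otherwise.
pub fn check_tcp_reachable(addr: &SocketAddr) -> io::Result<()> {
    // connect_timeout fails with ErrorKind::TimedOut if the deadline passes.
    let stream = TcpStream::connect_timeout(addr, TCP_CHECK_TIMEOUT)?;
    // Only reachability matters here, so close the connection right away.
    drop(stream);
    Ok(())
}
```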

Technical Implementation

New Components

DiskHealthTracker: Thread-safe disk health state management
LocalDiskWrapper: Wrapper for local disk with health monitoring
Enhanced timeout utilities in remote disk and peer client modules
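
To illustrate the wrapper idea, here is a rough sketch of a disk wrapper that updates a health flag on every call. Everything in it is assumed for the example: the `Disk` trait, `HealthCheckedDisk`, and `StubDisk` are hypothetical, and the rule that only timed-out errors mark a disk faulty is a guess, not something the PR states.

```rust
use std::io;
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical minimal disk interface, for illustration only.
pub trait Disk {
    fn read_all(&self, path: &str) -> io::Result<Vec<u8>>;
}

/// Wraps any `Disk` and records health based on each call's outcome.
pub struct HealthCheckedDisk<D: Disk> {
    inner: D,
    faulty: AtomicBool,
}

impl<D: Disk> HealthCheckedDisk<D> {
    pub fn new(inner: D) -> Self {
        Self { inner, faulty: AtomicBool::new(false) }
    }

    pub fn inner(&self) -> &D {
        &self.inner
    }

    pub fn is_faulty(&self) -> bool {
        self.faulty.load(Ordering::Acquire)
    }

    /// Assumption: success clears the flag, a timeout sets it, and other
    /// errors (such as "not found") leave the health state unchanged.
    fn track<T>(&self, result: io::Result<T>) -> io::Result<T> {
        match &result {
            Ok(_) => self.faulty.store(false, Ordering::Release),
            Err(e) if e.kind() == io::ErrorKind::TimedOut => {
                self.faulty.store(true, Ordering::Release)
            }
            Err(_) => {}
        }
        result
    }
}

impl<D: Disk> Disk for HealthCheckedDisk<D> {
    fn read_all(&self, path: &str) -> io::Result<Vec<u8>> {
        self.track(self.inner.read_all(path))
    }
}

/// Test double that reports a timeout while `stall` is true.
pub struct StubDisk {
    pub stall: AtomicBool,
}

impl Disk for StubDisk {
    fn read_all(&self, _path: &str) -> io::Result<Vec<u8>> {
        if self.stall.load(Ordering::Acquire) {
            Err(io::Error::new(io::ErrorKind::TimedOut, "simulated stall"))
        } else {
            Ok(b"data".to_vec())
        }
    }
}
```

Because the wrapper implements the same trait as the disk it wraps, callers would not need to change; they only gain the ability to query `is_faulty()`.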

Configuration

RUSTFS_DRIVE_ACTIVE_MONITORING: Enable/disable active monitoring (default: enabled)
RUSTFS_DRIVE_MAX_TIMEOUT_DURATION: Maximum allowed timeout duration
Health check interval: 15 seconds; a check is skipped if the disk had a successful operation within the last 5 seconds
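
The configuration above could be read roughly as in the sketch below. The accepted values are assumptions, since the PR does not specify them: the monitoring flag is treated as enabled unless set to a common "off" spelling, and the maximum timeout is assumed to be a whole number of seconds with a caller-supplied default. The function names are hypothetical.

```rust
use std::time::{Duration, Instant};

/// Values stated in the PR description.
pub const CHECK_INTERVAL: Duration = Duration::from_secs(15);
pub const SKIP_IF_SUCCESS_WITHIN: Duration = Duration::from_secs(5);

/// Assumed parsing: monitoring is on by default and only an explicit
/// "0", "false", "off", or "no" turns it off.
pub fn parse_active_monitoring(raw: Option<&str>) -> bool {
    match raw.map(|s| s.trim().to_ascii_lowercase()) {
        Some(v) => !matches!(v.as_str(), "0" | "false" | "off" | "no"),
        None => true,
    }
}

/// Assumed format: a positive whole number of seconds. Anything else
/// falls back to `default`, which the caller must supply because the
/// PR does not state one.
pub fn parse_max_timeout(raw: Option<&str>, default: Duration) -> Duration {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|secs| *secs > 0)
        .map(Duration::from_secs)
        .unwrap_or(default)
}

/// A periodic check is skipped when an operation succeeded recently.
pub fn should_run_check(last_success: Option<Instant>, now: Instant) -> bool {
    match last_success {
        Some(t) => now.saturating_duration_since(t) >= SKIP_IF_SUCCESS_WITHIN,
        None => true,
    }
}
```

In a real process the raw values would come from the environment, for example `parse_active_monitoring(std::env::var("RUSTFS_DRIVE_ACTIVE_MONITORING").ok().as_deref())`.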

Files Modified

Added: crates/ecstore/src/disk/disk_store.rs (769 lines)
Modified: Remote disk client, peer S3 client, and disk management modules
Updated: Configuration handling and timeout utilities

Benefits

Improved Reliability: Automatic detection and handling of faulty disks
Better Performance: Configurable timeouts keep hung operations from blocking indefinitely
Enhanced Observability: Detailed logging for timeout and health check events
Fault Tolerance: Graceful handling of network and disk failures
Configurable Behavior: Environment-based configuration for different deployment scenarios

Testing

The implementation includes comprehensive error handling and logging. Health checks run continuously in background tasks with proper cancellation support for clean shutdowns.
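
As a rough picture of a background health check with clean cancellation, consider the thread-based sketch below. It is an analogy only: the PR's tasks are presumably asynchronous, and `HealthLoop` and its methods are names invented for this example.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical handle to a periodic background check.
pub struct HealthLoop {
    stop: Sender<()>,
    handle: JoinHandle<u64>,
}

impl HealthLoop {
    /// Spawns a thread that calls `check` once per `interval` until stopped.
    pub fn spawn<F>(interval: Duration, mut check: F) -> Self
    where
        F: FnMut() + Send + 'static,
    {
        let (stop, stop_rx) = mpsc::channel::<()>();
        let handle = thread::spawn(move || {
            let mut runs = 0u64;
            loop {
                // Waiting on the channel doubles as an interruptible sleep.
                match stop_rx.recv_timeout(interval) {
                    Err(RecvTimeoutError::Timeout) => {
                        check();
                        runs += 1;
                    }
                    // Stop was requested or the handle was dropped.
                    Ok(()) | Err(RecvTimeoutError::Disconnected) => return runs,
                }
            }
        });
        Self { stop, handle }
    }

    /// Signals cancellation, waits for the thread to exit, and returns
    /// how many checks ran (0 if the thread panicked).
    pub fn shutdown(self) -> u64 {
        let _ = self.stop.send(());
        self.handle.join().unwrap_or(0)
    }
}
```

Because the loop sleeps by waiting on the stop channel, a shutdown request takes effect immediately instead of after the remainder of the interval.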

Breaking Changes

None. All new functionality is additive and backward-compatible.

Related Issues

Addresses disk reliability concerns and timeout handling requirements in distributed storage systems.

Dec 19 '25 08:12 weisd

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

Dec 19 '25 09:12 github-actions[bot]