rustfs

Add Disk Timeout and Health Check Functionality

Open • weisd opened this issue 1 week ago • 1 comment

Summary

This PR adds disk timeout and health check functionality to improve system reliability and fault tolerance. It covers active monitoring, configurable timeouts, and automatic fault detection for disk operations.

Type of Change

  • [ ] New Feature
  • [x] Bug Fix
  • [ ] Documentation
  • [x] Performance Improvement
  • [ ] Test/CI
  • [x] Refactor
  • [ ] Other:

Related Issues

#1001

Key Features

Disk Health Monitoring

  • Active Disk Monitoring: Continuous health checks with configurable intervals (default: 15 seconds)
  • Health Status Tracking: Atomic tracking of disk health states (OK/FAULTY)
  • Automatic Recovery: Faulty disks can be marked as healthy upon successful operations
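As a rough illustration of the atomic OK/FAULTY tracking and recovery-on-success described above, here is a minimal sketch using only the standard library. The type name `HealthTracker`, the constants, and the method names are hypothetical and are not taken from the PR's `DiskHealthTracker`; only the two states and the "successful operation restores health" behavior come from the description.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical status codes; the PR only names the two states OK and FAULTY.
pub const DISK_HEALTH_OK: u32 = 0;
pub const DISK_HEALTH_FAULTY: u32 = 1;

/// Illustrative tracker (assumed shape, not the PR's actual type).
pub struct HealthTracker {
    status: AtomicU32,
    last_success_unix_secs: AtomicU64,
}

impl HealthTracker {
    pub fn new() -> Self {
        Self {
            status: AtomicU32::new(DISK_HEALTH_OK),
            last_success_unix_secs: AtomicU64::new(0),
        }
    }

    pub fn is_faulty(&self) -> bool {
        self.status.load(Ordering::Acquire) == DISK_HEALTH_FAULTY
    }

    /// Flips OK -> FAULTY. Returns true only for the caller that made the
    /// transition, so follow-up work (logging, starting a recovery probe)
    /// can be done exactly once.
    pub fn mark_faulty(&self) -> bool {
        self.status
            .compare_exchange(
                DISK_HEALTH_OK,
                DISK_HEALTH_FAULTY,
                Ordering::AcqRel,
                Ordering::Acquire,
            )
            .is_ok()
    }

    /// Records a successful operation and marks the disk healthy again.
    pub fn record_success(&self) {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0);
        self.last_success_unix_secs.store(now, Ordering::Release);
        self.status.store(DISK_HEALTH_OK, Ordering::Release);
    }

    pub fn last_success_unix_secs(&self) -> u64 {
        self.last_success_unix_secs.load(Ordering::Acquire)
    }
}
```

Using `compare_exchange` for the OK to FAULTY transition is one possible design choice: concurrent callers that all observe a failure agree on a single "first" reporter without taking a lock.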

Timeout Management

  • Configurable Timeouts: Environment variable RUSTFS_DRIVE_MAX_TIMEOUT_DURATION for maximum timeout duration
  • Operation-Level Timeouts: Individual timeouts for read, write, and health check operations
  • Graceful Degradation: Timeout failures are logged and handled gracefully without crashing
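The operation-level timeouts above can be sketched as a wrapper that runs an operation and gives up after a deadline, returning an error instead of crashing. This sketch uses only standard-library threads and channels, which may differ from how the PR implements it; `run_with_timeout` and `OpError` are hypothetical names.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum OpError {
    /// The operation did not finish within the deadline.
    Timeout,
    /// The worker thread ended without producing a result (e.g. it panicked).
    WorkerFailed,
}

/// Runs `op` on a worker thread and waits at most `timeout` for its result.
/// Note: on timeout the worker thread is not killed; it keeps running in the
/// background and its result is discarded.
pub fn run_with_timeout<T, F>(timeout: Duration, op: F) -> Result<T, OpError>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the caller already gave up, the receiver is gone and send fails;
        // that is expected, so the error is ignored.
        let _ = tx.send(op());
    });
    match rx.recv_timeout(timeout) {
        Ok(value) => Ok(value),
        Err(mpsc::RecvTimeoutError::Timeout) => Err(OpError::Timeout),
        Err(mpsc::RecvTimeoutError::Disconnected) => Err(OpError::WorkerFailed),
    }
}
```

A caller would typically log the `Timeout` case and mark the disk as faulty rather than propagate a panic, in line with the graceful degradation described above.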

Network Operations

  • TCP Connection Timeouts: 5-second timeout for TCP connectivity checks
  • Remote Disk Operations: Timeout protection for peer-to-peer disk operations
  • S3 Client Timeouts: Timeout handling for S3-compatible peer client operations
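For the TCP connectivity check, a minimal sketch with the standard library's `TcpStream::connect_timeout` looks like the following. The function name `tcp_reachable` and the constant name are hypothetical; only the 5-second value comes from the description above, and the PR may use an async equivalent instead.

```rust
use std::io;
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// 5-second limit, as stated in the PR description.
pub const TCP_CHECK_TIMEOUT: Duration = Duration::from_secs(5);

/// Returns Ok(()) if a TCP connection to `addr` can be established within
/// the timeout, otherwise the underlying I/O error (refused, timed out, ...).
/// The connection is closed immediately; no data is exchanged.
pub fn tcp_reachable(addr: SocketAddr) -> io::Result<()> {
    TcpStream::connect_timeout(&addr, TCP_CHECK_TIMEOUT).map(|_stream| ())
}
```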

Technical Implementation

New Components

  • DiskHealthTracker: Thread-safe disk health state management
  • LocalDiskWrapper: Wrapper for local disk with health monitoring
  • Enhanced timeout utilities in remote disk and peer client modules

Configuration

  • RUSTFS_DRIVE_ACTIVE_MONITORING: Enable/disable active monitoring (default: enabled)
  • RUSTFS_DRIVE_MAX_TIMEOUT_DURATION: Maximum allowed timeout duration
  • Health check interval: 15 seconds, with a 5-second skip window (a check is skipped when an operation has succeeded recently)
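The configuration above could be consumed roughly as follows. This is a sketch under stated assumptions: the accepted values for `RUSTFS_DRIVE_ACTIVE_MONITORING` are not given in the description, so the parsing rule here (only "false", "0", or "off" disable monitoring) is an assumption, as are the function names and the exact boundary of the skip window.

```rust
use std::time::{Duration, Instant};

/// Values taken from the PR description.
pub const CHECK_INTERVAL: Duration = Duration::from_secs(15);
pub const SKIP_IF_SUCCESS_WITHIN: Duration = Duration::from_secs(5);

/// ASSUMPTION: the value format is not documented in the PR description.
/// Here monitoring stays enabled (the stated default) unless the variable
/// is explicitly set to "false", "0", or "off".
pub fn monitoring_enabled(raw: Option<&str>) -> bool {
    match raw.map(|v| v.trim().to_ascii_lowercase()) {
        Some(v) => !matches!(v.as_str(), "false" | "0" | "off"),
        None => true,
    }
}

/// Decides whether a periodic health probe should run at `now`.
/// A probe is skipped when an operation succeeded less than
/// `SKIP_IF_SUCCESS_WITHIN` ago, since that success already shows
/// the disk is responsive.
pub fn should_probe(last_success: Option<Instant>, now: Instant) -> bool {
    match last_success {
        Some(t) => now.saturating_duration_since(t) >= SKIP_IF_SUCCESS_WITHIN,
        None => true,
    }
}
```

At startup, the flag would be read with something like `monitoring_enabled(std::env::var("RUSTFS_DRIVE_ACTIVE_MONITORING").ok().as_deref())`.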

Files Modified

  • Added: crates/ecstore/src/disk/disk_store.rs (769 lines)
  • Modified: Remote disk client, peer S3 client, and disk management modules
  • Updated: Configuration handling and timeout utilities

Benefits

  1. Improved Reliability: Automatic detection and handling of faulty disks
  2. Better Performance: Prevents hanging operations with configurable timeouts
  3. Enhanced Observability: Detailed logging for timeout and health check events
  4. Fault Tolerance: Graceful handling of network and disk failures
  5. Configurable Behavior: Environment-based configuration for different deployment scenarios

Testing

The implementation includes comprehensive error handling and logging. Health checks run continuously in background tasks with proper cancellation support for clean shutdowns.
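To illustrate what a background health check with cancellation support can look like, here is a minimal standard-library sketch. It uses a thread and a channel as the stop signal; the PR's background tasks may well be async and use a different cancellation mechanism. `Monitor`, `spawn_monitor`, and `shutdown` are hypothetical names.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Handle to a running background monitor (illustrative only).
pub struct Monitor {
    stop_tx: Sender<()>,
    handle: JoinHandle<u32>,
}

/// Spawns a thread that calls `probe` once per `interval` until stopped.
pub fn spawn_monitor<F>(interval: Duration, mut probe: F) -> Monitor
where
    F: FnMut() + Send + 'static,
{
    let (stop_tx, stop_rx) = mpsc::channel::<()>();
    let handle = thread::spawn(move || {
        let mut runs = 0u32;
        loop {
            // Waiting on the stop channel doubles as the sleep between
            // probes, so a stop request is seen without waiting out the
            // full interval.
            match stop_rx.recv_timeout(interval) {
                Err(RecvTimeoutError::Timeout) => {
                    probe();
                    runs += 1;
                }
                // Explicit stop, or the Monitor handle was dropped.
                Ok(()) | Err(RecvTimeoutError::Disconnected) => return runs,
            }
        }
    });
    Monitor { stop_tx, handle }
}

impl Monitor {
    /// Signals the thread to stop, waits for it to exit, and returns how
    /// many probes ran (0 if the thread panicked).
    pub fn shutdown(self) -> u32 {
        let _ = self.stop_tx.send(());
        self.handle.join().unwrap_or(0)
    }
}
```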

Breaking Changes

None. All new functionality is additive and backward-compatible.

Related Issues

Addresses disk reliability concerns and timeout handling requirements in distributed storage systems.

weisd, Dec 19 '25 08:12

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

github-actions[bot], Dec 19 '25 09:12