
Feat: Add comprehensive error handling and logging system

Open · aries043 opened this issue 10 months ago · 1 comment

Summary

This PR introduces a comprehensive error handling and logging system to replace fragile experiment execution with robust, recoverable benchmarking. The change enables resilient experiment management with detailed logging, checkpoint recovery, and graceful error handling while maintaining full backward compatibility.

Problem

The current benchmarking script has several reliability and debugging issues:

# Current fragile approach
try:
    experiment.run(optimizer_class)
    print(f"    runtime: {time.time() - start_time:.5e}.")
except Exception as e:
    print(f"    ERROR: {e}")
    print(f"    runtime: {time.time() - start_time:.5e}.")
    # Entire experiment batch terminates here

This approach has several issues:

  • Fragile execution model: Single experiment failure causes entire experiment batch to terminate
  • Poor error visibility: Errors are printed as simple messages without context or detailed information
  • No recovery mechanism: Failed experiments must restart from the beginning, losing all progress
  • Limited debugging capability: Insufficient information for troubleshooting complex experiment failures

Solution

1. Structured Logging System

Introduced comprehensive logging with configurable levels and multiple outputs:

def setup_logging(config: ExperimentConfig) -> logging.Logger:
    logger = logging.getLogger('benchmarking')
    logger.setLevel(getattr(logging, config.log_level.upper()))

    formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S'
    )
    # Console handler always; optional file handler when config.log_file is set
    console_handler = logging.StreamHandler()
    console_handler.setFormatter(formatter)
    logger.addHandler(console_handler)
    if config.log_file:
        file_handler = logging.FileHandler(config.log_file)
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
    return logger

2. Robust Error Handling

Added context manager for graceful error recovery:

@contextmanager
def experiment_error_handler(logger: logging.Logger, experiment_info: str, continue_on_error: bool = True):
    try:
        yield
    except KeyboardInterrupt:
        logger.warning(f"Experiment interrupted by user: {experiment_info}")
        raise  # propagate so the batch can shut down gracefully
    except MemoryError:
        logger.error(f"Memory error in experiment: {experiment_info}")
        if not continue_on_error:
            raise
    except Exception as e:
        logger.error(f"Experiment failed: {experiment_info}: {e}")
        logger.debug(f"Full traceback: {traceback.format_exc()}")
        if not continue_on_error:
            raise
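
The handler above is meant to wrap each experiment in the batch loop. A minimal, self-contained sketch of that usage (the `run` function and experiment names below are illustrative, not part of the PR):

```python
import logging
from contextlib import contextmanager

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger('benchmarking')

@contextmanager
def experiment_error_handler(logger, experiment_info, continue_on_error=True):
    # simplified version of the handler above: log, then continue or re-raise
    try:
        yield
    except Exception as e:
        logger.error(f"Experiment failed: {experiment_info}: {e}")
        if not continue_on_error:
            raise

def run(name):
    # stand-in for experiment.run(optimizer_class)
    if name == "bad":
        raise ValueError("simulated failure")
    return f"{name}: ok"

results = []
for name in ["a", "bad", "b"]:
    with experiment_error_handler(logger, name):
        results.append(run(name))

print(results)  # ['a: ok', 'b: ok'] -- the failing experiment was skipped
```

With `continue_on_error=True` (the default), a failing experiment is logged and skipped rather than aborting the whole batch.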

3. Checkpoint and Recovery System

Added experiment state management for automatic resumption:

class ExperimentState:
    def __init__(self, config: ExperimentConfig):
        self.completed_experiments = set()
        self.failed_experiments = []
        self.load_checkpoint()

    def save_checkpoint(self):
        # Periodic state saving for recovery
        ...

4. Enhanced Configuration

Extended configuration with reliability and monitoring options:

@dataclass
class ExperimentConfig:
    continue_on_error: bool = True      # Continue after individual failures
    log_level: str = "INFO"             # Configurable logging verbosity
    log_file: Optional[str] = None      # Optional persistent log file
    checkpoint_interval: int = 5        # Automatic checkpoint frequency
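
One way this dataclass can guard against silent misconfiguration is to validate incoming keys; the `config_from_dict` helper below is a hypothetical sketch of such validation, not code from the PR:

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class ExperimentConfig:
    continue_on_error: bool = True
    log_level: str = "INFO"
    log_file: Optional[str] = None
    checkpoint_interval: int = 5

def config_from_dict(raw: dict) -> ExperimentConfig:
    # reject typos in config keys instead of silently ignoring them
    known = {f.name for f in fields(ExperimentConfig)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return ExperimentConfig(**raw)

cfg = config_from_dict({"log_level": "DEBUG", "log_file": "experiments.log"})
print(cfg.log_level, cfg.checkpoint_interval)  # DEBUG 5
```

Unspecified fields fall back to the defaults shown above, so a partial YAML file (as in the testing instructions) is enough.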

Key Benefits

  • ✅ Resilience: Individual experiment failures don't terminate the entire batch
  • ✅ Observability: Comprehensive logging with timestamps, levels, and structured context
  • ✅ Recoverability: Automatic resumption from checkpoints after interruptions or failures
  • ✅ Debugging: Detailed error information with full tracebacks and experiment context
  • ✅ Progress Tracking: Real-time monitoring of experiment completion status and statistics
  • ✅ Configurability: Adjustable error handling behavior and logging output

Backward Compatibility

  • ✅ Command-line interface unchanged: All existing scripts and commands work exactly the same
  • ✅ Same output format: Results are saved in identical locations with same naming convention
  • ✅ No breaking changes: All existing scripts continue to function without modification
  • ✅ Progressive enhancement: New reliability features are entirely opt-in

Testing

Added comprehensive GitHub Actions workflow that tests:

  • ✅ Structured logging system with multiple handlers and formatters
  • ✅ Error handling context manager with specific exception types
  • ✅ Checkpoint save/load functionality and state management
  • ✅ Enhanced configuration options and validation
  • ✅ Experiment statistics tracking and recovery scenarios

Error Handling Features

Graceful Error Recovery

  • KeyboardInterrupt: Graceful shutdown with progress preservation
  • MemoryError: Automatic cleanup and continuation with other experiments
  • ImportError: Clear module loading error messages with suggestions
  • General Exceptions: Detailed logging with full traceback information
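
For the ImportError case, the idea is to attach an actionable hint to the log message. A hedged sketch (the `load_optimizer` helper and module names are illustrative, not the PR's actual code):

```python
import importlib
import logging

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger('benchmarking')

def load_optimizer(module_name: str, class_name: str):
    """Import an optimizer class, logging an actionable hint on failure."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, class_name)
    except ImportError as e:
        logger.error(f"Could not import {module_name}: {e}. "
                     f"Hint: check spelling and that the package is installed.")
        return None

print(load_optimizer("no_such_module_xyz", "CMAES"))  # None (error logged)
```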

Checkpoint System

  • Automatic State Saving: Progress saved every N experiments (configurable)
  • Resume Capability: Automatic detection and skipping of completed experiments
  • Failed Experiment Tracking: Detailed failure logs with timestamps for analysis
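
The checkpoint mechanics above can be sketched with plain JSON persistence; the file layout and key names here are assumptions for illustration, not the PR's actual format:

```python
import json
import os
import tempfile

class ExperimentState:
    def __init__(self, path):
        self.path = path
        self.completed_experiments = set()
        self.failed_experiments = []
        self.load_checkpoint()

    def load_checkpoint(self):
        # resume: restore progress if a checkpoint file already exists
        if os.path.exists(self.path):
            with open(self.path) as f:
                data = json.load(f)
            self.completed_experiments = set(data.get("completed", []))
            self.failed_experiments = data.get("failed", [])

    def save_checkpoint(self):
        with open(self.path, "w") as f:
            json.dump({"completed": sorted(self.completed_experiments),
                       "failed": self.failed_experiments}, f)

path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
state = ExperimentState(path)
state.completed_experiments.add("CMAES-f0-d10")
state.save_checkpoint()
resumed = ExperimentState(path)  # a restarted run picks up the saved state
print("CMAES-f0-d10" in resumed.completed_experiments)  # True
```

A restarted run can then skip any experiment whose identifier is already in `completed_experiments`.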

Comprehensive Logging

  • Multi-Level Output: Console and optional file logging with different verbosity levels
  • Structured Messages: Timestamp, level, logger name, and context for all entries
  • Experiment Tracking: Individual experiment lifecycle logging (start, success, failure)
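
The formatter from the setup code determines the shape of every entry; a small self-contained demonstration (the example message is illustrative):

```python
import logging
import sys

formatter = logging.Formatter(
    '%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S')
logger = logging.getLogger('benchmarking')
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.info("experiment started: CMAES on f0 (ndim=10)")
# emits e.g.: 2025-05-30 01:05:00 - benchmarking - INFO - experiment started: ...
```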

Files Changed

  • tutorials/benchmarking_lsbbo_2.py - Added comprehensive error handling and logging system
  • .github/workflows/test-refactoring-3.yml - Error handling and logging focused test suite

Future Enhancements

This error handling system lays the groundwork for future improvements:

  • Distributed execution with fault tolerance
  • Advanced recovery strategies and selective retry mechanisms
  • Performance monitoring and resource usage optimization
  • Integration with external monitoring and alerting systems

However, this PR focuses solely on establishing robust error handling and logging infrastructure while maintaining simplicity and backward compatibility.


Testing Instructions:

# Test resilient error handling with batch experiments
python tutorials/benchmarking_lsbbo_2.py --start 0 --end 2 --optimizer CMAES --ndim_problem 10

# Test checkpoint recovery (interrupt with Ctrl+C, then restart)
python tutorials/benchmarking_lsbbo_2.py --start 0 --end 5 --optimizer JADE --ndim_problem 10
# Ctrl+C to interrupt, then restart with same command

# Test debug logging with file output
echo "log_level: DEBUG
log_file: experiments.log" > debug_config.yaml
python tutorials/benchmarking_lsbbo_2.py --config debug_config.yaml --start 0 --end 1 --optimizer PRS --ndim_problem 2

aries043 · May 30 '25 01:05

@aries043 Thanks again very much for your suggestion on the logging system. I will integrate it after I check it. TKS Again and Again.