Feat: Add comprehensive error handling and logging system
Summary
This PR introduces a comprehensive error handling and logging system to replace fragile experiment execution with robust, recoverable benchmarking. The change enables resilient experiment management with detailed logging, checkpoint recovery, and graceful error handling while maintaining full backward compatibility.
Problem
The current benchmarking script has several reliability and debugging issues:
# Current fragile approach
try:
    experiment.run(optimizer_class)
    print(f" runtime: {time.time() - start_time:.5e}.")
except Exception as e:
    print(f" ERROR: {e}")
    print(f" runtime: {time.time() - start_time:.5e}.")
    # Entire experiment batch terminates here
Specifically:
- Fragile execution model: A single experiment failure causes the entire experiment batch to terminate
- Poor error visibility: Errors are printed as simple messages without context or detailed information
- No recovery mechanism: Failed experiments must restart from the beginning, losing all progress
- Limited debugging capability: Insufficient information for troubleshooting complex experiment failures
Solution
1. Structured Logging System
Introduced comprehensive logging with configurable levels and multiple outputs:
def setup_logging(config: ExperimentConfig) -> logging.Logger:
    logger = logging.getLogger('benchmarking')
    logger.setLevel(getattr(logging, config.log_level.upper()))
    formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S'
    )
    # Console and file handlers with structured output
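For reviewers, the elided handler setup could look roughly like the sketch below. It is an illustration, not the code in this PR: `LogConfig` is a hypothetical stand-in for `ExperimentConfig` that carries only the two logging fields, and the choice to reset existing handlers on each call is an assumption.

```python
import logging
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogConfig:
    """Hypothetical stand-in for ExperimentConfig (logging fields only)."""
    log_level: str = "INFO"
    log_file: Optional[str] = None


def setup_logging(config: LogConfig) -> logging.Logger:
    logger = logging.getLogger("benchmarking")
    logger.setLevel(getattr(logging, config.log_level.upper()))

    # Assumption: drop handlers from earlier calls so messages are not duplicated.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()

    formatter = logging.Formatter(
        "%(asctime)s - %(name)s - %(levelname)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )

    # Console handler is always attached.
    console = logging.StreamHandler(sys.stdout)
    console.setFormatter(formatter)
    logger.addHandler(console)

    # File handler is attached only when a log file is configured.
    if config.log_file:
        file_handler = logging.FileHandler(config.log_file, encoding="utf-8")
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)

    return logger
```

Calling `setup_logging(LogConfig(log_level="debug", log_file="experiments.log"))` would then write the same formatted records to both the console and the file.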
2. Robust Error Handling
Added context manager for graceful error recovery:
@contextmanager
def experiment_error_handler(logger: logging.Logger, experiment_info: str, continue_on_error: bool = True):
    try:
        yield
    except KeyboardInterrupt:
        logger.warning(f"Experiment interrupted by user: {experiment_info}")
    except MemoryError:
        logger.error(f"Memory error in experiment: {experiment_info}")
    except Exception as e:
        logger.error(f"Experiment failed: {experiment_info}")
        logger.debug(f"Full traceback: {traceback.format_exc()}")
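The excerpt above does not show how `continue_on_error` is consulted. One plausible reading, sketched here under stated assumptions, is that a user interrupt always propagates (so the batch stops) while other errors propagate only when `continue_on_error` is false. The `demo` function and the experiment names are hypothetical.

```python
import logging
import traceback
from contextlib import contextmanager


@contextmanager
def experiment_error_handler(logger, experiment_info, continue_on_error=True):
    try:
        yield
    except KeyboardInterrupt:
        logger.warning(f"Experiment interrupted by user: {experiment_info}")
        raise  # assumption: an interrupt always stops the batch
    except MemoryError:
        logger.error(f"Memory error in experiment: {experiment_info}")
        if not continue_on_error:
            raise
    except Exception as e:
        logger.error(f"Experiment failed: {experiment_info}: {e}")
        logger.debug(f"Full traceback: {traceback.format_exc()}")
        if not continue_on_error:
            raise  # otherwise the exception is swallowed and the batch continues


def demo():
    """Run three hypothetical experiments; the second one fails."""
    logger = logging.getLogger("benchmarking.demo")
    completed = []
    for name in ["exp-0", "exp-1", "exp-2"]:
        with experiment_error_handler(logger, name):
            if name == "exp-1":
                raise ValueError("simulated failure")
            completed.append(name)
    return completed
```

With the default `continue_on_error=True`, `demo()` logs the failure of `exp-1` and still runs `exp-2`.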
3. Checkpoint and Recovery System
Added experiment state management for automatic resumption:
class ExperimentState:
    def __init__(self, config: ExperimentConfig):
        self.completed_experiments = set()
        self.failed_experiments = []
        self.load_checkpoint()

    def save_checkpoint(self):
        # Periodic state saving for recovery
        ...
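To make the save/load round trip concrete, here is a minimal sketch of how such a state object could persist itself. The JSON layout, the `checkpoint_path` parameter, the `mark_completed` and `mark_failed` helpers, and the use of string experiment IDs are all assumptions for illustration; the PR's actual on-disk format is not shown in this description.

```python
import json
import os
import tempfile
from datetime import datetime


class ExperimentState:
    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path  # hypothetical parameter
        self.completed_experiments = set()
        self.failed_experiments = []
        self.load_checkpoint()

    def load_checkpoint(self):
        # A missing file means a fresh run: keep the empty defaults.
        if not os.path.exists(self.checkpoint_path):
            return
        with open(self.checkpoint_path, encoding="utf-8") as f:
            data = json.load(f)
        self.completed_experiments = set(data.get("completed", []))
        self.failed_experiments = list(data.get("failed", []))

    def save_checkpoint(self):
        data = {
            "completed": sorted(self.completed_experiments),
            "failed": self.failed_experiments,
        }
        # Write to a temporary file in the same directory, then rename it over
        # the checkpoint so an interrupt mid-write cannot leave a partial file.
        directory = os.path.dirname(os.path.abspath(self.checkpoint_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
        os.replace(tmp_path, self.checkpoint_path)

    def mark_completed(self, experiment_id):
        self.completed_experiments.add(experiment_id)

    def mark_failed(self, experiment_id, error):
        self.failed_experiments.append({
            "id": experiment_id,
            "error": str(error),
            "timestamp": datetime.now().isoformat(timespec="seconds"),
        })
```

A new `ExperimentState` pointed at the same path picks up the saved sets, which is what lets a restarted run skip finished work.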
4. Enhanced Configuration
Extended configuration with reliability and monitoring options:
@dataclass
class ExperimentConfig:
    continue_on_error: bool = True   # Continue after individual failures
    log_level: str = "INFO"          # Configurable logging verbosity
    log_file: Optional[str] = None   # Optional persistent log file
    checkpoint_interval: int = 5     # Automatic checkpoint frequency
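Because `setup_logging` resolves the level with `getattr(logging, ...)`, a misspelled `log_level` would only fail once logging is configured. A sketch of validating the new fields up front is shown below; the `__post_init__` checks are an assumption about what the validation could do, and only the four fields listed above are included.

```python
from dataclasses import dataclass
from typing import Optional

# Standard level names defined by the logging module.
_VALID_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


@dataclass
class ExperimentConfig:
    continue_on_error: bool = True
    log_level: str = "INFO"
    log_file: Optional[str] = None
    checkpoint_interval: int = 5

    def __post_init__(self):
        # Normalize case so "debug" and "DEBUG" are treated the same.
        self.log_level = self.log_level.upper()
        if self.log_level not in _VALID_LEVELS:
            raise ValueError(
                f"log_level must be one of {_VALID_LEVELS}, got {self.log_level!r}"
            )
        if self.checkpoint_interval < 1:
            raise ValueError("checkpoint_interval must be at least 1")
```

With this in place, `ExperimentConfig(log_level="verbose")` raises a `ValueError` at construction time instead of failing later inside the logging setup.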
Key Benefits
- ✅ Resilience: Individual experiment failures don't terminate the entire batch
- ✅ Observability: Comprehensive logging with timestamps, levels, and structured context
- ✅ Recoverability: Automatic resumption from checkpoints after interruptions or failures
- ✅ Debugging: Detailed error information with full tracebacks and experiment context
- ✅ Progress Tracking: Real-time monitoring of experiment completion status and statistics
- ✅ Configurability: Adjustable error handling behavior and logging output
Backward Compatibility
- ✅ Command-line interface unchanged: All existing scripts and commands work exactly the same
- ✅ Same output format: Results are saved in identical locations with same naming convention
- ✅ No breaking changes: All existing scripts continue to function without modification
- ✅ Progressive enhancement: New reliability features are entirely opt-in
Testing
Added a GitHub Actions workflow that tests:
- ✅ Structured logging system with multiple handlers and formatters
- ✅ Error handling context manager with specific exception types
- ✅ Checkpoint save/load functionality and state management
- ✅ Enhanced configuration options and validation
- ✅ Experiment statistics tracking and recovery scenarios
Error Handling Features
Graceful Error Recovery
- KeyboardInterrupt: Graceful shutdown with progress preservation
- MemoryError: Automatic cleanup and continuation with other experiments
- ImportError: Clear module loading error messages with suggestions
- General Exceptions: Detailed logging with full traceback information
Checkpoint System
- Automatic State Saving: Progress saved every N experiments (configurable)
- Resume Capability: Automatic detection and skipping of completed experiments
- Failed Experiment Tracking: Detailed failure logs with timestamps for analysis
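The skip-and-checkpoint behavior described in this list can be sketched as a small driver loop. `run_batch`, `run_one`, and the callback-style `save_checkpoint` argument are hypothetical names chosen for illustration, not the PR's actual API.

```python
def run_batch(experiment_ids, run_one, completed, save_checkpoint, checkpoint_interval=5):
    """Run pending experiments, skipping completed ones and saving every N runs."""
    finished_since_checkpoint = 0
    for experiment_id in experiment_ids:
        if experiment_id in completed:
            continue  # already done in a previous run: skip on resume
        run_one(experiment_id)
        completed.add(experiment_id)
        finished_since_checkpoint += 1
        if finished_since_checkpoint >= checkpoint_interval:
            save_checkpoint()
            finished_since_checkpoint = 0
    # Persist any progress made since the last periodic save.
    if finished_since_checkpoint:
        save_checkpoint()
```

A smaller `checkpoint_interval` loses less work on interruption at the cost of more frequent writes.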
Comprehensive Logging
- Multi-Level Output: Console and optional file logging with different verbosity levels
- Structured Messages: Timestamp, level, logger name, and context for all entries
- Experiment Tracking: Individual experiment lifecycle logging (start, success, failure)
Files Changed
- tutorials/benchmarking_lsbbo_2.py: Added comprehensive error handling and logging system
- .github/workflows/test-refactoring-3.yml: Error handling and logging focused test suite
Future Enhancements
This error handling system lays the groundwork for future improvements:
- Distributed execution with fault tolerance
- Advanced recovery strategies and selective retry mechanisms
- Performance monitoring and resource usage optimization
- Integration with external monitoring and alerting systems
However, this PR focuses solely on establishing robust error handling and logging infrastructure while maintaining simplicity and backward compatibility.
Testing Instructions:
# Test resilient error handling with batch experiments
python tutorials/benchmarking_lsbbo_2.py --start 0 --end 2 --optimizer CMAES --ndim_problem 10
# Test checkpoint recovery (interrupt with Ctrl+C, then restart)
python tutorials/benchmarking_lsbbo_2.py --start 0 --end 5 --optimizer JADE --ndim_problem 10
# Ctrl+C to interrupt, then restart with same command
# Test debug logging with file output
echo "log_level: DEBUG
log_file: experiments.log" > debug_config.yaml
python tutorials/benchmarking_lsbbo_2.py --config debug_config.yaml --start 0 --end 1 --optimizer PRS --ndim_problem 2
@aries043 Thanks again very much for your suggestion on the logging system. I will integrate it after I check it. TKS Again and Again.