
test: error handle, state mgmt, backoff, timeouts

Open samrose opened this issue 9 months ago • 2 comments

What kind of change does this PR introduce?

EC2 Test Resilience Improvements

Retry Wrapper Function

  • Added retry_with_backoff decorator that implements exponential backoff
  • Configurable retry attempts, delays, and exception types
  • Proper logging of retry attempts and failures
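
A minimal sketch of what such a decorator could look like. The parameter names (`max_attempts`, `initial_delay`, `max_delay`, `backoff_factor`, `exceptions`, `sleep`) are assumptions for illustration and may not match the names used in this PR:

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)


def retry_with_backoff(
    max_attempts=3,
    initial_delay=1.0,
    max_delay=30.0,
    backoff_factor=2.0,
    exceptions=(Exception,),
    sleep=time.sleep,  # hypothetical hook so tests can avoid real waiting
):
    """Retry the wrapped function, growing the delay after each failure."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as exc:
                    if attempt == max_attempts:
                        logger.error(
                            "%s failed after %d attempts: %s",
                            func.__name__, attempt, exc,
                        )
                        raise
                    logger.warning(
                        "%s failed (attempt %d/%d): %s; retrying in %.1fs",
                        func.__name__, attempt, max_attempts, exc, delay,
                    )
                    sleep(delay)
                    # Exponential growth, capped at max_delay.
                    delay = min(delay * backoff_factor, max_delay)

        return wrapper

    return decorator
```

Exceptions outside the `exceptions` tuple propagate immediately, so only failures that are expected to be transient get retried.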

Error Handling and Logging

  • Added comprehensive error handling throughout the code
  • Improved logging with detailed messages and error context
  • Added proper exception handling for AWS API calls

Instance State Management

  • Added wait_for_instance_running function with retries
  • Added proper state validation before proceeding
  • Added timeout for instance state transitions
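
As a rough illustration, a polling loop along these lines would cover the three points above. It assumes a boto3-style EC2 client passed in as `ec2_client`; the `timeout`, `poll_interval`, `sleep`, and `clock` parameters are hypothetical and not taken from this PR:

```python
import logging
import time

logger = logging.getLogger(__name__)

# States from which an instance will not reach "running" on its own.
UNEXPECTED_STATES = ("shutting-down", "terminated", "stopping", "stopped")


def wait_for_instance_running(ec2_client, instance_id, timeout=300.0,
                              poll_interval=5.0, sleep=time.sleep,
                              clock=time.monotonic):
    """Poll until the instance is 'running'; raise on timeout or a bad state."""
    deadline = clock() + timeout
    while True:
        response = ec2_client.describe_instances(InstanceIds=[instance_id])
        state = response["Reservations"][0]["Instances"][0]["State"]["Name"]
        logger.info("Instance %s is in state %s", instance_id, state)
        if state == "running":
            return
        if state in UNEXPECTED_STATES:
            raise RuntimeError(
                f"Instance {instance_id} entered unexpected state {state!r}"
            )
        if clock() >= deadline:
            raise TimeoutError(
                f"Instance {instance_id} not running after {timeout}s"
            )
        sleep(poll_interval)
```

boto3 also ships a built-in `instance_running` waiter (`ec2_client.get_waiter("instance_running")`); a hand-written loop like this one mainly helps when per-poll logging or early failure on an unexpected state is wanted.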

Backoff Strategy

  • Implemented exponential backoff in the retry decorator
  • Configurable initial delay and maximum delay
  • Proper sleep intervals between retries

Resource Validation

  • Added validate_aws_resources function to check security groups and IAM roles
  • Validates resources before instance creation
  • Provides clear error messages for validation failures
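
The validation step might look roughly like the sketch below. It assumes boto3-style EC2 and IAM clients are passed in; the function signature and the `ResourceValidationError` class are hypothetical. It catches a broad `Exception` only to stay dependency-free, whereas real code would more likely catch `botocore.exceptions.ClientError`:

```python
import logging

logger = logging.getLogger(__name__)


class ResourceValidationError(Exception):
    """Raised when a required AWS resource is missing or inaccessible."""


def validate_aws_resources(ec2_client, iam_client, security_group_name, role_name):
    """Check that the security group and IAM role exist before launching."""
    problems = []

    try:
        groups = ec2_client.describe_security_groups(
            GroupNames=[security_group_name]
        )
        if not groups.get("SecurityGroups"):
            problems.append(f"security group {security_group_name!r} not found")
    except Exception as exc:  # assumption: ClientError in real boto3 usage
        problems.append(f"security group {security_group_name!r}: {exc}")

    try:
        iam_client.get_role(RoleName=role_name)
    except Exception as exc:  # assumption: ClientError in real boto3 usage
        problems.append(f"IAM role {role_name!r}: {exc}")

    if problems:
        # Report every failed check at once rather than stopping at the first.
        message = "AWS resource validation failed: " + "; ".join(problems)
        logger.error(message)
        raise ResourceValidationError(message)

    logger.info(
        "Validated security group %s and IAM role %s",
        security_group_name, role_name,
    )
```

Collecting all problems before raising means one failed run surfaces every missing resource instead of one per attempt.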

Simplified Startup

  • Broke down the instance creation process into smaller, focused functions
  • Each function has a single responsibility
  • Better error isolation and handling

AWS API Timeouts

  • Added proper timeouts for SSH connections
  • Added timeout for health checks
  • Added timeout for instance state transitions
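
For the SSH case, one way to bound the wait is to probe the TCP port with both a per-connection timeout and an overall deadline. This is only a sketch: `wait_for_ssh_port` and its parameters are hypothetical names, and it checks TCP reachability only, not SSH authentication:

```python
import socket
import time


def wait_for_ssh_port(host, port=22, timeout=60.0, connect_timeout=5.0,
                      poll_interval=2.0):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # connect_timeout bounds each individual attempt.
            with socket.create_connection((host, port), timeout=connect_timeout):
                return True
        except OSError:
            # Refused, unreachable, or timed out: retry until the deadline.
            pass
        if time.monotonic() + poll_interval > deadline:
            return False
        time.sleep(poll_interval)
```

Using two separate limits keeps a single hung connection attempt from consuming the whole time budget.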

Robust Health Checks

  • Improved health check system with proper error handling
  • Added timeout for health checks
  • Better logging of health check failures
  • Separate function for checking individual services
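
The split between a per-service check and an overall loop with a timeout could be sketched as follows. Here `run_command` is a hypothetical callable that runs a shell command on the instance (for example over SSH) and returns `(exit_code, output)`; the use of `systemctl is-active` and the function names are assumptions, not necessarily what the PR does:

```python
import logging
import time

logger = logging.getLogger(__name__)


def check_service(run_command, service):
    """Return True if the service reports active; log and return False otherwise."""
    try:
        exit_code, output = run_command(f"systemctl is-active {service}")
    except Exception as exc:
        logger.warning("Health check for %s raised: %s", service, exc)
        return False
    if exit_code != 0:
        logger.warning("Service %s not healthy: %s", service, output.strip())
        return False
    return True


def run_health_checks(run_command, services, timeout=120.0, poll_interval=5.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until every service is healthy; raise TimeoutError naming failures."""
    deadline = clock() + timeout
    while True:
        failing = [s for s in services if not check_service(run_command, s)]
        if not failing:
            return
        if clock() >= deadline:
            raise TimeoutError(
                f"Services still unhealthy after {timeout}s: {failing}"
            )
        sleep(poll_interval)
```

Naming the failing services in the timeout error makes it easier to see from CI output which service never came up.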

Cleanup Code

  • Added proper cleanup in finally block
  • Ensures instance termination even on failures
  • Logs cleanup failures
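
The cleanup pattern described above can be illustrated like this. `run_with_cleanup`, `launch_instance`, and `run_tests` are hypothetical names; only `terminate_instances(InstanceIds=[...])` is the actual boto3 EC2 client call:

```python
import logging

logger = logging.getLogger(__name__)


def run_with_cleanup(ec2_client, launch_instance, run_tests):
    """Launch, test, and always attempt termination, even when a step fails."""
    instance_id = None
    try:
        instance_id = launch_instance()
        run_tests(instance_id)
    finally:
        # Nothing to clean up if the launch itself failed.
        if instance_id is not None:
            try:
                ec2_client.terminate_instances(InstanceIds=[instance_id])
                logger.info("Terminated instance %s", instance_id)
            except Exception:
                # Log instead of raising so a cleanup error does not mask
                # the original test failure.
                logger.exception("Failed to terminate instance %s", instance_id)
```

In a pytest-based suite the same idea is often expressed as a fixture that yields the instance and terminates it after the `yield`.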

Detailed Logging

  • Added comprehensive logging throughout
  • Logs all major operations and state transitions
  • Logs errors with proper context
  • Helps diagnose failures

samrose, Apr 14 '25 18:04

Will this help the sporadic timeouts we get from time to time on the testinfra CI job?

steve-chavez, Apr 14 '25 20:04

> Will this help the sporadic timeouts we get from time to time on the testinfra CI job?

Steve, yes, that is what I am trying to target. I am going to wait on this until I finish https://github.com/supabase/postgres/pull/1547, as that will let me iterate on this locally.

samrose, Apr 14 '25 20:04