vault-secrets-operator

Use exponential backoffs on secret source errors.

Open · benashz opened this issue 1 year ago · 1 comment

Previously, the backoff duration was a fixed interval plus some jitter. This PR introduces exponential backoffs for all secret-syncing controllers. The backoff is calculated and honored whenever an error is encountered while fetching from a secret source, e.g. Vault or HCPVS. The backoff configuration is controlled via the following new command-line arguments:

  -back-off-initial-interval duration
        Initial interval between retries on secret source errors. All errors are tried using an exponential backoff strategy. Also set from environment variable VSO_BACK_OFF_INITIAL_INTERVAL. (default 5s)
  -back-off-max-interval duration
        Maximum interval between retries on secret source errors. All errors are tried using an exponential backoff strategy. Also set from environment variable VSO_BACK_OFF_MAX_INTERVAL. (default 1m0s)
  -back-off-multiplier float
        Sets the multiplier for increasing the interval between retries on secret source errors. All errors are tried using an exponential backoff strategy. Also set from environment variable VSO_BACK_OFF_MULTIPLIER. (default 1.5)
  -back-off-randomization-factor float
        Sets the randomization factor to add jitter to the interval between retries on secret source errors. All errors are tried using an exponential backoff strategy. Also set from environment variable VSO_BACK_OFF_RANDOMIZATION_FACTOR. (default 0.5)

The same settings can also be configured through the Helm chart values:

    # Backoff settings for the controller manager. These settings control the backoff behavior
    # when the controller encounters an error while fetching secrets from the SecretSource.
    backOffOnSecretSourceError:
      # Initial interval between retries.
      # @type: duration
      initialInterval: "5s"
      # Maximum interval between retries.
      # @type: duration
      maxInterval: "60s"
      # Randomization factor to add jitter to the interval between retries.
      # @type: float
      randomizationFactor: 0.5
      # Sets the multiplier for increasing the interval between retries.
      # @type: float
      multiplier: 1.5

benashz · May 8, 2024, 16:05

👍

Would it make sense to log the backoff settings on startup? That would make it easier for users to see what's set.

Thanks! I made that change in 15e14c91bacf2ebdcfc5ad3888975587fe0287c3. I also added a new Prometheus metric that includes the same info:

vso_runtime_config{backOffInitialInterval="5s",backOffMaxInterval="1m0s",backOffMultiplier="1.50",backOffRandomizationFactor="0.50",clientCachePersistenceModel="direct-encrypted",clientCacheSize="10000",globalTransformationOptions="",maxConcurrentReconciles="100"} 1

benashz · May 16, 2024, 01:05