
feat(flagd): Improve flagd retry logic and error logging

Open guidobrei opened this issue 4 months ago • 4 comments

This PR

  • Improves error logging for the in-process resolver in remote mode.
  • Harmonizes the backoff implementations across the different gRPC handlers.
  • Uses FlagdOptions.getRetryBackoffMs() to initialize the backoff in all backoff scenarios. GrpcStreamConnector previously used a hardcoded value of 2 seconds.
  • Reconnects immediately on the first stream error in GrpcStreamConnector. This removes the backoff when a planned deadline is exceeded and the connector reconnects.
  • Unifies the standard max jitter of 250ms across all backoff use cases.

Fixes #1010

Notes

Unlike what #1010 proposes, error logs are not written only once the max retry delay is reached, but already at the second error in a row. Waiting for the max retry delay (120 seconds) with exponential backoff starting at 2 seconds would take 126 seconds until the first error becomes visible.
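The 126 seconds can be checked with a few lines of Java. This is only an illustrative sketch: the class and method names are made up for this note, and it assumes the delay doubles after every failed attempt and that jitter is ignored.

```java
/** Illustrative helper, not part of the flagd provider. */
final class TimeToMaxDelay {

    /**
     * Sums the doubling delays that elapse before the next delay
     * would reach or exceed the cap.
     */
    static long secondsBeforeCap(long initialSeconds, long maxSeconds) {
        if (initialSeconds <= 0) {
            throw new IllegalArgumentException("initialSeconds must be positive");
        }
        long elapsed = 0;
        for (long delay = initialSeconds; delay < maxSeconds; delay *= 2) {
            elapsed += delay; // 2 + 4 + 8 + 16 + 32 + 64 for (2, 120)
        }
        return elapsed;
    }

    public static void main(String[] args) {
        System.out.println(secondsBeforeCap(2, 120)); // prints 126
    }
}
```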

Instead, error logs are generated whenever an error queue payload is emitted. Only on the first error do we try to reconnect immediately, without any backoff (only the default jitter of at most 250ms) and without emitting an error payload. Starting with the second error in a row, we log an error and emit the error payload.
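The rule described above can be modelled roughly as follows. This is a simplified sketch, not the actual GrpcStreamConnector code: StreamErrorPolicy, onStreamError and onStreamRecovered are hypothetical names, the list stands in for logging plus the error queue, and resetting the counter after a successful reconnect is an assumption derived from "in a row".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the "silent first error" rule. */
final class StreamErrorPolicy {
    private int consecutiveErrors = 0;

    // Stands in for "log an error and emit an error payload".
    final List<String> emittedErrors = new ArrayList<>();

    /** Called on every stream error; returns true if an error payload was emitted. */
    boolean onStreamError(String message) {
        consecutiveErrors++;
        if (consecutiveErrors == 1) {
            // First error: reconnect immediately, no log entry, no payload.
            return false;
        }
        // Second and later errors in a row: log and emit.
        emittedErrors.add(message);
        return true;
    }

    /** Assumed reset point: the stream was re-established successfully. */
    void onStreamRecovered() {
        consecutiveErrors = 0;
    }
}
```

With this model, a single DEADLINE_EXCEEDED followed by a successful reconnect never reaches the log, while two failures in a row do.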

The initial backoff is now FlagdOptions.getRetryBackoffMs() in GrpcStreamConnector (new) and GrpcConnector (no change). For GrpcStreamConnector this means an initial backoff of 1 second (the default option) instead of 2 seconds.

I've also removed the special handling of DEADLINE_EXCEEDED errors, as the connector now tries to reconnect silently on any first error. This also solves DEADLINE_EXCEEDED issues related to Envoy, where a wrong gRPC status code is reported. See here

With the immediate first retry, the backoff times for GrpcStreamConnector are now:

  • 0s
  • 1s
  • 2s
  • 4s
  • 8s
  • ...
  • 120s
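The schedule above can be reproduced with a small sketch. BackoffSchedule and delaysMs are hypothetical names used only for illustration. The sketch assumes the delay doubles after each failed attempt, is capped at the max retry delay, and leaves out the random jitter of up to 250ms so the output stays deterministic.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; not the provider's backoff implementation. */
final class BackoffSchedule {

    /** Returns the wait before each of the first {@code attempts} reconnects, in ms. */
    static List<Long> delaysMs(long initialMs, long maxMs, int attempts) {
        List<Long> delays = new ArrayList<>();
        long next = initialMs;
        for (int i = 0; i < attempts; i++) {
            if (i == 0) {
                delays.add(0L); // first error: immediate reconnect
                continue;
            }
            delays.add(Math.min(next, maxMs));
            next = Math.min(next * 2, maxMs); // double, but never exceed the cap
        }
        return delays;
    }

    public static void main(String[] args) {
        // Values from the description: 1s initial backoff, 120s max retry delay.
        System.out.println(delaysMs(1_000, 120_000, 10));
        // prints [0, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 120000, 120000]
    }
}
```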

guidobrei, Oct 10 '24 19:10