
Improve and clarify error retry logic

Open absorbb opened this issue 2 years ago • 1 comments

Problem

  • Currently Jitsu retries errors indefinitely. That makes little sense, because many kinds of errors cannot be resolved by retrying.
  • For streaming storages, errors cause the Redis queue to grow.
  • The retry and fallback logic is unclear and undocumented.

Solution

  • Introduce a server.error_retry_period_hours configuration parameter that serves as the default for all destinations (streaming and batch). Default value: 24 hours
  • Introduce a DestinationConfig error_retry_period_hours parameter that overrides the default value at the destination level.

uploader.go (unify fallback logic):

  • all errors (parsingErrors, failedEvents, resultPerTable.result.Err) must go to Fallback only after error_retry_period_hours passes. (It seems that currently, for parsingErrors and failedEvents, we flood the fallback logs with copies of the same events on each uploader run.)
  • after error_retry_period_hours passes, Jitsu needs to archive the incoming file and clean up its status

streaming.go:

  • don't use the IsConnectionError check; retry all errors
  • instead of the hardcoded 20 sec delay, introduce a server.streaming_retry_delay_minutes parameter. Default: 1
  • after server.error_retry_period_hours passes, stop retrying and send the failed events to Fallback.
  • The current fallback logic is hidden in abstract.go AccountResult and must be removed from there.

documentation:

  • write an Error Handling and Retries documentation page that describes this logic and the configuration parameters

absorbb, Jul 27 '22 08:07

No changes in uploader.go yet

Current fallback logic is hidden in abstract.go AccountResult and must be removed from there.

Not addressed yet.

absorbb, Sep 05 '22 07:09