quickwit
Retry upload on error.
Some object storage backends fail from time to time. In that case we should retry the upload instead of restarting the pipeline and losing all of the work already done.
This is low priority.
We should already be retrying a few times (3?). Is that not working, or does the transient storage issue persist for a duration longer than our retry delay?
The number of retries was set to a low value because, on the search side, we wanted some quick feedback when a storage issue occurred.
We could have a dedicated retry policy for PUT requests.
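To make the idea of a dedicated policy concrete, here is a minimal sketch of per-operation retry settings. All names (`RetryPolicy`, `StorageOp`, `policy_for`, `retry_with`) and all numeric values are hypothetical and chosen for illustration; this is not Quickwit's actual storage code. The sketch assumes a short policy for reads, so search gets quick feedback, and a more patient one for PUT requests.

```rust
use std::time::Duration;

/// Hypothetical retry settings; names and values are illustrative only.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct RetryPolicy {
    /// Total number of attempts, including the first one.
    pub max_attempts: u32,
    pub base_delay: Duration,
    pub max_delay: Duration,
}

/// Hypothetical storage operation kinds.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum StorageOp {
    Get,
    Put,
}

/// Reads fail fast; uploads retry longer before giving up (assumed values).
pub fn policy_for(op: StorageOp) -> RetryPolicy {
    match op {
        StorageOp::Get => RetryPolicy {
            max_attempts: 3,
            base_delay: Duration::from_millis(100),
            max_delay: Duration::from_secs(1),
        },
        StorageOp::Put => RetryPolicy {
            max_attempts: 5,
            base_delay: Duration::from_millis(250),
            max_delay: Duration::from_secs(10),
        },
    }
}

impl RetryPolicy {
    /// Delay before retry number `retry` (1-based): exponential backoff, capped at `max_delay`.
    pub fn delay_for(&self, retry: u32) -> Duration {
        let factor = 2u32.saturating_pow(retry.saturating_sub(1));
        self.base_delay.saturating_mul(factor).min(self.max_delay)
    }
}

/// Runs `op` until it succeeds, the error is not retryable, or attempts run out.
/// `sleep` is injected so callers can plug in a blocking or test-friendly sleep.
pub fn retry_with<T, E>(
    policy: RetryPolicy,
    mut op: impl FnMut() -> Result<T, E>,
    is_retryable: impl Fn(&E) -> bool,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt < policy.max_attempts && is_retryable(&err) => {
                sleep(policy.delay_for(attempt));
                attempt += 1;
            }
            Err(err) => return Err(err),
        }
    }
}
```

Keeping the policy as plain data selected by operation type would let the search path and the indexing path diverge without duplicating the retry loop.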
We already retry storage writes, up to 5 times. Some errors were not retried, but they are now since https://github.com/quickwit-oss/quickwit/pull/5384, so I think this ticket can be closed.
@trinity-1686a Do we retry on S3 internal errors?
We retry based on what the SDK defines as transient and throttling errors; the list is here: https://docs.rs/aws-runtime/1.4.3/src/aws_runtime/retries/classifiers.rs.html#18-36. It doesn't include InternalError, so we don't retry on that.
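As a rough illustration of code-based classification, the sketch below maps an error code to a retry decision and lets the caller opt in to extra codes such as `InternalError`. The type and function names are hypothetical, and the code lists are a small illustrative subset, not a copy of the SDK's lists; the file linked above is the authoritative source.

```rust
/// Hypothetical classification result; not the SDK's own type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RetryKind {
    Throttling,
    Transient,
    NotRetryable,
}

// Illustrative subsets only; see the linked SDK source for the real lists.
const THROTTLING_CODES: &[&str] = &["Throttling", "SlowDown"];
const TRANSIENT_CODES: &[&str] = &["RequestTimeout"];

/// Classifies an error code. `extra_transient` lets a caller treat additional
/// codes (for example "InternalError") as transient without touching the defaults.
pub fn classify(code: &str, extra_transient: &[&str]) -> RetryKind {
    if THROTTLING_CODES.contains(&code) {
        RetryKind::Throttling
    } else if TRANSIENT_CODES.contains(&code) || extra_transient.contains(&code) {
        RetryKind::Transient
    } else {
        RetryKind::NotRetryable
    }
}
```

With an empty `extra_transient` slice, `InternalError` is classified as not retryable, which mirrors the behavior described in this comment; passing `&["InternalError"]` would opt in to retrying it.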
We now retry uploads for all transient errors we could find. If more error conditions should be retried, that should be a separate ticket.