
Retry upload on error.

Open fulmicoton opened this issue 1 year ago • 5 comments

Some object storage can fail from time to time. In that case we should retry the upload instead of restarting the pipeline and losing all of the work done.

fulmicoton avatar Mar 25 '24 06:03 fulmicoton

This is low priority.

fulmicoton avatar Mar 25 '24 06:03 fulmicoton

We should already be retrying a few times (3?). Is that not working, or does the transient storage issue persist for a duration longer than our retry delay?

The number of retries was set to a low value because, on the search side, we wanted some quick feedback when a storage issue occurred.

We could have a dedicated retry policy for PUT requests.

guilload avatar Mar 25 '24 14:03 guilload
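To illustrate the dedicated PUT retry policy suggested above, here is a minimal sketch in Rust. It is not Quickwit's actual code: `StorageOp`, `RetryPolicy`, `retry_with_policy`, and all attempt counts and delays are hypothetical names and values chosen for illustration. The idea is that reads keep a small retry budget so search fails fast, while writes get a larger budget with capped exponential backoff.

```rust
use std::time::Duration;

/// Hypothetical operation kinds (not Quickwit's real types).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum StorageOp {
    Get,
    Put,
}

/// Hypothetical retry policy: attempt budget plus capped exponential backoff.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct RetryPolicy {
    pub max_attempts: u32,
    pub base_delay: Duration,
    pub max_delay: Duration,
}

impl RetryPolicy {
    /// Picks a policy per operation. All numbers are illustrative assumptions.
    pub fn for_op(op: StorageOp) -> Self {
        match op {
            // Reads: few attempts, so the search side gets quick feedback.
            StorageOp::Get => RetryPolicy {
                max_attempts: 3,
                base_delay: Duration::from_millis(100),
                max_delay: Duration::from_secs(1),
            },
            // Writes: more patience, since a failed upload discards indexing work.
            StorageOp::Put => RetryPolicy {
                max_attempts: 8,
                base_delay: Duration::from_millis(250),
                max_delay: Duration::from_secs(20),
            },
        }
    }

    /// Delay before the `retry`-th retry (1-based): base * 2^(retry - 1), capped.
    pub fn delay(&self, retry: u32) -> Duration {
        let factor = 2u32.saturating_pow(retry.saturating_sub(1));
        self.base_delay.saturating_mul(factor).min(self.max_delay)
    }
}

/// Runs `op` until it succeeds, the error is not retryable, or the
/// attempt budget is exhausted. `sleep` is injected so callers can pass
/// `std::thread::sleep` in production and a recorder in tests.
pub fn retry_with_policy<T, E>(
    policy: RetryPolicy,
    mut op: impl FnMut() -> Result<T, E>,
    is_retryable: impl Fn(&E) -> bool,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt < policy.max_attempts && is_retryable(&err) => {
                sleep(policy.delay(attempt));
                attempt += 1;
            }
            Err(err) => return Err(err),
        }
    }
}
```

An upload path could then call `retry_with_policy(RetryPolicy::for_op(StorageOp::Put), upload, is_transient, std::thread::sleep)`, where `upload` and `is_transient` stand in for whatever the storage layer provides.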

we already have retries on the storage for writes, up to 5 times. Some errors were not retried, but they are now, since https://github.com/quickwit-oss/quickwit/pull/5384, so I think this ticket can be closed

trinity-1686a avatar Sep 04 '24 17:09 trinity-1686a

@trinity-1686a Do we retry on S3 internal errors?

fulmicoton avatar Oct 21 '24 01:10 fulmicoton

we retry based on what the SDK defines as transient and throttling errors (list here: https://docs.rs/aws-runtime/1.4.3/src/aws_runtime/retries/classifiers.rs.html#18-36 ). That list doesn't include `InternalError`, so we don't retry on that

trinity-1686a avatar Oct 21 '24 07:10 trinity-1686a
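The classification described above, deciding whether to retry from the error code, can be sketched as follows. This is a standalone illustration in Rust and not the SDK's or Quickwit's implementation: `RetryClass`, `classify`, and the `extra_transient` parameter are hypothetical, and the two code lists are short illustrative samples rather than a copy of the SDK's list (see the link above for the authoritative one). The `extra_transient` parameter shows how a caller could opt in to retrying additional codes such as `InternalError`.

```rust
/// Hypothetical classification result for an error code.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RetryClass {
    Throttling,
    Transient,
    NotRetryable,
}

// Illustrative samples only; the real lists live in the SDK source linked above.
const THROTTLING_CODES: &[&str] = &["Throttling", "SlowDown"];
const TRANSIENT_CODES: &[&str] = &["RequestTimeout"];

/// Classifies an error code. Codes in `extra_transient` are treated as
/// transient in addition to the built-in sample list.
pub fn classify(code: &str, extra_transient: &[&str]) -> RetryClass {
    if THROTTLING_CODES.contains(&code) {
        RetryClass::Throttling
    } else if TRANSIENT_CODES.contains(&code) || extra_transient.contains(&code) {
        RetryClass::Transient
    } else {
        RetryClass::NotRetryable
    }
}

/// Convenience helper: anything that is not `NotRetryable` should be retried.
pub fn should_retry(code: &str, extra_transient: &[&str]) -> bool {
    classify(code, extra_transient) != RetryClass::NotRetryable
}
```

With an empty `extra_transient`, `InternalError` falls through to `NotRetryable`, which matches the behavior described in the comment above; passing `&["InternalError"]` would make it retryable.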

we do retry uploads now for all transient errors we could find. If more error conditions should be retried, that should be a separate ticket

trinity-1686a avatar Nov 05 '24 10:11 trinity-1686a