
S3 throttling causes transport errors not service errors

nhibberd opened this issue 8 years ago • 4 comments

S3 throttles differently from other services: unless there is special code to handle 100 Continue, it tends to accept connections and then terminate them at some arbitrary point when it decides throttling is required. My understanding is that this happens because clients download largish objects from S3, so the service doesn't know that something needs to be throttled until requests are already in flight. This unfortunately results in very unclean termination, generally surfacing as NoResponseDataReceived, TlsException or FailedConnectionException2 depending on when the connection happened to terminate. Below is a retry handler that works for S3 retries (excluding #229; it had worked on our patched-up 0.3.* era code). The StatusCodeException case is possibly redundant because of other handling, but the other cases are essential to surviving heavy S3 workloads.

import Control.Lens ((&), (.~), (^.))
import Network.AWS.Env (Env, envRetryCheck)
import Network.HTTP.Client (HttpException (..))
import Network.HTTP.Types.Status (status500)

-- Retry up to i attempts, treating transport-level failures as
-- retryable in addition to whatever the Env's existing check allows.
retryAWS :: Int -> Env -> Env
retryAWS i e = e & envRetryCheck .~ err
  where
    err c _ | c >= i = False                       -- attempts exhausted
    err c v = case v of
      NoResponseDataReceived             -> True   -- connection dropped mid-response
      StatusCodeException s _ _          -> s == status500
      FailedConnectionException _ _      -> True   -- TCP-level failure
      FailedConnectionException2 _ _ _ _ -> True
      TlsException _                     -> True   -- TLS-level failure
      _ -> (e ^. envRetryCheck) c v                -- defer to the existing check

nhibberd avatar Sep 28 '15 11:09 nhibberd
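The fallback structure of the handler above (cap the attempt count, always retry transport errors, retry 500s, defer everything else to the wrapped check) can be illustrated with a self-contained sketch. This is not amazonka or http-client code: ToyException, RetryCheck and defaultCheck are hypothetical stand-ins for HttpException and the Env's existing retry predicate.

```haskell
-- Toy model of the fallback logic in the retryAWS handler above.
-- All names here are illustrative, not amazonka API.

data ToyException
  = NoResponseData     -- connection closed before any response arrived
  | StatusCode Int     -- an HTTP status was received
  | FailedConnection   -- TCP-level failure
  | Tls                -- TLS-level failure
  | Other
  deriving (Show, Eq)

-- attempt count -> exception -> should we retry?
type RetryCheck = Int -> ToyException -> Bool

-- Stand-in for the Env's pre-existing check: retry nothing.
defaultCheck :: RetryCheck
defaultCheck _ _ = False

-- Cap attempts at i; always retry transport errors; retry 500s;
-- defer anything unrecognised to the wrapped check.
retryCheck :: Int -> RetryCheck -> RetryCheck
retryCheck i _ c _ | c >= i = False
retryCheck _ fallback c v = case v of
  NoResponseData   -> True
  StatusCode s     -> s == 500
  FailedConnection -> True
  Tls              -> True
  _                -> fallback c v

main :: IO ()
main = do
  print (retryCheck 5 defaultCheck 1 NoResponseData)   -- True: transport error
  print (retryCheck 5 defaultCheck 5 NoResponseData)   -- False: attempts exhausted
  print (retryCheck 5 defaultCheck 1 (StatusCode 404)) -- False: client error
```

The key design point is the final wildcard case: the override widens the set of retryable failures without discarding whatever policy the environment already carried.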

Pondering this - I'll probably add a variant of the above directly to S3's src/ (non-generated code) and then allow overriding the default service configuration retry logic to use an existing function instead of rendering one.

brendanhay avatar Sep 29 '15 06:09 brendanhay

I would (also) suggest adding 100-continue to POST/PUT requests by default; http-client handles this automatically. It appears that, if 100-continue is not used, http-client has no way of knowing when to remove a connection from its pool that was closed on the remote side. Thus, it will reuse half-open connections, stream the request body into the socket buffer, and then hang indefinitely. For some reason, using 100-continue by default also increases throughput significantly, at least within AWS.

kim avatar Sep 29 '15 07:09 kim
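kim's reasoning can be modelled in miniature: with Expect: 100-continue, the client waits for an interim response before committing the body, so a connection the server has silently closed is detected up front rather than after the body has been streamed into a dead socket. The following is a toy sketch only; ConnState, SendResult and sendRequest are hypothetical names, not http-client API.

```haskell
-- Toy model of the half-open-connection failure mode described above.
-- All names here are illustrative.

data ConnState = Alive | RemoteClosed deriving (Show, Eq)

data SendResult
  = BodySent             -- body was written (hangs on a half-open conn)
  | ClosedDetectedEarly  -- closure noticed before the body was sent
  deriving (Show, Eq)

sendRequest :: Bool -> ConnState -> SendResult
sendRequest expectContinue conn
  -- Waiting for "100 Continue" surfaces the closed socket immediately.
  | expectContinue, RemoteClosed <- conn = ClosedDetectedEarly
  -- Without the handshake the body is written blindly; on a half-open
  -- connection this is where the client would hang.
  | otherwise = BodySent

main :: IO ()
main = do
  print (sendRequest True RemoteClosed)  -- ClosedDetectedEarly
  print (sendRequest False RemoteClosed) -- BodySent
```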

I've added the 100-continue header via #235. Revisiting the retries: since the Service configuration itself only deals with ServiceError, not HttpException, I'd need to make this logic the default for all service retries. Otherwise, the Retry type in core will need to be updated accordingly.

brendanhay avatar Oct 08 '15 06:10 brendanhay

Are there any reasons not to retry 500 by default?

domenkozar avatar Nov 13 '18 18:11 domenkozar