
More robust S3 response code handling

Open · port8080 opened this issue 1 year ago · 1 comment

First of all: thanks very much for your work. I have been using rudolfs and have managed to upload ~100GB over ~20K files.

While I was doing that, I started noticing errors that only went away after I ran git config --global lfs.concurrenttransfers 1.

This made me realize that you probably need to add more logic to work around Rusoto's error handling. You already mention this in your comments:

                // Rusoto really sucks at correctly reporting errors.
                // Lets work around that here.

What I suspect is happening is that S3 is returning a 503 SlowDown, which is getting lumped into RusotoError::Unknown(_), and I think you are handling it like every other transient error.

In our own code, we have done something like this:

    use http::StatusCode;

    /// Roughly inferred from what makes sense in
    /// https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html#ErrorCodeList
    pub fn s3_status_code_is_permanent(s: StatusCode) -> bool {
        s == StatusCode::BAD_REQUEST        // 400
            || s == StatusCode::UNAUTHORIZED // 401
            || s == StatusCode::FORBIDDEN    // 403
            || s == StatusCode::NOT_FOUND    // 404
    }

and then

    .map_err(|e| match e {
        RusotoError::Unknown(ref r) => {
            if r.status == StatusCode::NOT_FOUND {
                Error::NotFound
            } else {
                error!("S3 Failure on key {}, Err {:?}", key, e);
                e.into()
            }
        }
        _ => {
            error!("S3 Failure on key {}, Err {:?}", key, e);
            e.into()
        }
    })
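
For the 503 case specifically, the retryable side of that classification could look roughly like the sketch below. The helper name is made up for illustration; it assumes rusoto_core::RusotoError and the http crate's StatusCode, and it inspects the raw response that Rusoto stuffs into Unknown(_).

    use http::StatusCode;
    use rusoto_core::RusotoError;

    /// Illustrative helper (not from rudolfs): returns true if the error
    /// looks transient and is worth retrying.
    pub fn s3_error_is_transient<E>(err: &RusotoError<E>) -> bool {
        match err {
            // Rusoto reports unexpected responses, including 503 SlowDown,
            // as `Unknown`, so the status code has to be inspected manually.
            RusotoError::Unknown(r) => {
                r.status == StatusCode::SERVICE_UNAVAILABLE         // 503 SlowDown
                    || r.status == StatusCode::INTERNAL_SERVER_ERROR // 500 InternalError
                    || r.status == StatusCode::TOO_MANY_REQUESTS     // 429
                    // The XML body names the S3 error code, e.g. "SlowDown".
                    || std::str::from_utf8(&r.body)
                        .map(|body| body.contains("SlowDown"))
                        .unwrap_or(false)
            }
            // Connection-level failures are usually transient as well.
            RusotoError::HttpDispatch(_) => true,
            _ => false,
        }
    }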

port8080 · Jan 08 '23 19:01

> I started noticing errors

Can you provide the error message? Did it happen during upload or download? Is it an error from the Git LFS client, or is it from Rudolfs?

Even with these errors, does the operation still complete successfully?

> What I suspect is happening is that S3 is returning a 503 SlowDown, which is getting lumped into RusotoError::Unknown(_), and I think you are handling it like every other transient error.

All upload requests to S3 are retried with exponential backoff, so if S3 returns a 503 error, the request should get retried automatically.
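
The upload path's behavior as described boils down to roughly this pattern. This is just an illustrative sketch, not rudolfs' actual code; retry_with_backoff, is_permanent, and the attempt/delay numbers are all made up for the example.

    use std::time::Duration;

    /// Illustrative sketch: retry a fallible async operation with
    /// exponential backoff, giving up after a fixed number of attempts
    /// or as soon as the error is known to be permanent.
    async fn retry_with_backoff<T, E, F, Fut>(
        mut op: F,
        is_permanent: impl Fn(&E) -> bool,
        max_attempts: u32,
    ) -> Result<T, E>
    where
        F: FnMut() -> Fut,
        Fut: std::future::Future<Output = Result<T, E>>,
    {
        // Assumes max_attempts >= 1.
        let mut delay = Duration::from_millis(100);
        for attempt in 1..=max_attempts {
            match op().await {
                Ok(value) => return Ok(value),
                Err(e) if attempt == max_attempts || is_permanent(&e) => return Err(e),
                Err(_) => {
                    // Transient error: wait and try again, doubling the delay.
                    tokio::time::sleep(delay).await;
                    delay *= 2;
                }
            }
        }
        unreachable!("the loop always returns by the final attempt")
    }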

After looking at this code again, I do see that downloads are not retried, so this could be what you're seeing. I think the Git LFS client should automatically retry the download if Rudolfs fails to download because of a transient S3 error, but I'm not 100% sure.

If Rudolfs is unnecessarily retrying in the event of HTTP 400 errors (client errors), then it'll just take a little longer to fail, but the correctness is still there. Still, that should be fixed -- it's always good to fail faster.
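
Concretely, failing fast would just mean feeding a 4xx check (such as the s3_status_code_is_permanent helper above) into whatever retry wrapper the upload path uses, so client errors return on the first attempt instead of being retried to exhaustion. Everything named below (upload_object, client, bucket, key, body) is a hypothetical stand-in, reusing the sketches above:

    // Hypothetical wiring: a 4xx status short-circuits the retry loop,
    // while 5xx/unknown errors keep being retried with backoff.
    let result = retry_with_backoff(
        || upload_object(&client, &bucket, &key, body.clone()),
        |err| match err {
            // A 4xx response from S3 will not succeed on retry.
            RusotoError::Unknown(r) => s3_status_code_is_permanent(r.status),
            _ => false,
        },
        5,
    )
    .await;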

jasonwhite · Jan 10 '23 02:01