More resilient s3 file download

Open · markjschreiber opened this issue 2 years ago · 5 comments

Motivation

...

When multiple workflows run concurrently and attempt to access the same objects in S3, downloads occasionally get throttled or disconnected. This can also happen when too many download containers run concurrently on a single host and saturate the host's network bandwidth.

Approach

... I have added a retry-with-backoff loop that also verifies that the final length of the download matches the expected length (on very rare occasions I have seen incomplete downloads, probably due to a silent network problem).
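Roughly, the shape of that loop is something like the following (a simplified sketch, not the exact code in this PR; `BUCKET`, `KEY`, and `DEST` stand in for the task's actual inputs, and the aws CLI is assumed to be on PATH):

```bash
set -euo pipefail

# HeadObject is outside the retry loop on purpose: a 403/404 here fails fast.
expected_size=$(aws s3api head-object --bucket "$BUCKET" --key "$KEY" \
                  --query ContentLength --output text)

max_attempts=5
backoff=2
for attempt in $(seq 1 "$max_attempts"); do
    if aws s3 cp "s3://$BUCKET/$KEY" "$DEST"; then
        actual_size=$(stat -c %s "$DEST")   # GNU stat; byte count of the local copy
        if [ "$actual_size" -eq "$expected_size" ]; then
            exit 0                          # complete download verified
        fi
        echo "size mismatch: expected $expected_size, got $actual_size" >&2
    fi
    echo "attempt $attempt failed; sleeping ${backoff}s before retrying" >&2
    sleep "$backoff"
    backoff=$((backoff * 2))                # exponential backoff
done
echo "download failed after $max_attempts attempts" >&2
exit 1
```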

I also use aws sts get-caller-identity to determine whether the container has access to credentials. The get-caller-identity call is always allowed regardless of IAM permissions and exits with rc != 0 if the caller doesn't have a valid identity. It has two advantages: it uses the full credential chain to attempt to find credentials, and it verifies that any credentials found are still valid.
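For illustration, the probe itself looks something like this (how the result is used downstream is a sketch, not necessarily the PR's exact behavior):

```bash
# get-caller-identity needs no IAM permissions, but exits non-zero when the
# full credential chain yields no valid credentials.
if aws sts get-caller-identity > /dev/null 2>&1; then
    sign_args=()                      # valid credentials found: sign requests normally
else
    sign_args=(--no-sign-request)     # no credentials: only anonymous access is possible
fi
# e.g. aws s3 cp "${sign_args[@]}" "s3://$BUCKET/$KEY" "$DEST"
```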

Checklist

  • [ ] Add appropriate test(s) to the automatic suite
  • [x] Use make pretty to reformat the code with black
  • [x] Use make check to statically check the code using Pyre and Pylint
  • [x] Send PR from a dedicated branch without unrelated edits
  • [x] Ensure compatibility with this project's MIT license

markjschreiber avatar Dec 07 '21 20:12 markjschreiber

@markjschreiber thanks for this, a few comments:

  1. Can you confirm that trivial 404 or access-denied errors still "fail fast", i.e. won't enter the exponential backoff loop? (UX consideration for a user who fat-fingered the URI or misconfigured their IAM)
  2. There's this annoying case where an IAM role granting access to selected S3 objects implicitly denies access to everything else, including public objects. That is, you can be denied access to an object using your valid credentials, but then add --no-sign-request and get to it!
  3. I like the content length check, surprised aws s3 cp does not do this internally.
  4. double-checking that you're using the [scheduler] download_concurrency setting to limit the number of concurrent download tasks, since it does not take many to reach full throughput on the shared file system.

mlin avatar Dec 10 '21 04:12 mlin

  1. Yes, a 404 (or indeed any HeadObject error) will cause the HeadObject call that gets the file size to fail, which breaks out of the loop so it fails fast. I have tested this. A slight weakness is that the HeadObject call itself is not retried, although I have not yet seen it fail at scale, presumably because it is much lighter weight than GetObject.
  2. This is an interesting quirk. We could add --no-sign-request back as a fallback. It feels a bit like working around a poorly authored IAM policy though, and you might end up exposing objects that the policy author was actually trying (but failing) to protect. Let me know if you want me to add that back in.
  3. Me too. I'm not sure of the root of the problem; it might be an edge case where a check that normally happens is skipped under extreme conditions. Perhaps a multipart download where all parts arrive with the expected MD5s but are then not correctly assembled?
  4. This function should only ever spawn a single download attempt at a time. As long as the original attempt was limited by download_concurrency, this one should be too. How do you recommend verifying?

markjschreiber avatar Dec 10 '21 14:12 markjschreiber

Mark, sorry I was tied up for a few days:

Yes, the pitfall where we might have to fall back to --no-sign-request to access a public object is one miniwdl users have encountered before, so omitting it would be a regression unfortunately. Separately, I should add a regression test with some kind of mock s3 (although I wonder if mock IAM models this corner case!)

Re the download_concurrency setting, it's a scheduler setting akin to task_concurrency but specific to these S3 download operations. If you start a workflow with 1,000 input files, it limits how many miniwdl tries to localize at once. It's set to 10 in the template configuration file where you recently made a separate adjustment, so I assume that's effective, but it's worth double-checking.
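i.e. something like this in the cfg (exact template layout may differ from what's shown here):

```ini
[scheduler]
# limit on concurrently localized S3 inputs; 10 per the template config mentioned above
download_concurrency = 10
```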

Thanks!

mlin avatar Dec 17 '21 20:12 mlin

@markjschreiber I'm thinking that instead of writing all the aws s3 cp resiliency features in the shell+WDL+python code here, it'd make sense to capture them in a separate wrapper script/program that can be baked into the relevant docker image and invoked briefly here. The wrapper would probably be useful in many other projects, and we could write separate tests for it (and it wouldn't necessarily need to be shell). Rough sketch of the access-probe part after the list below.

  • HEAD object first to check access and detect object size
  • follow-up attempt with --no-sign-request if access was denied (in case the object is public)
  • check object size after transfer
  • retry w/ backoff in the event of a transfer error (but not in case of 403 or 404)
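For the access-probe part, something along these lines (illustrative only; the retry/backoff and size-check part would look like the loop already sketched earlier in this thread):

```bash
# Probe access with HeadObject, falling back to anonymous access if the signed
# request is denied (covers public objects behind a restrictive IAM policy).
# BUCKET/KEY are placeholders; assumes the aws CLI is available.
sign_args=()
if ! expected_size=$(aws s3api head-object --bucket "$BUCKET" --key "$KEY" \
                       --query ContentLength --output text 2>/dev/null); then
    # retry the probe unsigned; if this also fails, fail fast (403/404)
    sign_args=(--no-sign-request)
    expected_size=$(aws s3api head-object "${sign_args[@]}" --bucket "$BUCKET" --key "$KEY" \
                      --query ContentLength --output text)
fi
# ...then: aws s3 cp "${sign_args[@]}" ... with retry w/ backoff and a final size check
```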

WDYT?

mlin avatar Mar 29 '22 08:03 mlin

I agree, externalizing it from the Python code would be cleaner and more maintainable. That's something that should be done for all the engines AGC uses.

The flow you suggest makes sense. We'd need to check whether there are situations where permissions don't allow finding an object's size even though GetObject is allowed, and have contingencies for that case.

markjschreiber avatar Mar 29 '22 13:03 markjschreiber