miniwdl
More resilient S3 file download
Motivation
...
When multiple workflows are running concurrently and attempting to access the same objects from S3, you will occasionally get throttled or disconnected. This can also happen when too many download containers run concurrently on a single host and saturate the host's network bandwidth.
Approach
... I have added a retry-with-backoff loop that also verifies the final length of the download against the expected length (on very rare occasions I have seen incomplete downloads, probably due to a silent network problem).
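The retry-with-backoff loop plus final length check described above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: `do_download` is a hypothetical callable standing in for one `aws s3 cp` invocation, and `expected_size` would come from a prior `HeadObject` call.

```python
import time

def download_with_retries(do_download, expected_size, max_attempts=5, base_delay=2.0):
    """Retry a download with exponential backoff, verifying the final size.

    do_download() performs one transfer attempt and returns the number of
    bytes written; expected_size is the object size reported by HeadObject.
    (Hypothetical helper names; sketch of the approach, not the PR's code.)
    """
    for attempt in range(1, max_attempts + 1):
        try:
            actual_size = do_download()
            # catch silently truncated transfers
            if actual_size != expected_size:
                raise IOError(f"incomplete download: {actual_size} != {expected_size}")
            return actual_size
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # 2s, 4s, 8s, ...
```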
I also use `aws sts get-caller-identity` to determine whether the container has access to credentials. The `get-caller-identity` call is always allowed regardless of IAM permissions and will exit with rc != 0 if the caller doesn't have a valid identity. It has two advantages: it uses the full credential chain to attempt to find credentials, and it verifies that any credentials found are still valid.
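The credential probe described above amounts to checking the exit code of `aws sts get-caller-identity`; a minimal sketch (the `run` parameter is injectable purely so the function can be tested without the AWS CLI — an assumption of this sketch, not part of the PR):

```python
import subprocess

def have_aws_credentials(run=subprocess.run):
    """Return True if the AWS credential chain yields a valid identity.

    `aws sts get-caller-identity` succeeds for any valid principal regardless
    of IAM permissions, so a nonzero exit code means no usable credentials.
    """
    try:
        proc = run(
            ["aws", "sts", "get-caller-identity"],
            capture_output=True,
            timeout=30,
        )
        return proc.returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        # aws CLI missing or hung: treat as "no credentials"
        return False
```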
Checklist
@markjschreiber thanks for this, a few comments:
- Can you confirm that trivial 404 or access-denied errors still "fail fast", i.e. won't enter the exponential backoff loop? (UX consideration for a user who fat-fingered the URI or misconfigured their IAM.)
- There's this annoying case where an IAM role granting access to selected S3 objects implicitly denies access to everything else, including public objects. That is, you can be denied access to an object using your valid credentials, but then add `--no-sign-request` and get to it!
- I like the content-length check; surprised `aws s3 cp` does not do this internally.
- Double-checking that you're using the `[scheduler] download_concurrency` setting to limit the number of concurrent download tasks, since it does not take many to reach full throughput on the shared file system.
- Yes, 404 errors (or indeed any `HeadObject` error) will cause the `HeadObject` call which gets the file size to fail, which breaks out of the loop, so it fails fast. I have tested this. There is a slight weakness in that the `HeadObject` call is not retried, although I have not yet seen that fail at scale, presumably because it is much lighter weight than `GetObject`.
- This is an interesting quirk. We could add back `--no-sign-request` as a fallback. It feels a bit like working around a poorly authored IAM policy, though, and you might end up exposing objects that the policy author was actually trying to protect (but incorrectly). Let me know if you want me to add that back in.
- Me too. I'm not sure of the root of the problem; it might be an edge case where a check that normally happens doesn't happen under extreme conditions. Perhaps a multipart download where all parts are downloaded with the expected md5 but then not correctly assembled?
- This function should only ever spawn a single download attempt at a time. As long as the original attempt was limited by `download_concurrency`, this one should be the same. How do you recommend verifying?
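The fail-fast behavior described in the first point — any `HeadObject` error aborts before the retry loop is ever entered — can be sketched like this (`head_object` is a hypothetical callable standing in for `aws s3api head-object`; this mirrors the described control flow, not the PR's code):

```python
class FailFast(Exception):
    """Error that should abort immediately, never entering the retry loop."""

def get_object_size(head_object, key):
    """Probe the object once via HEAD to learn its size.

    Any HeadObject failure (404, access denied, etc.) is wrapped in FailFast,
    so the caller skips the backoff loop entirely. Note the probe itself is
    not retried -- the slight weakness acknowledged above.
    """
    try:
        return head_object(key)["ContentLength"]
    except Exception as exc:
        raise FailFast(f"HeadObject failed for {key}: {exc}") from exc
```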
Mark, sorry I was tied up for a few days:
Yes, the pitfall where we might have to fall back to `--no-sign-request` to access a public object is one miniwdl users have encountered before, so omitting it would be a regression unfortunately. Separately, I should add a regression test with some kind of mock S3 (although I wonder if mock IAM models this corner case!)
Re the `download_concurrency` setting: it's a scheduler setting akin to `task_concurrency`, but specific to these S3 download operations. If you start a workflow with 1,000 input files, it limits how many it tries to localize at once. It's set to 10 in the template configuration file where you recently made a separate adjustment, so I assume that's effective, but it's worth double-checking.
Thanks!
@markjschreiber I'm thinking that instead of writing all the `aws s3 cp` resiliency features in the shell+WDL+python code here, it'd make sense to capture them in a separate wrapper script/program that can be baked into the relevant docker image and invoked briefly here. The wrapper would probably be useful in many other projects, and we could write separate tests for it (and it wouldn't necessarily need to be shell).
- HEAD object first to check access and detect object size
- follow-up attempt with `--no-sign-request` if access was denied (in case the object is public)
- check object size after transfer
- retry w/ backoff in the event of a transfer error (but not in case of 403 or 404)
WDYT?
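The four steps above can be sketched as pure control flow. All the callables here are hypothetical stand-ins (`head` for `aws s3api head-object`, possibly unsigned; `copy` for `aws s3 cp`; `local_size` for a stat of the destination file), injected so the skeleton is testable without the AWS CLI — a sketch of the proposed wrapper, not its implementation:

```python
import time

NO_RETRY = ("403", "404")  # assumed fail-fast error codes

def resilient_download(head, copy, local_size, uri, dest,
                       max_attempts=4, base_delay=2.0):
    """Sketch of the proposed wrapper flow:
    1. HEAD the object to check access and learn its size
    2. if access was denied, retry HEAD unsigned (public object?)
    3. transfer with retry/backoff on transfer errors only
    4. verify the bytes on disk match the HEAD size
    Callables may raise exceptions carrying a .code attribute ("403"/"404").
    """
    signed = True
    try:
        size = head(uri, signed=True)                    # step 1
    except Exception as exc:
        if getattr(exc, "code", None) == "403":
            signed = False
            size = head(uri, signed=False)               # step 2
        else:
            raise                                        # fail fast (e.g. 404)
    for attempt in range(1, max_attempts + 1):
        try:
            copy(uri, dest, signed=signed)               # step 3
            if local_size(dest) != size:                 # step 4
                raise IOError("size mismatch after transfer")
            return size
        except Exception as exc:
            if getattr(exc, "code", None) in NO_RETRY or attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # backoff
```

Keeping the flow free of subprocess plumbing like this would also make the separate tests you mention straightforward to write.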
I agree, externalizing it from the Python code would be cleaner and more maintainable — something that should be done for all the engines AGC uses.
The flow you suggest makes sense. We'd need to check whether there are situations where permissions don't allow finding an object's size even if you are allowed to `GetObject`, and have contingencies.