[batch] exclude directories for upload
Context
Deploying a build to AWS Batch currently zips up (almost) the entire current directory and uploads it to S3. Certain assets are not uploaded, e.g. .git/, but these are not user-definable. I commonly have lots of large files which are unnecessary for the current build, and it would be quicker and lower overhead to skip them for the purposes of the desired build. As examples: ./data (sometimes >100 GB), big JSONs in ./auspice (can take more than an hour to zip), and a huge number of intermediate files in ./results.
This causes pain during job deployment (both time to zip up, and time to upload the zip file) as well as during download (as the zip file is bigger than necessary).
My current workflow often involves creating a "temporary" directory which is a copy of the current one without those directories, or moving the directories to a temporary place while the job is deployed. Both are a pain and prone to messing up badly!
Description
Add a --exclude-from-upload flag, which I'd commonly use like so: --exclude-from-upload auspice results data
I'm not sure how this would work when downloading if you had a previous auspice/ directory which wasn't part of the uploaded assets, and then downloaded an (AWS Batch-generated) auspice/ directory. I'm not that familiar with the logic for working out which files to download (I typically request a subset of files to be downloaded).
I imagine we could reuse a lot of the code behind the --download functionality here:
```python
if supplied_exclude_patterns:
    force_exclude = glob_matcher(supplied_exclude_patterns)
else:
    force_exclude = lambda path: False

hardcoded_excluded = path_matcher([
    ".git/", ...
])

excluded = lambda path: hardcoded_excluded(path) or force_exclude(path)
```
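For illustration, here's a minimal self-contained sketch of that idea; `glob_matcher` and `path_matcher` below are fnmatch-based stand-ins written for this example, not necessarily the CLI's existing helpers:

```python
from fnmatch import fnmatch
from pathlib import PurePath
from typing import Callable, Iterable


def glob_matcher(patterns: Iterable[str]) -> Callable[[str], bool]:
    """Return a predicate that's True when a path matches any glob pattern."""
    patterns = list(patterns)

    def matches(path: str) -> bool:
        parts = PurePath(path).parts
        # Match the whole path and each path component, so a bare pattern
        # like "results" also excludes everything under results/.
        return any(fnmatch(path, pattern) or any(fnmatch(part, pattern) for part in parts)
                   for pattern in patterns)

    return matches


def path_matcher(prefixes: Iterable[str]) -> Callable[[str], bool]:
    """Return a predicate that's True when a path starts with any prefix."""
    prefixes = tuple(prefixes)
    return lambda path: path.startswith(prefixes)


# Combine the hardcoded exclusions with user-supplied patterns,
# e.g. from --exclude-from-upload auspice results data:
supplied_exclude_patterns = ["auspice", "results", "data"]

if supplied_exclude_patterns:
    force_exclude = glob_matcher(supplied_exclude_patterns)
else:
    force_exclude = lambda path: False

hardcoded_excluded = path_matcher([".git/"])
excluded = lambda path: hardcoded_excluded(path) or force_exclude(path)

assert excluded(".git/config")
assert excluded("results/aligned.fasta")
assert not excluded("Snakefile")
```

Matching individual path components means a bare pattern like `results` would exclude everything nested under results/ without the user having to write `results/**`.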
Edit: I misunderstood what the issue was about, so the struck out stuff below isn't applicable
~~I agree that we don't need to update things that are just downloaded - like data, but results are often useful for debugging or quickly rerunning a rule locally without having to rerun time-intensive steps.~~
~~Also, we may reduce storage by tar.zst compressing instead of using standard zip. Since we have lots of sequences zstd compression would work extremely well.~~
> but results are often useful for debugging or quickly rerunning a rule locally without having to rerun time-intensive steps
Yup, I agree, but I think we're thinking of slightly different things. This is a request to avoid uploading certain artefacts via an opt-in process -- e.g. in my case ./results from previous runs which aren't relevant to the current job I want to deploy.
I think the confusion may be that @jameshadfield is talking about the upload from the local computer → S3 before the Batch job is submitted and @corneliusroemer is talking about the upload from the remote Batch job → S3 at workflow completion.
Using --aws-batch for the first time from a local directory (rather than in CI), I noticed the exact same thing: previous results and data get zipped, which takes a long time (~10 min in my case). So definitely +1 for this feature request.
A good standard option could be to use .gitignore to determine what to ignore. That usually comprises input and output data that gets pulled/produced by other workflows. This wouldn't allow much customization, so --exclude-from-upload may still be needed as well.
The intention is for previous results and input data to be included by default so that the AWS Batch runner works the same way as the other runners, preserving the consistent interface of nextstrain build, enabling use of partial remote builds, etc. Using .gitignore would change that behaviour, though, as the results and input files used or produced by the workflow are almost always excluded from version control. I think it's important for nextstrain build to have consistent behaviour.
I do think that there should be a way to specify ignore patterns, though, probably both via a command-line option and via an ignores file which can live with the workflow.
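As a rough sketch of how an ignores file might feed into the same exclusion predicate as the command-line option, assuming a hypothetical filename and a .gitignore-like one-pattern-per-line format (neither is decided):

```python
from pathlib import Path


def read_ignore_patterns(workflow_dir, filename=".nextstrain-upload-ignore"):
    """Read one glob pattern per line, skipping blank lines and # comments.

    The filename here is hypothetical; the real name and format would need deciding.
    """
    ignore_file = Path(workflow_dir) / filename
    if not ignore_file.is_file():
        return []
    return [
        line.strip()
        for line in ignore_file.read_text().splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]


# Patterns from the command line and from the ignores file would be combined
# before being handed to the same exclusion logic sketched above.
cli_patterns = ["auspice", "results"]      # e.g. from --exclude-from-upload
file_patterns = read_ignore_patterns(".")  # e.g. an ignores file shipped with the workflow
supplied_exclude_patterns = cli_patterns + file_patterns
```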
More generally, I'd like to separate out the workflow as program (rules, code, etc.) from the workflow as state (config, inputs, outputs, etc.). This comes out of several other goals, but would apply to this issue too. For example, instead of relying on the implicit state combined with the code like now, you'd explicitly pass an empty starting state separate from the workflow program. (Which is effectively what James is doing in his workaround described above.)
We just got a +1 for this functionality from @lmoncla, since she often has large subdirectories in her top-level flu build directories that she doesn't want to upload during AWS Batch jobs, especially when she has a poor internet connection.
+1 from @lmoncla and @trvrb in a Slack thread, along with some feature design discussion.
Working prototype: https://github.com/nextstrain/cli/commit/ab8b6d1f89506b005b27c6fddb0520936846e7b5
That'll be the basis for a PR sometime this week (when exactly depends on meetings and other work).