cli icon indicating copy to clipboard operation
cli copied to clipboard

[aws batch] Overlays (e.g. `--augur`) in uploaded workdir ZIP include too much

Open tsibley opened this issue 1 year ago • 6 comments

Elsewhere, @jameshadfield wrote:

Using --augur ~/github/nextstrain/augur (as per #419 (comment)) works but it takes 20min to upload (because an in-use augur repo size baloons to 675MB). #295 should help here, or excluding certain paths (our docs, .mypy_cache etc).

I replied with some potential solutions:

Perhaps runner.aws_batch.s3.upload_workdir needs to be extended to a) use .gitignore files for the overlay volumes (but not the workdir itself) and/or b) support a Nextstrain-specific ignore file (e.g. .nextstrain-ignore) which could be applied everywhere (as suggested in 6d465f0).

I looked into this a bit more, and we could use git check-ignore (conditional on git being available) to filter paths for upload. It could be invoked one-at-a-time on each path, which would be simplest to integrate but slow, or be fed a stream of paths on stdin while we read its stdout concurrently, which would be more complex to integrate but fast. This seemed promising and would be transparent, Just Working™ without a new ignores file or intervention from anyone.

In some ad-hoc testing, though, I realized a big caveat: for overlay purposes, we actually need some of the files ignored by git, e.g. Python packaging metadata in nextstrain_augur.egg-info/ for Augur (for the installed augur to locate its entrypoint) and dist/ for Auspice (for the transpiled code served by auspice view). The way I realized this was by routinely deleting all ignored files with git clean -fXd and then noticing (to my surprise) that an overlay no longer worked.

This makes git ignores entirely inappropriate for use as excludes in overlays, I think. And for basically the same reason we wouldn't apply git ignores to workdirs themselves: build time artifacts not suitable for version control are important for execution time.

That leaves us with the new feature of Nextstrain-specific ignore files, which nicely enough can be applied to both overlay sources and workdirs alike. Implementation will require a fair bit of new complexity, but I don't see any major algorithm questions or uncertainty. Biggest questions are about design/interface, perhaps:

  1. What's the filename we use? ~.nextstrain-ignore?~ .nextstrain-exclude
  2. Is there anywhere else this ignore file would be used that we should be taking into account?
  3. If not (2), perhaps we should make the filename (1) more specific to AWS Batch or Nextstrain CLI?

And of course, maybe there's another option/solution to consider.

tsibley avatar Mar 18 '25 19:03 tsibley

@jameshadfield I'm interested in your thoughts here, as you'd be a primary user of AWS Batch with overlays, it seems.

tsibley avatar Mar 18 '25 19:03 tsibley

Thanks for the summary Tom - here are some initial thoughts, although I don't think it's a pressing priority. It seems nicely self contained, although #88 may touch similar bits of code re: walking local files.

.nextstrain-ignore is nice, assuming it's the same syntax as .gitignore. We could commit this to the Augur repo, and I could see it being used routinely in an ad-hoc fashion for pathogen repos and work-dirs to stop uploading unnecessary files, although we have to think through the behaviour if the workflow (re)creates those files and we do/don't overwrite them on download.

jameshadfield avatar Mar 18 '25 20:03 jameshadfield

assuming it's the same syntax as .gitignore

It would be similar and overlapping syntax, but not entirely the same. We'd use the same glob pattern syntax support by the existing --exclude-from-upload option. That, now that I think about it, suggests a file name of .nextstrain-exclude too.

Ah, yes, thinking thru the "does this exclude from only upload or also download?" question is necessary. Perhaps the simplest/least surprising answer would be "both", i.e. an entry X in .nextstrain-exclude is equivalent to:

--exclude-from-upload='X' --exclude-from-download='X'

tsibley avatar Mar 18 '25 20:03 tsibley

Implementation will require a fair bit of new complexity

Is it possible that we could leverage some existing library for this functionality? I strongly suspect (but have not researched) that something like this probably already exists…

genehack avatar Mar 20 '25 17:03 genehack

Maybe? But we'd be left integrating the library into our filesystem traversal since we upload-as-we-walk and with our globbing syntax so syntax in files matches that used by options. At that point it feels likely all we've done is shift complexity around, not reduced it or hidden it away.

tsibley avatar Mar 20 '25 18:03 tsibley

But yes, totally should do a prior art search for things that might be helpful.

tsibley avatar Mar 20 '25 18:03 tsibley