ncov-ingest
Allow ncov-ingest to download from non-AWS remote file paths
Context
To support running ingest on Terra, we need to support downloading existing Nextclade alignments and metadata from remote storage other than AWS's S3.
Description
We need to support defining a "source" bucket in the workflow configuration YAML that accepts at least S3 or GS URIs. This means renaming the configuration variable s3_src to something more generic and modifying all of the workflow logic that refers specifically to downloading from S3 (e.g., the "download_from_s3" script).
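One way to make the download logic scheme-agnostic is to dispatch on the URI scheme of the configured source. The sketch below is only an illustration of that idea, not the existing workflow code: the helper names (`parse_remote_url`, `build_download_command`), the `COPY_COMMANDS` table, and the bucket names are hypothetical, and it assumes the `aws` and `gsutil` command-line tools are installed and authenticated in the runtime image.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from URI scheme to the CLI copy command to run.
# Assumes the `aws` and `gsutil` tools are available in the image.
COPY_COMMANDS = {
    "s3": ["aws", "s3", "cp"],
    "gs": ["gsutil", "cp"],
}


def parse_remote_url(url):
    """Split a URL like 's3://bucket/path/file' into (scheme, bucket, key)."""
    parts = urlsplit(url)
    if parts.scheme not in COPY_COMMANDS:
        raise ValueError(f"Unsupported remote scheme in {url!r}")
    if not parts.netloc:
        raise ValueError(f"Missing bucket name in {url!r}")
    return parts.scheme, parts.netloc, parts.path.lstrip("/")


def build_download_command(src, filename, dest):
    """Build the argv list that copies `filename` under `src` to `dest`.

    `src` stands in for the generic source setting from the config YAML,
    e.g. 's3://example-bucket' or 'gs://example-bucket/some/prefix'.
    """
    scheme, bucket, prefix = parse_remote_url(src)
    # Join prefix and filename without producing doubled or leading slashes.
    key = "/".join(part for part in (prefix.rstrip("/"), filename) if part)
    return COPY_COMMANDS[scheme] + [f"{scheme}://{bucket}/{key}", dest]
```

The returned list could be passed to `subprocess.run(..., check=True)`. Keeping the scheme lookup in one table means supporting another storage backend would only need one new entry, though any extra steps the current script performs around the download would still have to be carried over.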
~As part of this work, we will also need to modify the Pipfile used to populate the ncov-ingest Docker image by adding the Python bindings for Google Cloud Storage. See the Dockerfile for the docker-base image.~ See @tsibley's comments below.
Examples
The modified ingest should continue to work with our production S3 buckets, but it should also work from GS buckets accessed through Terra.
> As part of this work, we will also need to modify the Pipfile used to populate the ncov-ingest Docker image by adding the Python bindings for Google Cloud Storage. See the Dockerfile for the docker-base image.
This should not be necessary, as the nextstrain/ncov-ingest image is based on the nextstrain/base image. I believe the only reason the GCS Python bindings aren't available in the latest nextstrain/ncov-ingest image is that it predates (~3 Feb) the addition of the bindings to the nextstrain/base image (~11 Feb). I triggered an image update which is running now, and that should take care of bringing in the GCS bindings.
Works in the latest image now:
$ docker image ls --digests nextstrain/ncov-ingest
REPOSITORY               TAG      DIGEST                                                                    IMAGE ID       CREATED          SIZE
nextstrain/ncov-ingest   latest   sha256:2bb78caa7dfc38703724a5a36fa280744256d53f18c18c6e371a6c4048c77b65   d150b94d8afd   20 minutes ago   2.36GB
nextstrain/ncov-ingest   <none>   sha256:63bd513524c1e71eb1ce60c1fd4ac90d21ba443a43865c60062b96bf433e7536   d2f7e3fef415   3 months ago     2.4GB
$ docker run --rm -it nextstrain/ncov-ingest@sha256:2bb78caa7dfc38703724a5a36fa280744256d53f18c18c6e371a6c4048c77b65 python3 -c 'from google.cloud import storage'
# 👍 no error
$ docker run --rm -it nextstrain/ncov-ingest@sha256:63bd513524c1e71eb1ce60c1fd4ac90d21ba443a43865c60062b96bf433e7536 python3 -c 'from google.cloud import storage'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'google'
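If the workflow should fail early with a clearer message when it runs in an older image, the same import check could be done from Python before any GS download is attempted. This is a minimal sketch under that assumption; the function name `module_available` is hypothetical and not part of ncov-ingest.

```python
import importlib


def module_available(module_name="google.cloud.storage"):
    """Return True if `module_name` can be imported in this environment."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # top-level 'google' package is caught here as well.
        return False
    return True
```

A caller could check `module_available()` once at startup and raise an error that suggests updating the image, rather than surfacing a bare traceback partway through a run.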