pangeo-forge-recipes
Separate file transfer from recipe execution
Currently, pangeo-forge-recipes explicitly handles the downloading (called "caching") of files from an external storage service (e.g. downloading a big list of files over HTTP or FTP). This happens in the cache_input loop.
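For concreteness, here is roughly what the manual execution pattern looks like today, where caching is an explicit stage the executor runs before anything else. This is a sketch following the current docs, assuming `recipe` is an already-constructed recipe object; treat the method names as illustrative rather than a stable API.

```python
# Manual execution of a recipe: the cache_input loop is the stage that copies
# every remote source file into the recipe's input cache before any processing.
for input_key in recipe.iter_inputs():
    recipe.cache_input(input_key)   # download the HTTP/FTP source into the cache

recipe.prepare_target()             # initialize the target store
for chunk in recipe.iter_chunks():
    recipe.store_chunk(chunk)       # open cached inputs and write this chunk
recipe.finalize_target()            # consolidate metadata, etc.
```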
However, some data providers (e.g. NCAR; https://github.com/pangeo-forge/staged-recipes/issues/8#issuecomment-813779782) want us to transfer data using Globus. Globus works fundamentally differently from how Pangeo Forge is currently implemented. Right now, Pangeo Forge manages all downloads explicitly, opening HTTP / FTP / etc. connections to servers and downloading data directly.
Globus instead is essentially a Transfer-as-a-Service tool. We queue up a transfer and then Globus handles the actual movement of data using their system. Typically we would want to move 1000 netCDF files from NCAR glade to S3. If we rely on Globus, we could basically eliminate the cache_input loop and just wait for the Globus transfer to complete. However, we would still need a way to defer the recipe execution until after the transfer is complete. That would require some fancy CI logic.
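For reference, a minimal sketch of what "queue a transfer and wait for it" could look like with the globus_sdk Python client. The endpoint IDs, paths, and label are placeholders, the OAuth flow that builds the authorized TransferClient is omitted, and blocking in-process like this is exactly the part that would need to move out of the recipe.

```python
# Hypothetical helper: submit a Globus transfer for a batch of files and block
# until it finishes. Endpoint UUIDs and paths are placeholders.
import globus_sdk

def transfer_and_wait(tc, src_endpoint, dst_endpoint, paths, dst_prefix):
    tdata = globus_sdk.TransferData(
        tc, src_endpoint, dst_endpoint, label="pangeo-forge input staging"
    )
    for path in paths:
        # mirror each source file under dst_prefix on the destination endpoint
        tdata.add_item(path, dst_prefix + path.rsplit("/", 1)[-1])
    task_id = tc.submit_transfer(tdata)["task_id"]
    # Poll until Globus reports the task done; a real integration would defer
    # recipe execution instead of blocking here.
    while not tc.task_wait(task_id, timeout=300, polling_interval=30):
        pass
    return task_id
```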
More generally, there could be other reasons to separate these steps. For example, downloading over HTTP from a slow server will not necessarily benefit from parallelism; if we launch 100 simultaneous download requests, we may just end up clobbering the poor HTTP server. Instead, a pub / sub model might work better: we could push the files we want to download into a message queue, and a downloading service would consume them.
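A toy sketch of that pub / sub idea, with an in-process queue standing in for a real message broker and a deliberately small worker pool so the source server never sees 100 simultaneous requests. N_WORKERS and the cache bucket name are made up.

```python
# Toy pub/sub downloader: recipe code only publishes URLs; a separate consumer
# service fetches them into the input cache with a bounded number of workers.
import queue
import shutil
import threading

import fsspec

N_WORKERS = 4                 # deliberately small, to be gentle on the server
todo = queue.Queue()

def downloader(cache_root):
    while True:
        url = todo.get()
        if url is None:       # sentinel value: shut this worker down
            return
        target = f"{cache_root}/{url.rsplit('/', 1)[-1]}"
        with fsspec.open(url, "rb") as src, fsspec.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)

workers = [
    threading.Thread(target=downloader, args=("s3://my-cache-bucket",))
    for _ in range(N_WORKERS)
]
for w in workers:
    w.start()

# the "publish" side: the recipe (or CI) just enqueues what it needs cached
todo.put("https://example.com/data/file_0001.nc")
```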
Question for @pangeo-forge/dev-team: should we consider removing the cache_inputs step from the recipe itself and moving this to a more flexible, standalone component?
So both Google Cloud and AWS provide "file transfer services" to / from cloud storage:
- AWS has several options; not sure which is best
- https://aws.amazon.com/datasync
- https://aws.amazon.com/aws-transfer-family/
- https://aws.amazon.com/s3/transfer-acceleration/
- GCS Storage Transfer (only supports HTTP)
Seems like we could pretty easily swap these out.
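If we went that route, one way to keep the backends swappable is a thin interface that the recipe runner targets. Nothing like this exists in pangeo-forge-recipes today; the class and method names below are purely hypothetical.

```python
# Hypothetical interface for a standalone transfer component.
from abc import ABC, abstractmethod

class TransferService(ABC):
    """Moves a batch of source URLs into the recipe's input cache."""

    @abstractmethod
    def submit(self, source_urls, cache_target):
        """Start the transfer and return an opaque job/task id."""

    @abstractmethod
    def wait(self, job_id):
        """Block (or poll) until the transfer identified by job_id completes."""

# Concrete implementations could wrap Globus, AWS DataSync, GCS Storage
# Transfer, or the existing direct-download code; the recipe runner would pick
# one via configuration.
```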
> If we rely on Globus, we could basically eliminate the cache_input loop and just wait for the Globus transfer to complete. However, we would still need a way to defer the recipe execution until after the transfer is complete. That would require some fancy CI logic.
This sounds a bit over-complicated, perhaps because I don't understand Globus. Can we manually start the Globus transfer, and then just write the recipe as if its source location is S3 (or whatever the destination of the transfer is)? This does harm the reproducibility of the recipe a bit, as it relies on an intermediate S3 store that's ephemeral.
And then the recipe can choose to skip caching, since it's already in object storage.
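A hedged sketch of what that could look like, assuming the transfer has already landed the files in a staging bucket and that the recipe class exposes a cache_inputs flag that skips the caching stage and opens inputs directly (per the current docs). The bucket path and file names are placeholders.

```python
# Sketch: point the file pattern straight at the already-transferred copies
# and skip caching. Bucket path, file names, and flag behavior are assumptions.
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

input_urls = [
    f"s3://my-staging-bucket/ncar-transfer/file_{i:04d}.nc" for i in range(1000)
]
pattern = pattern_from_file_sequence(input_urls, concat_dim="time")

recipe = XarrayZarrRecipe(
    pattern,
    cache_inputs=False,  # inputs are already in object storage; open them directly
)
```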
I did some reading about the library parsl and its support for file transfer / staging. https://parsl.readthedocs.io/en/stable/userguide/data.html#staging-data-files
It seems to have a pretty flexible system for file staging, which includes both HTTP / FTP and Globus.
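From that docs page, the basic pattern is that inputs declared as parsl File objects are staged to the worker before the app body runs, so the app itself never deals with the transfer protocol. A minimal sketch with a placeholder URL and the local-threads config:

```python
# Minimal parsl file-staging example, following the pattern in the linked docs:
# an http(s):// File declared as an input is staged to a local path before the
# app executes.
import parsl
from parsl import python_app
from parsl.configs.local_threads import config
from parsl.data_provider.files import File

parsl.load(config)

@python_app
def inspect(inputs=[]):
    # by the time this runs, parsl has downloaded inputs[0] to a local path
    with open(inputs[0].filepath, "rb") as f:
        return len(f.read())

remote = File("https://example.com/data/file_0001.nc")  # placeholder URL
print(inspect(inputs=[remote]).result())
```

The same mechanism has FTP and Globus staging providers, per the docs, which is what makes it interesting for the NCAR case.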
In fact, looking through the parsl docs more broadly, it looks like a really useful tool for Pangeo Forge. We could probably implement a parsl executor fairly easily. This could be particularly useful to NCAR folks, since parsl plays very well with HPC. It also apparently supports cloud-based execution... so perhaps it's even an alternative to Prefect if we really get stuck.