Feature request: S3 file access
Is S3 file access already supported? I could not find it mentioned in the documentation.
It was previously mentioned in this issue:
https://github.com/common-workflow-language/cwltool/issues/539
However, that one was closed after HTTP access was implemented. As far as I can tell, though, that is not sufficient when you need to supply an S3 access key and secret key to access the files.
As an example, Nextflow has this feature implemented, described here: https://www.nextflow.io/docs/latest/amazons3.html
So I was hoping to find an equivalent in cwltool.
Thanks!
S3 is not supported in cwltool. However, other CWL runners do support S3; check out toil-cwl-runner:
https://github.com/DataBiosphere/toil
However, if you or someone else is interested in adding S3 support to cwltool directly, it would be pretty easy. Here's where the file download happens:
https://github.com/common-workflow-language/cwltool/blob/main/cwltool/pathmapper.py#L142
https://github.com/common-workflow-language/cwltool/blob/ac60dc1df0c23e54ecee99bc0d989da410851d2e/cwltool/utils.py#L426
So you could do something like import boto3 and add a downloadS3file function that gets called when an s3:// URL is seen.
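A minimal sketch of what that function might look like (the name `download_s3_file` is hypothetical, and this assumes boto3's default credential chain, i.e. environment variables, `~/.aws/credentials`, or instance roles):

```python
from urllib.parse import urlparse

import boto3  # hypothetical new dependency, not currently used by cwltool


def download_s3_file(url: str, destination: str) -> None:
    """Download an s3://bucket/key URL to a local path.

    Credentials are resolved through boto3's default chain, which
    also covers the access key / secret key concern above.
    """
    parsed = urlparse(url)
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")
    s3 = boto3.client("s3")
    s3.download_file(bucket, key, destination)
```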
The HTTP support uses the CacheControl library for local caching so files are not re-downloaded on every run; you would probably want something similar for S3.
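CacheControl itself is HTTP-specific, so an S3 equivalent would need its own cache check. A hypothetical sketch, keying the cache on the object's ETag (which `head_object` returns without fetching the body):

```python
import os
from urllib.parse import urlparse

import boto3


def cached_s3_download(url: str, cache_dir: str) -> str:
    """Return a local path for an s3:// URL, re-downloading only when the ETag changes."""
    parsed = urlparse(url)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    s3 = boto3.client("s3")
    # Fetch just the object metadata to get its current ETag.
    etag = s3.head_object(Bucket=bucket, Key=key)["ETag"].strip('"')
    # Embed the ETag in the cached filename so a changed object forces a re-download.
    local = os.path.join(cache_dir, f"{etag}-{os.path.basename(key)}")
    if not os.path.exists(local):
        os.makedirs(cache_dir, exist_ok=True)
        s3.download_file(bucket, key, local)
    return local
```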
We talked about this before, but maybe the policy needs better documenting: as cwltool is the reference implementation, it should not support transport methods that aren't required by the standard.
Should the standard be updated to include support for the S3 protocol? It seems pretty widespread these days.
@mr-c I feel you could say the same thing about container runtimes, the parallel executor, and various other features of cwltool. It doesn't need to be part of the spec; we want cwltool to be useful.
To be clear, I'm also not suggesting anything beyond the bare minimum of what cwltool already does for plain HTTP, which is to download files to the local filesystem right at the start.