
Feature request: S3 file access

Open stevekm opened this issue 5 years ago • 5 comments

Is S3 file access already supported? I could not find it mentioned in the documentation.

It was previously mentioned in this issue:

https://github.com/common-workflow-language/cwltool/issues/539

However, that one was closed after http access was implemented. As far as I can tell, http access is not sufficient when you need to supply an S3 access key and secret key to access the files.

As an example, Nextflow has this feature implemented, described here: https://www.nextflow.io/docs/latest/amazons3.html

so I was hoping to find an equivalent in cwltool

Thanks!

stevekm avatar Dec 10 '20 17:12 stevekm

S3 is not supported in cwltool. However, other CWL runners do support S3; check out toil-cwl-runner:

https://github.com/DataBiosphere/toil

However, if you or someone else is interested in adding S3 support to cwltool directly, it would be pretty easy. Here's where the file download happens:

https://github.com/common-workflow-language/cwltool/blob/main/cwltool/pathmapper.py#L142

https://github.com/common-workflow-language/cwltool/blob/ac60dc1df0c23e54ecee99bc0d989da410851d2e/cwltool/utils.py#L426

So you could import boto3 and add a downloadS3file function that gets called when it sees s3 URLs.

The http support uses the CacheControl library for local caching, so files are not re-downloaded for every run. You probably want something similar for s3.
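A minimal sketch of what such a function could look like, to make the suggestion concrete. This is not cwltool code: the names parse_s3_url and download_s3_file, the cache_dir parameter, and the cache layout are all made up for illustration, and how it would hook into pathmapper.py or utils.py is left open. It assumes boto3 is installed and that credentials are picked up through boto3's normal lookup (for example the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables or the shared credentials file). The cache is keyed on the URL only, so unlike the CacheControl-based http path it will not notice when the remote object changes.

```python
import hashlib
import os
from urllib.parse import urlsplit


def parse_s3_url(url):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parts = urlsplit(url)
    key = parts.path.lstrip("/")
    if parts.scheme != "s3" or not parts.netloc or not key or key.endswith("/"):
        raise ValueError(f"not a usable s3 file URL: {url!r}")
    return parts.netloc, key


def download_s3_file(url, cache_dir, client=None):
    """Download an s3:// URL into cache_dir and return the local path.

    If the file is already in the cache, nothing is downloaded.
    `client` is optional so a preconfigured (or fake) S3 client can be passed in.
    """
    bucket, key = parse_s3_url(url)

    # One sub-directory per URL, so equal basenames from different
    # buckets or prefixes do not collide.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    local_path = os.path.join(cache_dir, digest, os.path.basename(key))
    if os.path.exists(local_path):
        return local_path

    if client is None:
        # Imported lazily so boto3 stays an optional dependency (assumption).
        import boto3

        client = boto3.client("s3")

    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    # Download to a temporary name first so an interrupted transfer
    # never leaves a partial file that looks like a cache hit.
    tmp_path = local_path + ".part"
    client.download_file(bucket, key, tmp_path)
    os.replace(tmp_path, local_path)
    return local_path
```

Calling download_s3_file("s3://my-bucket/dir/reads.fastq", "/tmp/cwl-s3-cache") twice would fetch the object once and return the same local path both times. The bucket name and cache directory here are placeholders.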

tetron avatar Dec 10 '20 18:12 tetron

We talked about this before, but maybe the policy needs to be better documented: as cwltool is the reference implementation, it should not support transport methods that aren't required by the standard.

mr-c avatar Dec 10 '20 18:12 mr-c

Should the standard be updated to include support for the S3 protocol? It seems pretty widespread these days.

stevekm avatar Dec 11 '20 00:12 stevekm

@mr-c I feel you could say the same thing about container runtimes and the parallel executor and various other features of cwltool. It doesn't need to be part of the spec. We want it to be useful.

tetron avatar Dec 11 '20 01:12 tetron

To be clear, I'm also not suggesting anything beyond the bare minimum of what cwltool already does for plain http, which is to download files to the local FS right at the start.

tetron avatar Dec 11 '20 01:12 tetron