IDEA: upload remote files

Open agy-why opened this issue 1 year ago • 5 comments

Dear developers,

I have a question regarding the reana-client upload feature.

Is it already possible or is it planned to execute something like:

reana-client upload s3://.../my-s3-data/

If so, which services do you currently support? scp, ftp, s3, google cloud, webdav, ...

I thank you in advance,

Yori

agy-why avatar Aug 29 '22 11:08 agy-why

I realized this may not be the proper place to ask. Should I open this issue in reana-client instead?

agy-why avatar Aug 31 '22 13:08 agy-why

Hi @agy-why, this repository is a perfect location for this issue; there is no need to move it.

Currently, we don't support remote storage services in the way suggested above. What researchers can do is express remote file access needs as special stage-in and stage-out steps in their computational workflow graphs. That is, the first step of the workflow would download the inputs from S3, and the last step of the workflow would upload the results back to S3. For a live example, please see the EOS stage-out example in the documentation: https://docs.reana.io/advanced-usage/storage-backends/eos/
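The stage-in and stage-out pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the step dictionaries are loosely modeled on a serial step list, and the helper name `with_staging`, the file paths, the script name, and the `aws s3 cp` command lines are hypothetical placeholders, not something REANA provides.

```python
def with_staging(analysis_steps, stage_in, stage_out):
    """Return a new step list with stage-in first and stage-out last."""
    return [stage_in, *analysis_steps, stage_out]


# Hypothetical steps: bucket, file names, and commands are placeholders.
stage_in = {
    "name": "stage-in",
    "commands": ["aws s3 cp s3://mybucket/myfile.csv inputs/myfile.csv"],
}
analysis = [
    {
        "name": "analyse",
        "commands": ["python analyse.py inputs/myfile.csv results/out.csv"],
    }
]
stage_out = {
    "name": "stage-out",
    "commands": ["aws s3 cp results/out.csv s3://mybucket/results/out.csv"],
}

steps = with_staging(analysis, stage_in, stage_out)
for step in steps:
    print(step["name"], "->", step["commands"][0])
```

The analysis steps stay unchanged; only the first and last entries know about the remote storage, so swapping S3 for another backend would mean replacing just those two steps.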

We support virtually any external storage system where we can use Kerberos authentication or VOMS proxy authentication mechanisms. Examples include EOS or WLCG sites. Note also that we are in the middle of adding support for Rucio, see https://github.com/reanahub/reana-auth-rucio/issues/1

That said, we have been planning to support syntactic sugar for remote files in a way rather similar to what you suggested. We thought of allowing a syntax like:

inputs:
  files:
    - s3(`mybucket`, `myfile.csv`)

REANA would then do an automatic stage-in and stage-out for this file. One advantage is that researchers wouldn't have to write explicit data staging steps in their DAG workflows.

This is a bit similar to Snakemake support for remote storage, see https://snakemake.readthedocs.io/en/stable/snakefiles/remote_files.html and the examples therein for AWS or S3 in Snakemake rules.

We hope to start working on similar syntactic sugar for remote file storage sometime this winter.

tiborsimko avatar Sep 01 '22 14:09 tiborsimko

P.S. Another related idea we have been thinking about is to add support for popular protocols so that the REANA workspace could be manipulated via tools such as rclone. This might simplify the initial stage-in upload and the final stage-out download, especially when working with many files or with very large files.

tiborsimko avatar Sep 01 '22 14:09 tiborsimko

Dear Tibor,

thank you for your clear and detailed response.

My personal use-case would be to have a single workflow that could work with various data origins: my dev-data are on a server that I can access via scp, my prod-data are on a private s3 infrastructure but they may move to another one (not necessarily s3) after publication of the results.

Therefore I would find it useful to be able to specify not only the source but also the protocol used to access the data outside of the workflow (git repo).

Currently, I need to implement two variants of my first step (get_data) to get the data into the workspace, and I choose between them via input parameters. That is fully acceptable, but I'd greatly appreciate the rclone feature you suggested.

This would allow me to plug input data into, and out of, the same workflow by populating my workspace accordingly.
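For illustration, the two get_data variants described above could be folded into one step that picks the transfer tool from the URL scheme of an input parameter. This is a minimal sketch, not an existing REANA feature: the function name `build_fetch_command`, the example host and paths, and the choice of the `aws` and `scp` command-line tools are assumptions, and the scp branch assumes a URL without a port.

```python
from urllib.parse import urlparse


def build_fetch_command(source, dest):
    """Return an argv list that would copy `source` into `dest`.

    The command is only built here; running it (for example with
    subprocess.run) is left to the caller.
    """
    url = urlparse(source)
    if url.scheme == "s3":
        # AWS CLI: recursive copy of everything under a bucket prefix.
        return ["aws", "s3", "cp", "--recursive", source, dest]
    if url.scheme == "scp":
        # scp expects [user@]host:path; assumes no port in the URL.
        return ["scp", "-r", f"{url.netloc}:{url.path}", dest]
    raise ValueError(f"unsupported protocol: {url.scheme!r}")


# Hypothetical sources for a dev server and an S3 bucket.
print(build_fetch_command("scp://me@devserver/srv/data", "inputs"))
print(build_fetch_command("s3://mybucket/data/", "inputs"))
```

With this shape, moving the production data to another storage system would mean adding one more branch rather than another variant of the step.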

agy-why avatar Sep 05 '22 07:09 agy-why

An alternative would be to mix workflows together; I don't know to what extent this is possible.

I have:

  • a git-repo with workflow get_data_from_scp
  • a git-repo with workflow get_data_from_s3
  • a git-repo with workflow analyse_data

It would be an acceptable solution for me to be able to propagate the workspace of one of the get_data workflows to the analyse_data workspace, or to create a new workflow from, say, get_data_from_scp + analyse_data.

Is this already possible?

I thank you in advance.

agy-why avatar Sep 05 '22 07:09 agy-why