
new command: put-url OR rsync/rclone

Open casperdcl opened this issue 3 years ago • 34 comments

Summary

An upload equivalent of dvc get-url.

We currently use get-url as a cross-platform replacement for wget. Together with get-url, put-url would turn DVC into a replacement for rsync/rclone.

Motivation

  • we already have get-url so adding put-url seems natural for the same reasons
  • put-url will be used by
    • CML internally to sync data
    • LDB internally to sync data
    • the rest of the world
  • uses existing functionality of DVC so should be fairly quick to expose
  • cross-platform multi-cloud replacement for rsync/rclone. What's not to love?
    • could even create a spin-off thin wrapper (or even abstract the functionality) in a separate Python package

Detailed Design

usage: dvc put-url [-h] [-q | -v] [-j <number>] url targets [targets ...]

Upload or copy files to URL.
Documentation: <https://man.dvc.org/put-url>

positional arguments:
  url                   Destination path to put data to.
                        See `dvc import-url -h` for full list of supported
                        URLs.
  targets               Files/directories to upload.

optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           Be quiet.
  -v, --verbose         Be verbose.
  -j <number>, --jobs <number>
                        Number of jobs to run simultaneously. The default
                        value is 4 * cpu_count(). For SSH remotes, the default
                        is 4.
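
A hypothetical invocation, following the proposed argument order (destination URL first, then local targets):

$ dvc put-url s3://mybucket/backup/ data.csv models/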

How We Teach This

Drawbacks

  • can't think of any

Alternatives

  • would have to re-implement per-cloud sync options for CML & other products

Unresolved Questions

  • minor implementation details
    • CLI naming (put-url)?
    • CLI argument order (url targets [targets...])?
    • Python API (dvc.api.put_url())?

Please do assign me if happy with the proposal.

(dvc get-url + put-url = dvc rsync :))

casperdcl avatar Aug 24 '21 11:08 casperdcl

(dvc get-url + put-url = dvc rsync :))

Does it make sense to have a separate put-url command strictly for uploads, or would it be better to have a combined transfer command? This has both UI and usage implications, and a combined transfer command could enable:

  • Transfers that never touch local disk. Downloading files just to upload them could be a waste of time or even impossible if disk space is insufficient.
  • dvc get equivalent functionality where remote data in dvc/git (for example, a data/model registry) can be transferred to cloud destinations without cloning the repo or pulling the data.
Python API (dvc.api.put_url())?

dvc get-url and dvc.api.get_url() don't really do the same thing unfortunately, so I'm unclear what dvc.api.put_url() would do? That confusion might be another reason to prefer a two-way transfer command over separate download and upload commands.

dberenbaum avatar Aug 24 '21 21:08 dberenbaum

dvc get-url and dvc.api.get_url() don't really do the same thing

👀 well that sounds like a problem

would it be better to have a combined transfer command?

Maybe this wouldn't be too difficult to implement. Summarising what I think the most useful rsync options may be:

dvc rsync [-h] [-q | -v] [-j <number>] [--recursive] [--ignore-existing]
          [--remove-source-files] [--include <pattern>] [--exclude <pattern>]
          [--list-only] src [src...] dst

We've endeavoured to provide buffered read/write methods for all remote types so that we can show progress... So simply gluing said BufferedIO methods together should provide the desired functionality.
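
For illustration, a minimal sketch of what gluing those buffered streams together could look like, assuming fsspec-style filesystems (which DVC's remotes build on). The transfer() helper and its parameters are hypothetical, not an existing DVC API:

from fsspec.core import url_to_fs
from tqdm import tqdm

def transfer(src_url: str, dst_url: str, chunk_size: int = 2**20) -> None:
    """Stream a single file from src_url to dst_url without staging it on local disk."""
    src_fs, src_path = url_to_fs(src_url)
    dst_fs, dst_path = url_to_fs(dst_url)
    total = src_fs.size(src_path)  # total bytes, for the progress bar
    with src_fs.open(src_path, "rb") as src, dst_fs.open(dst_path, "wb") as dst:
        with tqdm(total=total, unit="B", unit_scale=True, desc=dst_path) as pbar:
            while chunk := src.read(chunk_size):
                dst.write(chunk)
                pbar.update(len(chunk))

# e.g. transfer("model.h5", "s3://mybucket/model.h5")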

casperdcl avatar Aug 25 '21 21:08 casperdcl

👀 well that sounds like a problem

😬 Yup, I only realized this pretty recently myself. Extracted to #6494.

dberenbaum avatar Aug 26 '21 15:08 dberenbaum

The initial motivation for this command was to upload data to storage and preserve a pointer file (*.dvc file) to it for future downloads. This command (and its download equivalent) is supposed to work from non-DVC repositories - that means no dependency on the .dvc/ dir.

It seems like we need an "upload equivalent" of import-url, not get-url.

Proposed names: dvc export url [--no-exec] FILE URL, plus renaming some of the existing commands:

  • dvc import-url renamed to dvc import url
  • dvc import renamed to dvc import dvc

The --no-exec option is needed for cases when storage credentials are not set. It means no upload/download (and no check for file existence); only a pointer file is generated. In the case of downloading, the pointer file should have empty checksums.

New commands:

  • dvc export url FILE URL - the simplest way to upload a file and preserve a pointer file to it
  • dvc export model FILE URL - with a set of options to specify meta info for the model (description, model type, input/output type, etc)
  • dvc export data FILE URL - with a set of options to specify meta info for the data (description, data type, column names for structured data, class distribution for unstructured data, etc.)
  • dvc export dvc FILE URL (do we need this one?)
  • dvc import model URL FILE
  • dvc import data URL FILE

PS: it is related to the idea of lightweight model and data management

dmpetrov avatar Dec 26 '21 11:12 dmpetrov

The initial motivation for this command was to upload data to storage and preserve a pointer file (*.dvc file) [...] It seems like we need an "upload equivalent" of import-url, not get-url.

export url sounds like a different feature request & use case to me. put url meanwhile is meant to work like get url, i.e. no pointer files/metadata.

I also think put url is a prerequisite to export url just like get url is a prerequisite to import url.

casperdcl avatar Dec 27 '21 07:12 casperdcl

put url is a good idea and I hope it will be implemented as a part of this initiative. But we should understand that this is not a part of the requirements.

I'd suggest focusing on the core scenario - export url - that is required for CML and MLEM.

dmpetrov avatar Dec 27 '21 08:12 dmpetrov

the simplest way to upload a file and preserve a pointer file to it

does it mean "aws s3 cp local_path_in_repo s3://remote_path && dvc import-url s3://remote_path -o local_path_in_repo"?

It sounds then that this should be an extension to import itself (move a file before import, or something). I don't feel that this deserves the full name "export" since it won't be doing anything different from import-url.

shcheklein avatar Dec 27 '21 15:12 shcheklein

My original intention with this feature request was just the first bit (aws s3 cp local_path_in_repo s3://remote_path, or even just aws s3 cp local_path_NOT_in_repo s3://remote_path). I don't have any strong opinions about the other import/export functionality & API that can be built on top.

casperdcl avatar Dec 27 '21 15:12 casperdcl

My original intention with this feature request was just the first bit

Yep, and the name put makes sense for that.

shcheklein avatar Dec 27 '21 15:12 shcheklein

My original intention with this feature request was just the first bit

Yep, and the name put makes sense for that.

I thought we were solving the CML and MLEM problems with this proposal, weren't we? If not, I'm creating a separate issue for the integrations and we can keep the current one as is.

dmpetrov avatar Dec 27 '21 21:12 dmpetrov

If not, I'm creating a separate issue for the integrations and we can keep the current one as is.

I'm fine with this one :)

My question stays - it doesn't feel like it's export. It does copy + import under the hood, right? So why export then? Why not an option for the import command (to copy the artifact to the cloud first)?

shcheklein avatar Dec 27 '21 21:12 shcheklein

@shcheklein yes, it can be just an option of import like dvc import url/model/data --upload.

import/export naming is definitely not perfect. So, an option might be a safer choice in the short term.

dmpetrov avatar Dec 27 '21 21:12 dmpetrov

renaming some of the existing commands

See https://github.com/iterative/dvc/issues/6494#issuecomment-906600254 for a proposed syntax to cover both dvc urls and arbitrary urls with one command.

The --no-exec option is needed for cases when storage credentials are not set. It means no upload/download (and no check for file existence); only a pointer file is generated.

--no-exec exists today in import and import-url to create the .dvc file (with the hash) without downloading. Seems like we need separate flags to skip the check for existence (and adding the hash) and the download.

I'd suggest focusing on the core scenario - export url - that is required for CML and MLEM.

My original intention with this feature request was just the first bit (aws s3 cp local_path_in_repo s3://remote_path, or even just aws s3 cp local_path_NOT_in_repo s3://remote_path). I don't have any strong opinions about the other import/export functionality & API that can be built on top.

Seems like it's unclear whether CML needs put, export, or both. What are the CML use cases for each?

does it mean "aws s3 cp local_path_in_repo s3://remote_path && dvc import-url s3://remote_path -o local_path_in_repo"?

Hm, I thought export would differ from import in that updates would always be uploaded from local to remote (instead of downloading from remote to local). Example workflow:

  • Data scientist generates a model at local/repo/model.h5 as part of model development.
  • Data scientist uploads it via dvc export model.h5 s3://delivery_bucket/model.h5 and notifies engineers, who consume it for deployment without any DVC knowledge.
  • Data scientist updates the model at local/repo/model.h5.
  • Data scientist uploads the updated model via dvc update and notifies engineers that a new model is available for deployment at s3://delivery_bucket/model.h5.
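
A rough command-line sketch of that workflow (all commands hypothetical, following the proposed export/update semantics):

$ dvc export model.h5 s3://delivery_bucket/model.h5   # upload and create model.h5.dvc
$ # ... retrain; model.h5 changes locally ...
$ dvc update model.h5.dvc                             # re-upload the new version to the same URL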

dberenbaum avatar Jan 03 '22 21:01 dberenbaum

Hm, I thought export would differ from import in that updates would always be uploaded from local to remote

It might make sense to name it export then. That's why I was asking about the semantics of it. From the previous discussions (Dmitry's and Casper's) I didn't get it exactly.

shcheklein avatar Jan 03 '22 22:01 shcheklein

I'm trying to aggregate our discussions here and in person into action points:

  1. [Must-have] dvc export that should upload a local file to a cloud and preserve a link (.dvc file) similar to the result of dvc import-url.
  2. [Nice-to-have] dvc put-url. It is not a part of the use cases (see below), but something like this needs to work under the hood of dvc export anyway, and it might be handy for other scenarios.
  3. [Nice-to-have] dvc import-url --etags-only (--no-exec but it gets etags from the cloud) and/or dvc update --etags-only. This is needed to track file status when the file is not downloaded locally.

Important:

  • All these commands have to support non-DVC environments, and even non-Git environments.
  • All these commands have to support directories, since a model might be a directory (this might be postponed to a later iteration).

Below are user use cases that should help to understand the scenarios.

From local to Cloud/S3

A model out/model.h5 is saved in a local directory: on a local machine, in the cloud (TPI), or in CML; it might be in a DVC/Git repo or just a directory like ~/. The model needs to be uploaded to a specified place/URL in cloud storage such as S3. The user needs to keep the pointer file (.dvc) for future use.

Why the user needs the pointer file:

  • for a record / lineage
  • for a 3rd-party tool (deployment, for example) or dvc get to download the file
  • to check status - whether the file was changed

Uploading

$ dvc export out/model.h5 s3://mybucket/ml/prod/my-model.h5
To track the changes with git, run:

    git add out/model.h5.dvc .gitignore
$ git add out/model.h5.dvc
$ git commit -m 'exporting a file'

Note: this command is equivalent to aws s3 cp file s3://path && dvc import-url s3://path file. We can consider introducing a separate command to cover the copy part in a cross-cloud way - dvc put-url. However, the priority is not high in the context of this scenario.

Updating

A model file was changed (as a result of re-training, for example):

$ dvc update out/model.h5.dvc # It should work now if the Uploading part is based on `import-url`
To track the changes with git, run:

    git add out/model.h5.dvc .gitignore
$ git add out/model.h5.dvc
$ git commit -m 'File was changed in S3'

From cloud to workspace

Users write models/data to the cloud from their own code (or it is already updated by an external tool). Saving a pointer to a model file might still be useful. Why:

  • for a record / lineage
  • for a 3rd-party tool (deployment, for example) or dvc get to download the file
  • to know how to update it if the model changes

Tracking a cloud file

After training is done and a file is saved to s3://mybucket/ml/prod/2022-03-07-model.h5:

$ dvc import-url s3://mybucket/ml/prod/2022-03-07-model.h5 my-model.h5
To track the changes with git, run:

    git add my-model.h5.dvc .gitignore
$ git add my-model.h5.dvc
$ git commit -m 'importing a file'

Tracking a cloud file without a local copy

In some cases, the user writes a file to storage and does not need a copy in the workspace. dvc import-url --no-exec seems like a good option to cover this case.

$ dvc import-url --no-exec s3://mybucket/ml/prod/2022-03-07-model.h5 my-model.h5
To track the changes with git, run:

    git add my-model.h5.dvc .gitignore
$ git add my-model.h5.dvc
$ git commit -m 'importing a file'

Technically, the file will still have a virtual representation in the workspace as my-model.h5. However, it won't be materialized until dvc update my-model.h5.dvc is called.

Pros/Cons:

  • [Pros] It is consistent with the existing dvc commands.
  • [Pros] GitOps can reference a "virtual" model file. CC @aguschin
  • [Cons] The .dvc file does not have checksums and etags. The user cannot recognize whether the file was changed in the cloud (compared to the last time import-url was called).

To cover the last con, we can consider introducing dvc import-url --etags-only (--no-exec but getting etags from the cloud) and/or dvc update --etags-only.
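
For illustration, a rough sketch of what the pointer file from such a hypothetical --etags-only import might contain (the layout mirrors today's import-url .dvc files; the exact fields here are an assumption):

# my-model.h5.dvc (hypothetical result of `dvc import-url --etags-only`)
frozen: true
deps:
- path: s3://mybucket/ml/prod/2022-03-07-model.h5
  etag: '"1b2cf535f27731c9743436..."'   # fetched from the cloud, no download
outs:
- path: my-model.h5                     # no md5/size: the file was never materialized locally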

dmpetrov avatar Mar 09 '22 21:03 dmpetrov

@dmpetrov

Could you please clarify this:

"It should work now if the Uploading part is based on import-url" - just expand on this a bit. I'm not sure I understand what direction files go when you do dvc update


My initial reaction is that aws s3 cp file s3://path && dvc import-url s3://path file semantics doesn't deserve a global dvc export command to be honest. It still feels very much like import, not export, since we'll have pretty much an import .dvc file in the repo that detects changes outside and imports a file from the cloud.

External outputs remind me of export: dvc run -d model.pkl -o s3://model.pkl aws s3 cp model.pkl s3://model.pkl. It means that every time the model changes in the repo, it's being exported to S3.

shcheklein avatar Mar 10 '22 00:03 shcheklein

Could you please clarify this please:

"It should work now if the Uploading part is based on import-url

dvc update re-downloads the file. What I mean is that a regular dvc update out/model.h5.dvc will work just fine if the result of dvc export is the same as dvc import-url (in contrast to external outputs, where you need to re-run the pipeline).

The logic is:

  • dvc import-url - downloading an external file
  • dvc export - uploading a file to an external storage
  • dvc update - updating/re-downloading an external file
  • dvc status - checking if a local file is synchronized with its external source

To be honest, I'd rename the first two to download and upload. If we mix up the direction, the user will have similar issues.

My initial reaction is that aws s3 cp file s3://path && dvc import-url s3://path file semantics doesn't deserve a global dvc export command to be honest.

aws s3 cp is not an option here because we need to abstract out from clouds. Alternatively, we can consider dvc put-url file s3://path && dvc import-url s3://path file, but having a single command still looks like a better option.

External outputs remind me of export.

Yes, but the internal machinery and logic are very different. You need a pipeline for external outputs, which is not compatible with the no-DVC requirements and won't be intuitive for users.

dmpetrov avatar Mar 10 '22 03:03 dmpetrov

the result of dvc export is the same as dvc import-url,

that's exactly the sign that we are mixing the semantics

You need a pipeline for external outputs, which is not compatible with the no-DVC requirements and won't be intuitive for users.

not necessarily btw, dvc add --external s3://mybucket/existing-data works (at least it worked before)

aws s3 cp is not an option here because we need to abstract out from clouds

Yep, I understand. It's not so much about redundancy of a command, it's more about the semantics still. It confuses me a bit that export internally does import.

For example, we can make dvc export create a .dvc file with a single dependency on model.pkl and an external output to s3://model.pkl. Something like the result of dvc add --external s3://mybucket/existing-data, but that also saves information (if it's needed) about the local file name that was the source.

And dvc update on this file would work the other way around - it would be uploading the file to s3 (exporting).

but having a single command still looks like a better option.

If we want to keep these semantics (an import link created inside export), I would probably even prefer to have put-url and do import-url manually. It would be less confusing and very explicit to my mind.


Also, if we go back to the From local to Cloud/S3 workflow: it states that we create the model as a local file, so does that mean the update will also happen locally when we retrain it? That means dvc update should be uploading the new file in this case. At least that's the way I'm reading it.

shcheklein avatar Mar 10 '22 03:03 shcheklein

And dvc update on this file would work the other way around - it would be uploading the file to s3 (exporting).

It looks like the direction of the upload is your major concern. Is that correct?

Also, if we go back to the From local to Cloud/S3 workflow: it states that we create the model as a local file, so does that mean the update will also happen locally when we retrain it?

It means the upload happens as a result of dvc export. It is decoupled from training, and you are supposed to re-upload the file with dvc commands. In this case, changing the direction of dvc update might be a better choice from a workflow point of view.

dmpetrov avatar Mar 10 '22 04:03 dmpetrov

From local to Cloud/S3

In this scenario, the user has their own local model.h5 file already. It may or may not be tracked by DVC. If it is tracked by DVC, it might be tracked in model.h5.dvc or within dvc.lock (if it's generated by a DVC stage).

If they want to upload to the cloud and keep a pointer locally, dvc export can be equivalent to dvc run --external -n upload_data -d model.h5 -o s3://testproject/model.h5 aws s3 cp model.h5 s3://testproject/model.h5. This is the inverse of import-url, as shown in the example in https://dvc.org/doc/command-reference/import-url#description.

As @shcheklein noted, the workflow here assumes the user saves updates locally, so it makes sense for update to go in the upload direction and enforce a canonical workflow of save locally -> upload new version.

Similar to how import-url records the external path as a dependency and the local path as an output, export can record the local path as a dependency and the external path as an output. Since a model.h5.dvc file may already exist from a previous dvc add (with model.h5 as an output), it might make more sense to save the export info with some other file extension, like model.h5.export.dvc (this avoids conflicts between the dependencies and outputs of each).
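
For illustration only, a rough sketch of how the two pointer files could differ (the export layout is hypothetical; the field names simply mirror today's import-url .dvc files):

# model.h5.dvc produced by `dvc import-url` (today)
deps:
- path: s3://testproject/model.h5   # external source
outs:
- path: model.h5                    # local copy

# model.h5.export.dvc as sketched above (hypothetical)
deps:
- path: model.h5                    # local file is the source
outs:
- path: s3://testproject/model.h5   # external destination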

I'll follow up on the other scenarios in another comment to keep this from being too convoluted 😅

Edit: On second thought, maybe it's better to resolve this scenario first 😄 . The others might require a separate discussion.

dberenbaum avatar Mar 11 '22 19:03 dberenbaum

If we go with the bi-directional ~~dvc upload~~ dvc update, then we are splitting it into two major cases:

  1. Local to Storage. It should be based on external outputs, similar to dvc run --external -o s3://.... dvc update file.dvc uploads the file to the cloud.
    • a. No-DVC file. It is straightforward: the export command just creates a .dvc file.
    • b. DVC file.
      • Pipeline file (.lock). Q: Should the pipeline do an automatic upload if a result of dvc export is in a pipeline? To my mind it is not necessary, since we would need to make quite a strong assumption about productization and performance.
      • Data file (.dvc). dvc export should generate a slightly different file.external.dvc in addition to file.dvc. Q: It does not seem like a default use case; can we assume that the user will do the renaming manually with dvc export -f file.export.dvc?
  2. Storage to local. It should be based on external dependencies, similar to dvc import-url s3://.... dvc update downloads files from the cloud.
    • a. With a local copy. Just a regular dvc import-url s3://... file
    • b. Without a local copy. Similar to dvc import-url --no-exec, but better to introduce dvc import-url --etags-only (see above).

@shcheklein @dberenbaum WDYT?

dmpetrov avatar Mar 11 '22 20:03 dmpetrov

@dmpetrov I think it makes sense. However, I think the "Storage to local" scenarios are a little convoluted.

If model updates are happening external to user code and saved in the cloud, or the user already has a model in the cloud saved previously by their code, import-url makes sense.

If instead they are using dvc to track a model that their user code saves in the cloud, import-url seems awkward because they probably never need a local copy. Even if they use --etags-only, if they use the file downstream, it will need to be downloaded. It's also unintuitive because import-url is intended for downloading dependencies instead of tracking outputs.

An alternative is to change how external outputs work. I put a mini-proposal at the bottom of https://www.notion.so/iterative/Model-Management-in-DVC-af279e36b8be4e929b08df7a491e1a4c. It's still a work in progress, but if you have time, PTAL and see if the direction makes sense.

dberenbaum avatar Mar 11 '22 21:03 dberenbaum

If instead they are using dvc to track a model that their user code saves in the cloud, import-url seems awkward because they probably never need a local copy.

Right. This does not look like a core scenario.

Just to make it clear - Storage to local covers use cases where a model was created outside of a repository. Examples: a user imports an external model to use GitOps/Model-Registry functionality, or imports a pre-trained model or an existing dataset.

dmpetrov avatar Mar 11 '22 21:03 dmpetrov

In your earlier comment, you seemed to indicate that the scope was broader:

From cloud to workspace

Users write models/data to the cloud from their own code (or it is already updated by an external tool).

Are we now limiting it to cases where the model was updated by an external tool?

Edit: Or maybe writing models to cloud from user's code is part of "Local to Storage." Either way, I think there's a core scenario for writing models directly to cloud from user's code that isn't covered by export or import-url.

dberenbaum avatar Mar 11 '22 21:03 dberenbaum

It was described as a broader scenario, but the major goal was to cover the Lightweight Model Management use case (see "user imports an external model to use GitOps/Model-Registry functionality"). It can be useful in some other scenarios (see "importing a pre-trained model or existing dataset").

However, importing a model trained in the same repo back into it does not make sense. We are introducing From local to Cloud for this.

dmpetrov avatar Mar 11 '22 21:03 dmpetrov

In the context of model management we nailed down the scope - see #7918

dmpetrov avatar Jun 18 '22 19:06 dmpetrov

Any news on this? I really want to "materialize" a specific commit to a remote cloud bucket without directly using cloud-specific CLI tools.

bhack avatar Aug 26 '23 14:08 bhack

@bhack No progress here, sorry.

efiop avatar Aug 26 '23 23:08 efiop

@bhack Could you explain more about your scenario? One option might be to push to a cloud-versioned remote, which would show files in a more typical directory structure.

dberenbaum avatar Aug 28 '23 00:08 dberenbaum

@dberenbaum In some cases I need to use gcsfuse or similar. As we currently don't have a pull --to-remote option, we need to materialize the requested commit locally on the host filesystem with pull and then sync it with a cloud bucket using native CLIs or libraries. Materializing multiple commits in parallel is also not data efficient.

bhack avatar Aug 28 '23 00:08 bhack