dvc icon indicating copy to clipboard operation
dvc copied to clipboard

import: Allow pushing imported files to remote

Open charlesbaynham opened this issue 4 years ago • 5 comments

Currently, files imported by dvc import come into the local cache, but do not get pushed to any remotes. This can be useful in some cases, however at other times it might be desirable to make full backups of imported files, while still retaining the ability to update the local copies via dvc update.

Discussion with @efiop in Discord at https://discordapp.com/channels/485586884165107732/563406153334128681/751199400931491900 and https://discordapp.com/channels/485586884165107732/565699007037571084/751203534351106119.

To provide this option, it should be enough to just alter https://github.com/iterative/dvc/blob/79e8e4edafb8a4745200f0a62f49c29ef074bdea/dvc/output/base.py#L463-L467 to not always skip imports, according to some flag in the dvcfile.

For the CLI, we though dvc import --backup conveys the meaning. For the DVCfile, an addition to outs:

deps: ...
outs:
- md5: 123456
  path: my_file.txt
  backup: True

So for imports with backup: True, dvc knows:

  • On update / import it should store the files in the local cache (like it already does)
  • On push it should transfer them to the remote (this is new, but easy to implement by avoiding the code branch linked above)
  • On fetch / pull treat these like normal files from e.g. dvc add, so do not interact with the import source, only with the DVC remote

The idea is for imports with --backup to be severed from their import source completely unless the user calls dvc update explicitly.

charlesbaynham avatar Sep 03 '20 22:09 charlesbaynham

The backup entry in outs should be added for all outs, with the same meaning: that way you can tell just by looking at outs in a dvc file how the file will behave when you push / pull it, without imports having to be treated differently to normal files.

To keep backwards compatibility, dvc files without backup entries should have default values of backup = False for imports and backup = True for everything else.

charlesbaynham avatar Sep 19 '20 09:09 charlesbaynham

So the plan of action is:

  • [x] Add backup option to outs in dvc files with default setting dependant on the type of file
  • [x] Change dvc push logic to remove custom logic for imports, replace with skipping push to remote if backup == False
  • [x] Change fetch logic to only update the import from its source if backup == False. Display error (warning?) if files aren't present in remote or in cache, just like for ordinary files (i.e. follow the normal path for files added via dvc add)

That third point could instead have been "only update the import if backup==False or files aren't present in the cache/remote", but I think it's better as above. This way you know for sure that dvc pull only interacts with your remote whereas dvc update goes and fetches your imports.

charlesbaynham avatar Sep 19 '20 09:09 charlesbaynham

  • [x] Also, somewhere there needs to be a check for incompatible options in outs - if a dvc file has cache: False and backup: True in its outs section, that's impossible and should throw an error.

charlesbaynham avatar Sep 19 '20 09:09 charlesbaynham

The idea is for imports with --backup to be severed from their import source completely unless the user calls dvc update explicitly.

Does update reset backup to false? What about re-importing (running import again but without --backup?

The backup entry in outs should be added for all outs, with the same meaning... without imports having to be treated differently to normal files. To keep backwards compatibility, dvc files without backup entries should have default values of backup = False for imports and backup = True for everything else.

So if we're still having to differentiate .dvc files based on contents of deps (regular vs. import) for backward compatibility, doesn't that defeat the purpose of adding the field to all outs going fwd for consistency? I'm of the inclination that backup should only have an effect on import .dvc files, TBH as it seems pretty edge and not something mainstream we want to be super visible.

Also, what about outs in dvc.yaml files? would those also always have the backup field? (This would affect docs BTW) — again, feels like a big change for a small use case.

jorgeorpinel avatar Sep 21 '20 17:09 jorgeorpinel

@jorgeorpinel I'll answer in the PR, I'm getting confused between here, the PR and the docs PR!

charlesbaynham avatar Sep 21 '20 17:09 charlesbaynham