dvc
dvc copied to clipboard
import: Allow pushing imported files to remote
Currently, files imported by dvc import
come into the local cache, but do not get push
ed to any remotes. This can be useful in some cases, however at other times it might be desirable to make full backups of imported files, while still retaining the ability to update the local copies via dvc update
.
Discussion with @efiop in Discord at https://discordapp.com/channels/485586884165107732/563406153334128681/751199400931491900 and https://discordapp.com/channels/485586884165107732/565699007037571084/751203534351106119.
To provide this option, it should be enough to just alter https://github.com/iterative/dvc/blob/79e8e4edafb8a4745200f0a62f49c29ef074bdea/dvc/output/base.py#L463-L467 to not always skip imports, according to some flag in the dvcfile.
For the CLI, we though dvc import --backup
conveys the meaning. For the DVCfile, an addition to outs
:
deps: ...
outs:
- md5: 123456
path: my_file.txt
backup: True
So for imports with backup: True
, dvc knows:
- On update / import it should store the files in the local cache (like it already does)
- On push it should transfer them to the remote (this is new, but easy to implement by avoiding the code branch linked above)
- On fetch / pull treat these like normal files from e.g.
dvc add
, so do not interact with the import source, only with the DVC remote
The idea is for imports with --backup
to be severed from their import source completely unless the user calls dvc update
explicitly.
The backup
entry in outs
should be added for all outs, with the same meaning: that way you can tell just by looking at outs
in a dvc file how the file will behave when you push / pull it, without imports having to be treated differently to normal files.
To keep backwards compatibility, dvc files without backup
entries should have default values of backup = False
for imports and backup = True
for everything else.
So the plan of action is:
- [x] Add
backup
option toouts
in dvc files with default setting dependant on the type of file - [x] Change
dvc push
logic to remove custom logic for imports, replace with skipping push to remote ifbackup == False
- [x] Change fetch logic to only update the import from its source if
backup == False
. Display error (warning?) if files aren't present in remote or in cache, just like for ordinary files (i.e. follow the normal path for files added viadvc add
)
That third point could instead have been "only update the import if backup==False
or files aren't present in the cache/remote", but I think it's better as above. This way you know for sure that dvc pull
only interacts with your remote whereas dvc update
goes and fetches your imports.
- [x] Also, somewhere there needs to be a check for incompatible options in
outs
- if a dvc file hascache: False
andbackup: True
in itsouts
section, that's impossible and should throw an error.
The idea is for imports with --backup to be severed from their import source completely unless the user calls dvc update explicitly.
Does update
reset backup to false? What about re-importing (running import
again but without --backup
?
The backup entry in outs should be added for all outs, with the same meaning... without imports having to be treated differently to normal files. To keep backwards compatibility, dvc files without backup entries should have default values of backup = False for imports and backup = True for everything else.
So if we're still having to differentiate .dvc files based on contents of deps
(regular vs. import) for backward compatibility, doesn't that defeat the purpose of adding the field to all outs going fwd for consistency? I'm of the inclination that backup
should only have an effect on import .dvc files, TBH as it seems pretty edge and not something mainstream we want to be super visible.
Also, what about outs
in dvc.yaml
files? would those also always have the backup field? (This would affect docs BTW) — again, feels like a big change for a small use case.
@jorgeorpinel I'll answer in the PR, I'm getting confused between here, the PR and the docs PR!