
import/update: cache git repos/clones

Open casperdcl opened this issue 4 years ago • 7 comments

dvc import https://some/git/repo/ some_file
dvc update  # should not re-clone, should only pull into existing cache
  • related: #3438, #3473

casperdcl avatar Mar 17 '20 17:03 casperdcl

The thing is, the cache is not persisted between dvc runs. If we make it persistent, dvc update won't re-clone; it will only do a git pull.

Suor avatar Apr 03 '20 04:04 Suor

yes; this is about making it persistent & pulling rather than re-cloning.

casperdcl avatar Apr 03 '20 12:04 casperdcl
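
As a rough illustration of the "persist and pull rather than re-clone" idea discussed above, here is a minimal sketch. It is not DVC's implementation: the function names `clone_dir_for` and `sync_command`, the hashed directory layout, and the choice of `git fetch` are all assumptions made for this example. The sketch only decides which git command would run; it does not execute anything.

```python
import hashlib
from pathlib import Path
from typing import List


def clone_dir_for(url: str, cache_root: Path) -> Path:
    """Map a repo URL to a stable directory under cache_root.

    Hypothetical layout: one directory per URL, named by a hash prefix.
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return cache_root / digest


def sync_command(url: str, cache_root: Path) -> List[str]:
    """Return the git command that would bring the cached clone up to date."""
    target = clone_dir_for(url, cache_root)
    if (target / ".git").exists():
        # A persistent clone already exists: fetch into it instead of re-cloning.
        return ["git", "-C", str(target), "fetch", "--all", "--tags"]
    # First use of this URL: create the persistent clone.
    return ["git", "clone", url, str(target)]
```

A caller could pass the returned list to `subprocess.run(..., check=True)`. Because the directory name depends only on the URL, repeated imports or updates of the same repo would land in the same clone.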

What about a repo cache at the user level? Could be a system config var so you can disable it, like analytics.

Context: #4203

jorgeorpinel avatar Jul 15 '20 16:07 jorgeorpinel

In light of #4246 being merged, I'm going to downgrade the priority here...

casperdcl avatar Jul 19 '21 18:07 casperdcl

Persistent clones (as per #10511) are different from shallow clones (as per #4246). Both speed up cloning (or potentially avoid it), but only persistent clones would let us work with imported data without internet connectivity, which is necessary for us on an HPC where most queues have no connectivity.

Persistent clones would also allow us to separate cloning (which requires connectivity) from other dvc operations (which don't). This would allow us to do the former in an environment (queue) with connectivity and the latter in environments without.

johnyaku avatar Aug 14 '24 22:08 johnyaku

@johnyaku Have you considered keeping a clone on a shared space of the HPC so you can import from there instead of from the internet? Even if dvc had some support for caching clones, it would likely still need to check the internet to fetch updates from those clones. If you have your own clone of the repo, you can fully control when to update it and everyone can share that single repo copy (dvc will not make a new clone of a local repo).

dberenbaum avatar Aug 15 '24 13:08 dberenbaum