Reimport checkout
Fixes https://github.com/iterative/dvc/issues/10255. This is a partial reversion of https://github.com/iterative/dvc/pull/9246. I'm not sure this is the cleanest or best way to do it, but it solves the problem.
It builds the hashfile object and tries to checkout from that, falling back to download if it fails.
Note that the build step is still slow for 2.x imports.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 90.46%. Comparing base (
dd2b515) to head (d510ad6). Report is 2 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #10388 +/- ##
==========================================
- Coverage 90.72% 90.46% -0.27%
==========================================
Files 501 501
Lines 38865 38885 +20
Branches 5618 5619 +1
==========================================
- Hits 35262 35176 -86
- Misses 2961 3038 +77
- Partials 642 671 +29
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Ping @skshetry
I'll need to investigate. But we set cache.dir to point to the local cache, which should have made the DVCFileSystem use that cache.
https://github.com/iterative/dvc/blob/d510ad612f0d45b52110977b14f352a5dfe4c064/dvc/dependency/repo.py#L154
I am a bit hesitant of this PR as it goes against https://github.com/iterative/dvc/pull/9246, and adds another (complex) codepath.
Yes, I looked into it, and the local cache dir is set, but it's not enough because it's doing download/copy and not checkout.
download() is defined here:
https://github.com/iterative/dvc/blob/7df8b0c09b6c7d527083cacafc1802fd53c47900/dvc/fs/init.py#L47-L76
It calls generic.copy() (https://github.com/iterative/dvc-objects/blob/ae8b95d7ea3bbcf46fc5a09eddc7c4f614023d7b/src/dvc_objects/fs/generic.py#L69), which runs fs.get_file() operations on paths and knows nothing about the cache.
I don't think there is a much simpler way to do it than this PR, and I tried to make it as narrow as possible. It uses #9246 to rely on the local cache, building the object and passing it to dvc-data's checkout. We don't have imported_file.dvc, so there is no way to do a higher-level checkout operation AFAICT.
which runs
fs.get_file()operations on paths and knows nothing about the cache.
fs is a DVCFileSystem, and knows about the cache.
fsis aDVCFileSystem, and knows about the cache.
Yes, but it is doing a copy from that fs to localfs. It is not linking, and even if I try to enable linking, it fails because it is working across filesystems.
Hello, is there any update with this merge request? It would help us a ton if it gets merged.
@skshetry Is there a good reason to keep this blocked? I'm happy to revert if we find time to resolve in a cleaner way later.
@skshetry ping
@dberenbaum, do you have a repro for this? It seems we do use cache and support link_types/cache_types in DataFileSystem.
generic.copy() does call _get and eventually ends up calling get_file for the DataFileSystem.
https://github.com/iterative/dvc-data/blob/c4ed9e8c0cde9be9931136bdd5b9ff80589bcaf7/src/dvc_data/fs.py#L217