dvc icon indicating copy to clipboard operation
dvc copied to clipboard

Reimport checkout

Open dberenbaum opened this issue 1 year ago • 8 comments

Fixes https://github.com/iterative/dvc/issues/10255. This is a partial reversion of https://github.com/iterative/dvc/pull/9246. I'm not sure this is the cleanest or best way to do it, but it solves the problem.

It builds the hashfile object and tries to checkout from that, falling back to download if it fails.

Note that the build step is still slow for 2.x imports.

dberenbaum avatar Apr 12 '24 17:04 dberenbaum

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 90.46%. Comparing base (dd2b515) to head (d510ad6). Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10388      +/-   ##
==========================================
- Coverage   90.72%   90.46%   -0.27%     
==========================================
  Files         501      501              
  Lines       38865    38885      +20     
  Branches     5618     5619       +1     
==========================================
- Hits        35262    35176      -86     
- Misses       2961     3038      +77     
- Partials      642      671      +29     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Apr 12 '24 18:04 codecov[bot]

Ping @skshetry

dberenbaum avatar Apr 18 '24 12:04 dberenbaum

I'll need to investigate. But we set cache.dir to point to the local cache, which should have made the DVCFileSystem use that cache.

https://github.com/iterative/dvc/blob/d510ad612f0d45b52110977b14f352a5dfe4c064/dvc/dependency/repo.py#L154

I am a bit hesitant of this PR as it goes against https://github.com/iterative/dvc/pull/9246, and adds another (complex) codepath.

skshetry avatar Apr 26 '24 08:04 skshetry

Yes, I looked into it, and the local cache dir is set, but it's not enough because it's doing download/copy and not checkout.

download() is defined here:

https://github.com/iterative/dvc/blob/7df8b0c09b6c7d527083cacafc1802fd53c47900/dvc/fs/init.py#L47-L76

It calls generic.copy() (https://github.com/iterative/dvc-objects/blob/ae8b95d7ea3bbcf46fc5a09eddc7c4f614023d7b/src/dvc_objects/fs/generic.py#L69), which runs fs.get_file() operations on paths and knows nothing about the cache.

I don't think there is a much simpler way to do it than this PR, and I tried to make it as narrow as possible. It uses #9246 to rely on the local cache, building the object and passing it to dvc-data's checkout. We don't have imported_file.dvc, so there is no way to do a higher-level checkout operation AFAICT.

dberenbaum avatar Apr 26 '24 14:04 dberenbaum

which runs fs.get_file() operations on paths and knows nothing about the cache.

fs is a DVCFileSystem, and knows about the cache.

skshetry avatar Apr 29 '24 06:04 skshetry

fs is a DVCFileSystem, and knows about the cache.

Yes, but it is doing a copy from that fs to localfs. It is not linking, and even if I try to enable linking, it fails because it is working across filesystems.

dberenbaum avatar Apr 29 '24 21:04 dberenbaum

Hello, is there any update with this merge request? It would help us a ton if it gets merged.

Honzys avatar Jun 05 '24 12:06 Honzys

@skshetry Is there a good reason to keep this blocked? I'm happy to revert if we find time to resolve in a cleaner way later.

dberenbaum avatar Jun 06 '24 16:06 dberenbaum

@skshetry ping

Honzys avatar Jul 22 '24 14:07 Honzys

@dberenbaum, do you have a repro for this? It seems we do use cache and support link_types/cache_types in DataFileSystem. generic.copy() does call _get and eventually ends up calling get_file for the DataFileSystem.

https://github.com/iterative/dvc-data/blob/c4ed9e8c0cde9be9931136bdd5b9ff80589bcaf7/src/dvc_data/fs.py#L217

skshetry avatar Aug 12 '24 10:08 skshetry