
dvc.api DVCFileSystem does not use cache

Open clenico opened this issue 1 year ago • 2 comments

Hi! I am encountering some issues while using DVC. We would like to be able to fetch a versioned dataset using DVC.

Help would be greatly appreciated!

Thank you a lot for the work you put into this!

Bug Report

Description

The DVC documentation states that the cache will be reused when using the DVCFileSystem.get command. However, running the function twice redownloads everything.

Reproduce

Given a repository configured as:

[core]
    analytics = false
    autostage = true
    remote = storage
['remote "storage"']
    url = s3://bucket/dataset/
    endpointurl = http://minio.endpoint.com:8910
    access_key_id = user
    secret_access_key = pass

Execute this inside a Jupyter notebook:

from dvc.api import DVCFileSystem
url = f"https://user:{token}@path/to/repo"
dvc_fs = DVCFileSystem(url, rev="v3")

%%time
dvc_fs.get("data", "data", recursive=True)

%%time
dvc_fs.get("data", "data", recursive=True)

Expected

I expected the second call to be much quicker, since it should only need to check that the files are already present.
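Outside a notebook, the same comparison can be made without the %%time magic. The following is a minimal sketch, assuming only an object that exposes get(src, dst, ...); the names time_get and _FakeFS are hypothetical and not part of DVC, and the fake stands in for a real DVCFileSystem so the snippet runs without network access.

```python
import time


def time_get(fs, src, dst, **kwargs):
    """Call fs.get(src, dst, **kwargs) and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    fs.get(src, dst, **kwargs)
    return time.perf_counter() - start


class _FakeFS:
    """Hypothetical stand-in for DVCFileSystem; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def get(self, src, dst, **kwargs):
        self.calls.append((src, dst, kwargs))


fs = _FakeFS()
first = time_get(fs, "data", "data", recursive=True)
second = time_get(fs, "data", "data", recursive=True)
print(f"first call: {first:.3f}s, second call: {second:.3f}s")
```

With a real DVCFileSystem passed in place of _FakeFS, a second timing close to the first would match the behavior reported here.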

Environment information

Python 3.10.12 dvc==3.13.3

Output of dvc doctor:

9:34:01 › dvc doctor                                                                                                                                                 
DVC version: 3.13.3 (pip)                                                                                                                       
-------------------------                                                                                                             
Platform: Python 3.10.12 on Linux-6.5.0-41-generic-x86_64-with-glibc2.35                                                                                               
Subprojects:                                                                                                                                                  
        dvc_data = 2.12.2
        dvc_objects = 0.25.0
        dvc_render = 0.5.3
        dvc_task = 0.3.0
        scmrepo = 1.1.0
Supports:
        http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2023.6.0, boto3 = 1.28.17)
Config:
        Global: /home/imagedpt/.config/dvc
        System: /etc/xdg/xdg-xmonad/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p2
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/nvme0n1p2
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/a51859b3cd5d2e970e30c86866fa3b40

clenico avatar Jun 28 '24 07:06 clenico

3.13.3 is a very old dvc version. Could you please try with the latest version?

skshetry avatar Jun 28 '24 07:06 skshetry

I somehow ended up with an old version of DVC... it works like a charm on dvc[s3]==3.51.2. EDIT: I spoke too soon.

The code presented above does not fill up the cache. From my understanding, the cache is created when running:

dvc_fs.repo.pull()

This fails with the following exception (but still creates the cache at /tmp/tmpixxn04mvdvc-cache/files):

dvc_fs.repo.pull(recursive=True, allow_missing=True)

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[9], line 1
----> 1 dvc_fs.repo.pull(recursive=True, allow_missing=True)

File ~/.virtualenvs/dvc_s3/lib/python3.10/site-packages/dvc/repo/__init__.py:58, in locked.<locals>.wrapper(repo, *args, **kwargs)
     55 @wraps(f)
     56 def wrapper(repo, *args, **kwargs):
     57     with lock_repo(repo):
---> 58         return f(repo, *args, **kwargs)

File ~/.virtualenvs/dvc_s3/lib/python3.10/site-packages/dvc/repo/pull.py:42, in pull(self, targets, jobs, remote, all_branches, with_deps, all_tags, force, recursive, all_commits, run_cache, glob, allow_missing)
     30 processed_files_count = self.fetch(
     31     expanded_targets,
     32     jobs,
   (...)
     39     run_cache=run_cache,
     40 )
     41 try:
---> 42     stats = self.checkout(
     43         targets=expanded_targets,
     44         with_deps=with_deps,
     45         force=force,
     46         recursive=recursive,
     47         allow_missing=allow_missing,
     48     )
     49 except CheckoutError as exc:
     50     exc.stats["fetched"] = processed_files_count

File ~/.virtualenvs/dvc_s3/lib/python3.10/site-packages/dvc/repo/__init__.py:58, in locked.<locals>.wrapper(repo, *args, **kwargs)
     55 @wraps(f)
     56 def wrapper(repo, *args, **kwargs):
     57     with lock_repo(repo):
---> 58         return f(repo, *args, **kwargs)

File ~/.virtualenvs/dvc_s3/lib/python3.10/site-packages/dvc/repo/checkout.py:178, in checkout(self, targets, with_deps, force, relink, recursive, allow_missing, **kwargs)
    175 out_path = self.fs.join(self.root_dir, *key)
    177 if out_path in failed:
--> 178     self.fs.remove(out_path, recursive=True)
    179 else:
    180     self.state.save_link(out_path, self.fs)

File ~/.virtualenvs/dvc_s3/lib/python3.10/site-packages/dvc_objects/fs/base.py:567, in FileSystem.rm(self, path, recursive, **kwargs)
    561 def rm(
    562     self,
    563     path: Union[AnyFSPath, list[AnyFSPath]],
    564     recursive: bool = False,
    565     **kwargs,
    566 ) -> None:
--> 567     self.fs.rm(path, recursive=recursive, **kwargs)

File ~/.virtualenvs/dvc_s3/lib/python3.10/site-packages/fsspec/spec.py:1215, in AbstractFileSystem.rm(self, path, recursive, maxdepth)
   1200 def rm(self, path, recursive=False, maxdepth=None):
   1201     """Delete files.
   1202 
   1203     Parameters
   (...)
   1213         possible.
   1214     """
-> 1215     path = self.expand_path(path, recursive=recursive, maxdepth=maxdepth)
   1216     for p in reversed(path):
   1217         self.rm_file(p)

File ~/.virtualenvs/dvc_s3/lib/python3.10/site-packages/fsspec/spec.py:1143, in AbstractFileSystem.expand_path(self, path, recursive, maxdepth, **kwargs)
   1140     raise ValueError("maxdepth must be at least 1")
   1142 if isinstance(path, (str, os.PathLike)):
-> 1143     out = self.expand_path([path], recursive, maxdepth)
   1144 else:
   1145     out = set()

File ~/.virtualenvs/dvc_s3/lib/python3.10/site-packages/fsspec/spec.py:1177, in AbstractFileSystem.expand_path(self, path, recursive, maxdepth, **kwargs)
   1175             out.add(p)
   1176 if not out:
-> 1177     raise FileNotFoundError(path)
   1178 return sorted(out)

FileNotFoundError: ['/data']

clenico avatar Jun 28 '24 08:06 clenico

For those who might be having the same issue, my workaround is the following: first call dvc_fs.repo.fetch(), which fills up the cache using asyncio (faster).

You can then call the get method, which will be faster because the cache is used. Note that the cache is not filled by the get call itself.
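The two-step workaround can be sketched as below. This is only an illustration of the call order described in this comment, not a recommended API: fetch_then_get is a hypothetical helper name, and a MagicMock stands in for the real DVCFileSystem so the snippet runs offline. Whether get actually reuses the fetched cache is the behavior reported here, not something this sketch verifies.

```python
from unittest.mock import MagicMock


def fetch_then_get(dvc_fs, src, dst, **get_kwargs):
    """Fill the cache via repo.fetch(), then copy files out with get()."""
    # Step 1: populate the cache (relies on dvc_fs.repo, as in the workaround above).
    dvc_fs.repo.fetch()
    # Step 2: copy the data to the local destination.
    dvc_fs.get(src, dst, **get_kwargs)


# Stand-in object; replace with DVCFileSystem(url, rev="v3") in real use.
fs = MagicMock()
fetch_then_get(fs, "data", "data", recursive=True)

# Names of the calls made on the stand-in, in order.
print([c[0] for c in fs.mock_calls])
```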

clenico avatar Jul 03 '24 09:07 clenico

dvcfs.repo is an internal detail of DVCFileSystem, so unfortunately I cannot help with it.

I looked into DVCFileSystem, and the fact that files are not cached is expected, since the cache directories are ephemeral for remote URLs. Even for local URLs, saving files to the cache would spend unnecessary cycles, and it can be much faster to just stream from remotes.

That said, there is an as-yet undocumented cache=True|False argument that you can pass to the get and get_file APIs to enable caching.
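A call using that argument might look like the sketch below. It assumes, based only on the comment above, that get accepts cache=True alongside recursive; the argument is undocumented, so its exact behavior may differ between DVC versions. The helper name get_with_cache is hypothetical, and a MagicMock replaces the real filesystem so the example runs without a repository.

```python
from unittest.mock import MagicMock


def get_with_cache(dvc_fs, src, dst, recursive=True):
    """Download src to dst, asking the filesystem to cache what it downloads."""
    # Assumption: cache=True is the undocumented flag mentioned in this thread.
    dvc_fs.get(src, dst, recursive=recursive, cache=True)


# Stand-in object; replace with DVCFileSystem(url, rev="v3") in real use.
fs = MagicMock()
get_with_cache(fs, "data", "data")

# Keyword arguments that were forwarded to get().
print(fs.get.call_args.kwargs)
```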

skshetry avatar Jul 03 '24 12:07 skshetry