dvc
dvc.api DVCFileSystem does not use cache
Hi! I am encountering some issues while using DVC. We would like to be able to fetch a versioned dataset using DVC.
Help would be greatly appreciated!
Thanks a lot for the work you put into this!
Bug Report
Description
The DVC documentation states that the cache is reused when calling DVCFileSystem.get. However, running the function twice re-downloads everything.
Reproduce
Given a repository configured as:
[core]
    analytics = false
    autostage = true
    remote = storage
['remote "storage"']
    url = s3://bucket/dataset/
    endpointurl = http://minio.endpoint.com:8910
    access_key_id = user
    secret_access_key = pass
Execute this inside a Jupyter notebook:
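from dvc.api import DVCFileSystem

# `token` must be defined beforehand (not shown in the original report)
url = f"https://user:{token}@path/to/repo"
dvc_fs = DVCFileSystem(url, rev="v3")

%%time
# First call: downloads the data
dvc_fs.get("data", "data", recursive=True)

%%time
# Second call: expected to reuse the cache, but downloads everything again
dvc_fs.get("data", "data", recursive=True)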
Expected
I expected the second call to be much faster, since it should only need to check that the files are already present.
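Outside a notebook, the same comparison can be made without the `%%time` magic. This is a minimal sketch, assuming `fs` is any object exposing a `get(rpath, lpath, recursive=...)` method, such as the `dvc_fs` instance from the reproduction; `time_get` is a hypothetical helper name, not part of DVC.

```python
import time


def time_get(fs, rpath, lpath, repeats=2):
    """Call fs.get() `repeats` times and return each call's wall-clock duration.

    `fs` is assumed to behave like the DVCFileSystem instance in the report.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fs.get(rpath, lpath, recursive=True)
        durations.append(time.perf_counter() - start)
    return durations
```

Calling `time_get(dvc_fs, "data", "data")` returns one duration per call; if the cache were reused, the second value would be expected to be noticeably smaller than the first.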
Environment information
Python 3.10.12 dvc==3.13.3
Output of dvc doctor:
9:34:01 › dvc doctor
DVC version: 3.13.3 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-6.5.0-41-generic-x86_64-with-glibc2.35
Subprojects:
dvc_data = 2.12.2
dvc_objects = 0.25.0
dvc_render = 0.5.3
dvc_task = 0.3.0
scmrepo = 1.1.0
Supports:
http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
s3 (s3fs = 2023.6.0, boto3 = 1.28.17)
Config:
Global: /home/imagedpt/.config/dvc
System: /etc/xdg/xdg-xmonad/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p2
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/nvme0n1p2
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/a51859b3cd5d2e970e30c86866fa3b40
3.13.3 is a very old dvc version. Could you please try with the latest version?
I somehow ended up with an old version of DVC... it works like a charm on dvc[s3]==3.51.2.
EDIT: I spoke too soon.
The code presented above does not populate the cache. From my understanding, the cache is created when running:
dvc_fs.repo.pull()
This fails with the following exception (but still creates the cache at /tmp/tmpixxn04mvdvc-cache/files):
dvc_fs.repo.pull(recursive=True, allow_missing=True)
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Cell In[9], line 1
----> 1 dvc_fs.repo.pull(recursive=True, allow_missing=True)
File ~/.virtualenvs/dvc_s3/lib/python3.10/site-packages/dvc/repo/__init__.py:58, in locked.<locals>.wrapper(repo, *args, **kwargs)
55 @wraps(f)
56 def wrapper(repo, *args, **kwargs):
57 with lock_repo(repo):
---> 58 return f(repo, *args, **kwargs)
File ~/.virtualenvs/dvc_s3/lib/python3.10/site-packages/dvc/repo/pull.py:42, in pull(self, targets, jobs, remote, all_branches, with_deps, all_tags, force, recursive, all_commits, run_cache, glob, allow_missing)
30 processed_files_count = self.fetch(
31 expanded_targets,
32 jobs,
(...)
39 run_cache=run_cache,
40 )
41 try:
---> 42 stats = self.checkout(
43 targets=expanded_targets,
44 with_deps=with_deps,
45 force=force,
46 recursive=recursive,
47 allow_missing=allow_missing,
48 )
49 except CheckoutError as exc:
50 exc.stats["fetched"] = processed_files_count
File ~/.virtualenvs/dvc_s3/lib/python3.10/site-packages/dvc/repo/__init__.py:58, in locked.<locals>.wrapper(repo, *args, **kwargs)
55 @wraps(f)
56 def wrapper(repo, *args, **kwargs):
57 with lock_repo(repo):
---> 58 return f(repo, *args, **kwargs)
File ~/.virtualenvs/dvc_s3/lib/python3.10/site-packages/dvc/repo/checkout.py:178, in checkout(self, targets, with_deps, force, relink, recursive, allow_missing, **kwargs)
175 out_path = self.fs.join(self.root_dir, *key)
177 if out_path in failed:
--> 178 self.fs.remove(out_path, recursive=True)
179 else:
180 self.state.save_link(out_path, self.fs)
File ~/.virtualenvs/dvc_s3/lib/python3.10/site-packages/dvc_objects/fs/base.py:567, in FileSystem.rm(self, path, recursive, **kwargs)
561 def rm(
562 self,
563 path: Union[AnyFSPath, list[AnyFSPath]],
564 recursive: bool = False,
565 **kwargs,
566 ) -> None:
--> 567 self.fs.rm(path, recursive=recursive, **kwargs)
File ~/.virtualenvs/dvc_s3/lib/python3.10/site-packages/fsspec/spec.py:1215, in AbstractFileSystem.rm(self, path, recursive, maxdepth)
1200 def rm(self, path, recursive=False, maxdepth=None):
1201 """Delete files.
1202
1203 Parameters
(...)
1213 possible.
1214 """
-> 1215 path = self.expand_path(path, recursive=recursive, maxdepth=maxdepth)
1216 for p in reversed(path):
1217 self.rm_file(p)
File ~/.virtualenvs/dvc_s3/lib/python3.10/site-packages/fsspec/spec.py:1143, in AbstractFileSystem.expand_path(self, path, recursive, maxdepth, **kwargs)
1140 raise ValueError("maxdepth must be at least 1")
1142 if isinstance(path, (str, os.PathLike)):
-> 1143 out = self.expand_path([path], recursive, maxdepth)
1144 else:
1145 out = set()
File ~/.virtualenvs/dvc_s3/lib/python3.10/site-packages/fsspec/spec.py:1177, in AbstractFileSystem.expand_path(self, path, recursive, maxdepth, **kwargs)
1175 out.add(p)
1176 if not out:
-> 1177 raise FileNotFoundError(path)
1178 return sorted(out)
FileNotFoundError: ['/data']
For anyone running into the same issue, my workaround is the following:
First call dvc_fs.repo.fetch(), which populates the cache using asyncio (faster).
You can then call the get method, which will be faster because the cache is used. Note that the get call itself does not populate the cache.
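The workaround described above can be sketched as follows. It assumes `fs` is a `DVCFileSystem` instance like `dvc_fs` in the reproduction; `fetch_then_get` is a hypothetical helper name, and `fs.repo` is an internal attribute whose behaviour may change between DVC versions.

```python
def fetch_then_get(fs, rpath, lpath):
    """Populate the DVC cache first, then copy the files to `lpath`."""
    # fs.repo is an internal attribute of DVCFileSystem, not public API;
    # calling fetch() on it is the workaround reported in this thread.
    fs.repo.fetch()
    # Per the report above, get() runs faster once the cache is populated.
    fs.get(rpath, lpath, recursive=True)
```

With the objects from the reproduction, this would be called as `fetch_then_get(dvc_fs, "data", "data")`.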
dvc_fs.repo is an internal attribute of DVCFileSystem, so unfortunately I cannot help with it.
I looked into DVCFileSystem, and the fact that files are not cached is expected, since the cache directories are ephemeral for remote URLs. Even for local URLs, we would spend unnecessary cycles saving files to the cache, and it can be much faster to just stream from remotes.
That said, there is a not-yet-documented cache=True|False argument that you can pass to the get and get_file APIs, which enables caching.
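A minimal sketch of passing that argument, based only on the comment above: it assumes `fs` is a `DVCFileSystem` instance and that `get` accepts a `cache` keyword in the installed DVC version. `get_with_cache` is a hypothetical helper name.

```python
def get_with_cache(fs, rpath, lpath, recursive=True):
    """Download `rpath` to `lpath`, asking DVC to cache the files as well."""
    # `cache=True` is the undocumented keyword mentioned in this thread;
    # treat it as an assumption and check it against your DVC version.
    fs.get(rpath, lpath, recursive=recursive, cache=True)
```

With the objects from the reproduction, this would be called as `get_with_cache(dvc_fs, "data", "data")`. Since the cache directory is ephemeral for remote URLs, as noted above, whether a later call benefits depends on that directory still existing.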