
pull: pulling a single file is really slow when there are hundreds of other .dvc files

Open Marigold opened this issue 2 years ago • 9 comments

Bug Report

Description

We have about a thousand small files in DVC. We're using the Python API, though the CLI has the same issue. We often need to add or pull a single new file, so we use something like:

from dvc.repo import Repo
repo = Repo("repo_root")
repo.pull("my_file.csv.dvc")

This takes almost 10 seconds, because DVC internally loads all stages before pulling that single file. I'd expect this to be almost instant. Why does it have to go through all the other .dvc files? (My .dvcignore ignores as many files as possible, but the bottleneck is loading the .dvc files anyway.)

Thanks!

Environment information

Output of dvc doctor:

DVC version: 2.38.1 (pip)
---------------------------------
Platform: Python 3.9.14 on macOS-12.5-x86_64-i386-64bit
Subprojects:
	dvc_data = 0.28.4
	dvc_objects = 0.14.0
	dvc_render = 0.0.15
	dvc_task = 0.1.8
	dvclive = 1.2.2
	scmrepo = 0.1.4
Supports:
	http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2022.11.0, boto3 = 1.24.59)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: s3, https, s3
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc, git

Additional Information (if any):

Marigold avatar Jan 05 '23 17:01 Marigold

@Marigold do you save each file as a separate DVC object? To rephrase, does it mean that you have ~10K .dvc files?

shcheklein avatar Jan 05 '23 18:01 shcheklein

@shcheklein thanks for the prompt response! Yes, we store each file as a separate DVC object, though we have <1K .dvc files. Every file has its own (custom) metadata, so we thought this was the cleanest way. I was looking for ways to "ignore" the other files when doing dvc add/pull [target] (e.g. by monkey-patching DVCIgnore), but didn't find an easy solution.

Marigold avatar Jan 09 '23 07:01 Marigold

This is a side effect of how we build an index when used_objs() is called. It should not read everything on repo.pull("my_file.csv.dvc"); we already optimize for this in other cases.

skshetry avatar Jan 09 '23 09:01 skshetry

Thanks for the clarification, @skshetry. We've hacked around it by dynamically changing .dvcignore (I also tried subrepos, but ran into problems), so we're good for now, though it would be great if this worked fast out of the box.
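
Roughly, the hack looks like this (a minimal sketch; the helper name and layout are made up, not code from our repo):

```python
from pathlib import Path


def ignore_all_but(repo_root: str, keep_name: str) -> None:
    """Rewrite .dvcignore so DVC skips every .dvc file except the one
    we want to pull. Hypothetical helper sketching the hack described
    above; it clobbers any hand-maintained .dvcignore content."""
    root = Path(repo_root)
    patterns = [
        p.relative_to(root).as_posix()
        for p in sorted(root.rglob("*.dvc"))
        if p.name != keep_name
    ]
    (root / ".dvcignore").write_text("\n".join(patterns) + "\n")
```

After rewriting the ignore file, Repo(repo_root).pull(...) only has one visible .dvc file left to load.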

Marigold avatar Jan 09 '23 10:01 Marigold

@Marigold, regarding dvc add/import, DVC needs to build a graph to ensure there are no overlaps/duplications/cycles, which means it has to read all .dvc files. There is a way to skip this, by setting repo._skip_graph_checks = True, but that is broken for the same reason as above.

I'll create a PR to fix that problem; it should be fixed in future releases. Regarding push/pull, I'll try to look into it.

skshetry avatar Jan 09 '23 10:01 skshetry

Much appreciated, @skshetry! My hack with .dvcignore turned out to be a bad idea, so we're stuck there (it's not a blocker for us, just an annoying performance hit).

Marigold avatar Jan 09 '23 11:01 Marigold

@skshetry, did you have a chance to look into this, please? As we scale our data, it's becoming a bottleneck. If you don't have time for this, could you at least give me some hints on where to fix it (or a suggested workaround)?

Marigold avatar Mar 16 '23 11:03 Marigold

Let's keep this open as this has not been fixed. (Or, did you see any improvements?)

skshetry avatar Dec 27 '24 12:12 skshetry

did you see any improvements?

Nope, we switched to our own solution in the end. I debugged it for a while, and the overhead came from a couple of functions that looked like they could be easily cached.

Marigold avatar Jan 06 '25 16:01 Marigold