DVCFileSystem.find very slow with many files
Bug Report
Calling find on DVCFileSystem with a repo of about 300,000 files takes about 70 seconds.
Description
Reproduce
This example retrieves all file paths and hashes in a given dataset directory:
from dvc.api import DVCFileSystem

fs = DVCFileSystem(repo_url, rev=rev)
files = fs.find("/Data", detail=True, dvc_only=True)
remote_files = {}
for file, info in files.items():
    hash = info['dvc_info']['md5']
    remote_files[file] = hash
Given that all this information is stored in a single internal JSON file, I'd expect this to take about as long as downloading that file plus a little housekeeping. On a repo with 300,000 files, it takes about 70 seconds.
Expected
This example does the same, but accesses internal data structures to make it faster:
import os

from dvc.api import DVCFileSystem
from dvc_data.hashfile.tree import Tree

fs = DVCFileSystem(repo_url, rev=rev)
key = fs._get_key_from_relative("Data")
_, dvc_fs, subkey = fs._get_subrepo_info(key)
entry = dvc_fs.fs.index._trie.get(subkey)
entry.obj = Tree.load(entry.remote, entry.hash_info)
remote_files = {}
for ikey, (_, hash_info) in entry.obj.iteritems():
    file = os.path.join(subkey[0], *ikey)
    hash = hash_info.value
    remote_files[file] = hash
On a repo with 300,000 files, this takes about 10 seconds. Most of that time is spent downloading.
A factor of 7 seems like too much overhead to me. Hopefully, this can be improved.
DVC version: 2.38.1 (conda)
---------------------------------
Platform: Python 3.9.15 on Windows-10-10.0.19045-SP0
Subprojects:
dvc_data = 0.28.4
dvc_objects = 0.14.0
dvc_render = 0.0.15
dvc_task = 0.1.8
dvclive = 1.3.0
scmrepo = 0.1.4
Supports:
http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
ssh (sshfs = 0.0.0)
I just tried this with DVC version 3.6.0, and it seems to have gotten worse: on a repo with 500,000 tracked files, DVCFileSystem.find("/", ...) takes about 170 seconds.
The code below, which uses internal data structures, takes about 9 seconds. (The code in the first post no longer works on 3.6.0.)
import os

from dvc.api import DVCFileSystem
from dvc_data.hashfile.tree import Tree

fs = DVCFileSystem(repo_url, rev=rev)
remote_files = {}
for dir in fs.find("/", maxdepth=1, detail=False, withdirs=True, dvc_only=True):
    key = fs._get_key_from_relative(dir)
    repo, dvc_fs, subkey = fs._get_subrepo_info(key)
    entry = dvc_fs.fs.index._trie.get(subkey)
    storage = dvc_fs.fs.index.storage_map[key].remote
    obj = Tree.load(storage.odb, entry.hash_info, hash_name=storage.odb.hash_name)
    for ikey, (_, hash_info) in obj.iteritems():
        path = os.path.join(subkey[0], *ikey)
        path = path.replace("\\", "/")
        if path[0] == "/":
            path = path[1:]
        hash = hash_info.value
        remote_files[path] = hash
Here is the output of dvc doctor on the client side:
DVC version: 3.6.0 (conda)
--------------------------
Platform: Python 3.10.12 on Windows-10-10.0.19045-SP0
Subprojects:
dvc_data = 2.6.0
dvc_objects = 0.24.1
dvc_render = 0.5.3
dvc_task = 0.3.0
scmrepo = 1.1.0
Supports:
http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
ssh (sshfs = 2023.7.0)
Config:
Global: C:\Users\deeplearning\AppData\Local\iterative\dvc
System: C:\ProgramData\iterative\dvc
@grahameth, can you please share the cProfile data?
Here you go: fs_find_official.zip. This is the DVCFileSystem.find("/", detail=True, dvc_only=True) call that takes multiple minutes.
It looks like caching something in a SQLite database takes most of the time. Maybe there is a way to batch these calls? Or maybe a parameter could be added to find() to skip this behavior.
For comparison, here is the profile data of my Tree.load code: fs_find_unofficial.zip
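For reference, a minimal sketch of how such profile data can be collected with the standard library's cProfile (repo_url is a placeholder; the exact method used for the attachments may differ):

import cProfile
import pstats

from dvc.api import DVCFileSystem

repo_url = "path/or/url/to/repo"  # placeholder

fs = DVCFileSystem(repo_url)

profiler = cProfile.Profile()
profiler.enable()
fs.find("/", detail=True, dvc_only=True)  # the slow call being investigated
profiler.disable()

# Dump raw stats for sharing and print the slowest cumulative entries.
profiler.dump_stats("fs_find_official.prof")
pstats.Stats(profiler).sort_stats("cumulative").print_stats(30)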
Need to investigate if this is also causing performance issues for other operations like get, import, etc.
Pretty much every operation in this repo feels too slow. Let me know if you want anything specific.
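For comparison, a minimal sketch of how an individual operation such as get could be timed in the same way (repo_url and the file paths are placeholders):

import time

from dvc.api import DVCFileSystem

repo_url = "path/or/url/to/repo"  # placeholder

fs = DVCFileSystem(repo_url)

start = time.time()
fs.get("data/0.txt", lpath="0.txt")  # single-file download
print(f"fs.get took {time.time() - start:.2f}s")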
@grahameth How large are the files in that directory?
@iterative/dvc Does anyone have a Windows machine to test the scenario? I don't see anything close to those times on my mac.
File size shouldn't matter.
The comparison of dvcfs and direct index manipulation is not totally fair though (DataFileSystem is closer to that). I started looking into it, but got distracted with other stuff. Still plan on taking a look.
@efiop I see much faster times than what's reported, so I'm not sure it's worth a lot of time until we identify what's causing such slowness for @grahameth specifically.
@grahameth How large are the files in that directory?
Well, it's about 500,000 files. The largest file is 600 MB, many are a couple of MB to 100 MB, and most are just a few KB. In total, it's about 900 GB. I also have an Ubuntu machine running this repo, and the commands all feel very slow there too. But I haven't timed it there yet. When I find time on Wednesday next week or so, I will try to make a reproducible repo.
@grahameth Please also show dvc doctor inside the repo. The ones you posted before are from outside the repo and missing some important information.
OK, sorry for the delay.
Here is the output of dvc doctor on the machine with the repository:
DVC version: 3.24.0 (deb)
-------------------------
Platform: Python 3.10.8 on Linux-5.19.0-43-generic-x86_64-with-glibc2.35
Subprojects:
Supports:
azure (adlfs = 2023.8.0, knack = 0.11.0, azure-identity = 1.14.0),
gdrive (pydrive2 = 1.17.0),
gs (gcsfs = 2023.6.0),
hdfs (fsspec = 2023.6.0, pyarrow = 13.0.0),
http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
oss (ossfs = 2023.8.0),
s3 (s3fs = 2023.6.0, boto3 = 1.28.17),
ssh (sshfs = 2023.7.0),
webdav (webdav4 = 0.9.8),
webdavs (webdav4 = 0.9.8),
webhdfs (fsspec = 2023.6.0)
Config:
Global: /home/***/.config/dvc
System: /etc/xdg/dvc
Cache types: reflink, hardlink, symlink
Cache directory: btrfs on /dev/mapper/ubuntu--vg-***cache--lv
Caches: local
Remotes: local, ssh, ssh, local
Workspace directory: btrfs on /dev/mapper/ubuntu--vg-***cache--lv
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/919a6b232ecf764657652ca15cc336f0
I have only censored the username.
I have also tried to make a Python script that reproduces the problem:
# pip install "dvc[ssh]"==3.6.0
import os
import time

import dvc
from dvc.api import DVCFileSystem
from dvc_data.hashfile.tree import Tree

print("dvc:", dvc.__version__)

def command(cmd):
    print(cmd)
    os.system(cmd)

# Create a test repository with a lot of files.
# Needs to run only once.
if True:
    command("mkdir test_repro")
    os.chdir('test_repro')
    command("git init")
    command("git config user.email \"[email protected]\"")
    command("git config user.name \"Your Name\"")
    command("dvc init")
    command("mkdir data")
    print("Creating dummy files...")
    time_start = time.time()
    num_files = 400000
    for i in range(num_files):
        if i % 100000 == 0:
            print(f"{i}/{num_files}")
        with open(f"data/{i}.txt", "w") as file:
            file.write(f"test {i}")
    print(f"Done. time: {time.time() - time_start}s")
    time_start = time.time()
    command("dvc add data")
    print(f"dvc add done. time: {time.time() - time_start}s")
    command("git add .gitignore")
    command("git add data.dvc")
    command("git commit -m\"First commit\"")
    os.chdir('..')

repo_url = r"test_repro/.git"
#repo_url = r"test.git"
#remote_url = 'ssh://test@localhost:2222/home/test/test_repro/.dvc/cache'
config = {}
#if remote_url is not None:
#    config = {"core": {"remote": "myremote"}, "remote": {'myremote': {'url': remote_url}}}
fs = DVCFileSystem(repo_url, config=config)

print("get remote files...")
time_start = time.time()
remote_files = {}
if False:
    # Official version, but very slow.
    # On Windows: remote files: 400000 time: 16.44015669822693s
    files = fs.find("/", detail=True, dvc_only=True)
    for f, info in files.items():
        path = f.replace("\\", "/")
        if path[0] == "/":
            path = path[1:]
        hash = info['dvc_info']['md5']
        remote_files[path] = hash
else:
    # Faster version. Uses internal objects.
    # On Windows: remote files: 400000 time: 6.956979990005493s
    for dir in fs.find("/", maxdepth=1, detail=False, withdirs=True, dvc_only=True):
        key = fs._get_key_from_relative(dir)
        repo, dvc_fs, subkey = fs._get_subrepo_info(key)
        entry = dvc_fs.fs.index._trie.get(subkey)
        storage = dvc_fs.fs.index.storage_map[key].remote
        obj = Tree.load(storage.odb, entry.hash_info, hash_name=storage.odb.hash_name)
        for ikey, (_, hash_info) in obj.iteritems():
            path = os.path.join(subkey[0], *ikey)
            path = path.replace("\\", "/")
            if path[0] == "/":
                path = path[1:]
            hash = hash_info.value
            remote_files[path] = hash

print(f"remote files: {len(remote_files)} time: {time.time() - time_start}s")
fs.get("data/0.txt", lpath="0.txt")
This script shows a time difference on Windows (16s vs 6s), but it's nowhere near the 3 minutes that I experience on the production repository. In production I use SSH to access the remote repository; might that be the issue? The production repository also has much bigger files, but I don't see how that could cause this.
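If anyone wants to check the SSH theory against the synthetic repo, a minimal sketch (this mirrors the commented-out remote lines in the script above; the URL is a placeholder):

from dvc.api import DVCFileSystem

remote_url = "ssh://user@host/path/to/repo/.dvc/cache"  # placeholder SSH remote
config = {
    "core": {"remote": "myremote"},
    "remote": {"myremote": {"url": remote_url}},
}
fs = DVCFileSystem("test_repro/.git", config=config)

# Time this call with the SSH remote configured and compare to the local runs above.
files = fs.find("/", detail=True, dvc_only=True)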
@grahameth The comparison approach on non-prod data is probably not the best way to go here and might be misleading. I don't see anything obvious with your cprofile report above, so this is likely about investigating and optimizing multiple things in our implementation in your particular environment. It has been quite some time already and your hack probably works fine for you for now, so unfortunately we probably won't have time to do that ourselves any time soon :(