
DVCFileSystem.find very slow with many files

Open · grahameth opened this issue 2 years ago · 17 comments

Bug Report

Calling find on DVCFileSystem with a repo of about 300,000 files takes about 70 seconds.

Description

Reproduce

This example retrieves all file paths and hashes in a given dataset directory:

from dvc.api import DVCFileSystem

# repo_url and rev are placeholders for the repository URL and revision
fs = DVCFileSystem(repo_url, rev=rev)
files = fs.find("/Data", detail=True, dvc_only=True)
remote_files = {}
for file, info in files.items():
    hash = info['dvc_info']['md5']
    remote_files[file] = hash

Given that all this information is stored in a single internal JSON file, I'd expect this to take about as long as downloading that file plus a little housekeeping. On a repo with 300,000 files, it takes about 70 seconds.
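
For reference, a minimal sketch of how such a timing can be measured with just the standard library (repo_url and rev are placeholders, as above):

import time

from dvc.api import DVCFileSystem

fs = DVCFileSystem(repo_url, rev=rev)

start = time.perf_counter()
files = fs.find("/Data", detail=True, dvc_only=True)
elapsed = time.perf_counter() - start

# On a repo with ~300,000 tracked files this reports roughly 70 seconds.
print(f"find() returned {len(files)} entries in {elapsed:.1f}s")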

Expected

This example does the same, but accesses internal data structures to make it faster:

import os

from dvc.api import DVCFileSystem
from dvc_data.hashfile.tree import Tree

fs = DVCFileSystem(repo_url, rev=rev)
key = fs._get_key_from_relative("Data")
_, dvc_fs, subkey = fs._get_subrepo_info(key)
entry = dvc_fs.fs.index._trie.get(subkey)
entry.obj = Tree.load(entry.remote, entry.hash_info)

remote_files = {}
for ikey, (_, hash_info) in entry.obj.iteritems():
    file = os.path.join(subkey[0], *ikey)
    hash = hash_info.value
    remote_files[file] = hash

On a repo with 300,000 files, this takes about 10 seconds. Most of that time is spent downloading.

A factor of 7 seems like too much overhead to me. Hopefully, this can be improved.

DVC version: 2.38.1 (conda)
---------------------------------
Platform: Python 3.9.15 on Windows-10-10.0.19045-SP0
Subprojects:
        dvc_data = 0.28.4
        dvc_objects = 0.14.0
        dvc_render = 0.0.15
        dvc_task = 0.1.8
        dvclive = 1.3.0
        scmrepo = 0.1.4
Supports:
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        ssh (sshfs = 0.0.0)

grahameth avatar Jan 27 '23 13:01 grahameth

I just tried this with DVC version 3.6.0, and it seems to have gotten worse: on a repo with 500,000 tracked files, DVCFileSystem.find("/", ...) takes about 170 seconds.

The code below, which uses internal data structures, takes about 9 seconds. (The code from the first post no longer works on 3.6.0.)

import os

from dvc.api import DVCFileSystem
from dvc_data.hashfile.tree import Tree

fs = DVCFileSystem(repo_url, rev=rev)
remote_files = {}
for dir in fs.find("/", maxdepth=1, detail=False, withdirs=True, dvc_only=True):
    key = fs._get_key_from_relative(dir)
    repo, dvc_fs, subkey = fs._get_subrepo_info(key)
    entry = dvc_fs.fs.index._trie.get(subkey)
    storage = dvc_fs.fs.index.storage_map[key].remote
    obj = Tree.load(storage.odb, entry.hash_info, hash_name=storage.odb.hash_name)

    for ikey, (_, hash_info) in obj.iteritems():
        path = os.path.join(subkey[0], *ikey)
        path = path.replace("\\", "/")
        if path[0] == "/":
            path = path[1:]
        hash = hash_info.value
        remote_files[path] = hash

Here is the output of dvc doctor on the client side:

DVC version: 3.6.0 (conda)
--------------------------
Platform: Python 3.10.12 on Windows-10-10.0.19045-SP0
Subprojects:
        dvc_data = 2.6.0
        dvc_objects = 0.24.1
        dvc_render = 0.5.3
        dvc_task = 0.3.0
        scmrepo = 1.1.0
Supports:
        http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        ssh (sshfs = 2023.7.0)
Config:
        Global: C:\Users\deeplearning\AppData\Local\iterative\dvc
        System: C:\ProgramData\iterative\dvc

grahameth avatar Sep 05 '23 10:09 grahameth

@grahameth, can you please share the cProfile data?

skshetry avatar Sep 05 '23 11:09 skshetry

Here you go: fs_find_official.zip. This is the DVCFileSystem.find("/", detail=True, dvc_only=True) call that takes multiple minutes.
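
(For reference, a comparable profile can be collected with the standard-library cProfile module, roughly like this; the output filename is just an example, and repo_url/rev are placeholders.)

import cProfile
import pstats

from dvc.api import DVCFileSystem

fs = DVCFileSystem(repo_url, rev=rev)

profiler = cProfile.Profile()
profiler.enable()
files = fs.find("/", detail=True, dvc_only=True)
profiler.disable()

# Dump the raw stats for sharing and print the 20 most expensive calls.
profiler.dump_stats("fs_find_official.prof")
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)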

It looks like most of the time is spent caching something in a SQLite database. Maybe there is a way to batch those calls, or a parameter could be added to find() to skip this behavior?

For comparison, here is the profile data for my Tree.load code: fs_find_unofficial.zip

grahameth avatar Sep 07 '23 13:09 grahameth

Need to investigate if this is also causing performance issues for other operations like get, import, etc.

dberenbaum avatar Oct 03 '23 12:10 dberenbaum

Pretty much every operation in this repo feels too slow. Let me know if you want anything specific.

grahameth avatar Oct 04 '23 14:10 grahameth

@grahameth How large are the files in that directory?

@iterative/dvc Does anyone have a Windows machine to test the scenario? I don't see anything close to those times on my mac.

dberenbaum avatar Oct 05 '23 17:10 dberenbaum

File size shouldn't matter.

The comparison of dvcfs and direct index manipulation is not totally fair though (DataFileSystem is closer to that). I started looking into it, but got distracted with other stuff. Still plan on taking a look.

efiop avatar Oct 05 '23 19:10 efiop

@efiop I see much faster times than what's reported, so I'm not sure it's worth a lot of time until we identify what's causing such slowness for @grahameth specifically.

dberenbaum avatar Oct 05 '23 19:10 dberenbaum

@grahameth How large are the files in that directory?

Well, it's about 500,000 files. The largest file is 600 MB, many are a couple of MB up to 100 MB, and most are just a few KB. In total it's about 900 GB. I also have an Ubuntu machine running this repo, and the commands all feel very slow there too, but I haven't timed it there yet. When I find time on Wednesday next week or so, I will try to make a reproducible repo.

grahameth avatar Oct 06 '23 16:10 grahameth

@grahameth Please also show dvc doctor from inside the repo. The ones you posted before are from outside the repo and are missing some important information.

efiop avatar Oct 06 '23 17:10 efiop

OK, sorry for the delay.

Here is the output of dvc doctor on the machine with the repository:

DVC version: 3.24.0 (deb)
-------------------------
Platform: Python 3.10.8 on Linux-5.19.0-43-generic-x86_64-with-glibc2.35
Subprojects:

Supports:
        azure (adlfs = 2023.8.0, knack = 0.11.0, azure-identity = 1.14.0),
        gdrive (pydrive2 = 1.17.0),
        gs (gcsfs = 2023.6.0),
        hdfs (fsspec = 2023.6.0, pyarrow = 13.0.0),
        http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        oss (ossfs = 2023.8.0),
        s3 (s3fs = 2023.6.0, boto3 = 1.28.17),
        ssh (sshfs = 2023.7.0),
        webdav (webdav4 = 0.9.8),
        webdavs (webdav4 = 0.9.8),
        webhdfs (fsspec = 2023.6.0)
Config:
        Global: /home/***/.config/dvc
        System: /etc/xdg/dvc
Cache types: reflink, hardlink, symlink
Cache directory: btrfs on /dev/mapper/ubuntu--vg-***cache--lv
Caches: local
Remotes: local, ssh, ssh, local
Workspace directory: btrfs on /dev/mapper/ubuntu--vg-***cache--lv
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/919a6b232ecf764657652ca15cc336f0

I have only censored the username.

I have also tried to write a Python script that reproduces the problem:

# pip install "dvc[ssh]"==3.6.0

import os
import time

import dvc
from dvc.api import DVCFileSystem
from dvc_data.hashfile.tree import Tree

print("dvc:", dvc.__version__)

def command(cmd):
    print(cmd)
    os.system(cmd)

# Create a test repository with a lot of files.
# This only needs to run once.
if True:
    command("mkdir test_repro")
    os.chdir('test_repro')
    command("git init")
    command("git config user.email \"[email protected]\"")
    command("git config user.name \"Your Name\"")

    command("dvc init")
    command("mkdir data")

    print("Creating dummy files...")
    time_start = time.time()
    num_files = 400000
    for i in range(num_files):
        if i%100000 == 0:
            print(f"{i}/{num_files}")
        with open(f"data/{i}.txt", "w") as file:
            file.write(f"test {i}")
    print(f"Done. time: {time.time() - time_start}s")

    time_start = time.time()
    command("dvc add data")
    print(f"dvc add done. time: {time.time() - time_start}s")
    command("git add .gitignore")
    command("git add data.dvc")
    command("git commit -m\"First commit\"")
    os.chdir('..')

repo_url = r"test_repro/.git"
#repo_url = r"test.git"
#remote_url = 'ssh://test@localhost:2222/home/test/test_repro/.dvc/cache'


config = {}
#if remote_url is not None:
#    config = {"core":{"remote":"myremote"}, "remote":{'myremote': {'url': remote_url}}}
fs = DVCFileSystem(repo_url, config=config)

print("get remote files...")
time_start = time.time()
remote_files = {}
if False:
    # Official version, but very slow.
    # On Windows: remote files: 400000 time: 16.44015669822693s
    files = fs.find("/", detail=True, dvc_only=True)
    for f, info in files.items():
        path = f
        path = path.replace("\\", "/")
        if path[0] == "/":
            path = path[1:]
        hash = info['dvc_info']['md5']
        remote_files[path] = hash
else:
    # Faster version. Uses internal objects.
    # On Windows: remote files: 400000 time: 6.956979990005493s

    for dir in fs.find("/", maxdepth=1, detail=False, withdirs=True, dvc_only=True):
        key = fs._get_key_from_relative(dir)
        repo, dvc_fs, subkey = fs._get_subrepo_info(key)
        entry = dvc_fs.fs.index._trie.get(subkey)
        storage = dvc_fs.fs.index.storage_map[key].remote
        obj = Tree.load(storage.odb, entry.hash_info, hash_name=storage.odb.hash_name)

        for ikey, (_, hash_info) in obj.iteritems():
            path = os.path.join(subkey[0], *ikey)
            path = path.replace("\\", "/")
            if path[0] == "/":
                path = path[1:]
            hash = hash_info.value
            remote_files[path] = hash
print(f"remote files: {len(remote_files)} time: {time.time() - time_start}s")

fs.get("data/0.txt", lpath="0.txt")

This script shows a time difference on Windows (16 s vs 6 s), but it's nowhere near the 3 minutes that I experience on the production repository. In production I use SSH to access the remote repository; might that be the issue? The production repository also has much bigger files, but I don't see how that could cause this issue.
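
(To test the SSH theory, the same script can be pointed at an SSH remote by enabling the commented-out config near the top, roughly like this; the URL below is just the placeholder from the script and needs to point at a reachable host.)

# Sketch: route DVCFileSystem through an SSH remote instead of the local cache.
remote_url = 'ssh://test@localhost:2222/home/test/test_repro/.dvc/cache'  # placeholder
config = {"core": {"remote": "myremote"}, "remote": {"myremote": {"url": remote_url}}}
fs = DVCFileSystem(repo_url, config=config)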

grahameth avatar Oct 23 '23 09:10 grahameth

@grahameth The comparison approach on non-prod data is probably not the best way to go here and might be misleading. I don't see anything obvious in your cProfile report above, so this is likely about investigating and optimizing multiple things in our implementation in your particular environment. It has been quite some time already and your hack probably works fine for you for now, so unfortunately we probably won't have time to dig into this ourselves any time soon :(

efiop avatar Feb 19 '24 13:02 efiop