dvc-bench icon indicating copy to clipboard operation
dvc-bench copied to clipboard

6.68x regression in `test_sharing-add-noop`

Open dberenbaum opened this issue 3 years ago • 14 comments

Between 2.9.1 and 2.6.3, there was a 6.68x regression in the test_sharing-add-noop benchmark: Screen Shot 2022-01-20 at 9 18 21 AM

dberenbaum avatar Jan 20 '22 14:01 dberenbaum

checkout-noop benchmarks are also slower: Screen Shot 2022-01-20 at 9 22 08 AM Screen Shot 2022-01-20 at 9 21 50 AM

dberenbaum avatar Jan 20 '22 14:01 dberenbaum

This must have started happening from 2.8.0 and is expected, see https://github.com/iterative/dvc/pull/6648#issuecomment-924456165. '

skshetry avatar Jan 20 '22 15:01 skshetry

Do you mean for the checkout behavior specifically? Is it expected that it will also impact add-noop? And by that much?

dberenbaum avatar Jan 20 '22 20:01 dberenbaum

That change affected any command that internally used checkout, including add/checkout/push/pull. Before we used to be intelligent, and used modifiedTime of the directory to skip on noop. With that above change, we try to collect the directory contents anyway (@efiop, please correct me if I'm wrong here). So the regression could be well 1000x for a directory with a million of files and hence not very meaningful here.

skshetry avatar Jan 21 '22 04:01 skshetry

@skshetry Do we have any plan to address this yet?

dberenbaum avatar May 12 '22 20:05 dberenbaum

@skshetry Do we have any plan to address this yet?

yes, we are hoping to improve this by optimizing diff and staging.

skshetry avatar May 13 '22 05:05 skshetry

Before we used to be intelligent, and used modifiedTime of the directory to skip on noop.

Is there any plan to get back to this? Optimizing diff and staging will presumably still be slower than the previous behavior, right?

dberenbaum avatar May 13 '22 15:05 dberenbaum

Is there any plan to get back to this?

I think I was wrong about that, last time I tried it was always trying to stage, so I thought it was the state changes. Or maybe something has changed.

Right now, the time is spent on unnecessary file checkouts. DVC is correctly loading the cached .dir file /Tree from the state. You can see profiling data in the following image:

Screen Shot 2022-05-14 at 12 18 01

* Total runtime is 2.57s for comparison.

The plan is to optimize diff and staging so that we can avoid those unnecessary checkouts. And, in the long term, introduce parallel checkouts.

skshetry avatar May 14 '22 06:05 skshetry

dvc add-ing 1 million files (with 1KB each): Screen Shot 2022-05-15 at 18 01 22

./dump.prof Took 3h20m38s in total.

skshetry avatar May 15 '22 12:05 skshetry

Interesting, thanks @skshetry. Not sure what's going on with the benchmarks in that case. I'm not seeing the reported regression when testing with an unzipped http://images.cocodataset.org/zips/test2014.zip. In fact, it appears to have improved significantly.

It took over a minute to add in 2.6.3:

$ dvc add test2014; time dvc add test2014; dvc doctor
100% Adding...|████████████████████████████████████████████████|1/1 [00:28, 28.25s/file]

To track the changes with git, run:

        git add test2014.dvc
100% Adding...|████████████████████████████████████████████████|1/1 [01:03, 63.93s/file]

To track the changes with git, run:

        git add test2014.dvc
dvc add test2014  27.49s user 14.15s system 64% cpu 1:04.72 total
DVC version: 2.6.3
---------------------------------
Platform: Python 3.10.2 on macOS-12.3.1-arm64-arm-64bit
Supports:
        azure (adlfs = 2021.7.1, knack = 0.9.0, azure-identity = 1.7.1),
        gdrive (pydrive2 = 1.10.1),
        gs (gcsfs = 2021.7.0),
        hdfs (pyarrow = 7.0.0),
        webhdfs (hdfs = 2.5.8),
        http (requests = 2.27.1),
        https (requests = 2.27.1),
        s3 (s3fs = 2021.7.0, boto3 = 1.21.21),
        ssh (sshfs = 2021.7.1),
        oss (ossfs = 2021.7.5),
        webdav (webdav4 = 0.9.4),
        webdavs (webdav4 = 0.9.4)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

You are using dvc version 2.6.3; however, version 2.10.2 is available.
Find the latest release at https://github.com/iterative/dvc/releases/latest.

And it only took 22 seconds in 2.10.0:

$ dvc add test2014; time dvc add test2014; dvc doctor
100% Adding...|████████████████████████████████████████████████|1/1 [00:25, 25.18s/file]

To track the changes with git, run:

    git add test2014.dvc

To enable auto staging, run:

        dvc config core.autostage true
100% Adding...|████████████████████████████████████████████████|1/1 [00:22, 22.50s/file]

To track the changes with git, run:

    git add test2014.dvc

To enable auto staging, run:

        dvc config core.autostage true
dvc add test2014  6.01s user 10.39s system 71% cpu 22.907 total
DVC version: 2.10.0
---------------------------------
Platform: Python 3.10.2 on macOS-12.3.1-arm64-arm-64bit
Supports:
        azure (adlfs = 2022.4.0, knack = 0.9.0, azure-identity = 1.7.1),
        gdrive (pydrive2 = 1.10.1),
        gs (gcsfs = 2022.3.0),
        hdfs (fsspec = 2022.3.0, pyarrow = 7.0.0),
        webhdfs (fsspec = 2022.3.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.5),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.5),
        s3 (s3fs = 2022.3.0, boto3 = 1.21.21),
        ssh (sshfs = 2022.3.1),
        oss (ossfs = 2021.8.0),
        webdav (webdav4 = 0.9.4),
        webdavs (webdav4 = 0.9.4)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

You are using dvc version 2.10.0; however, version 2.10.2 is available.
Find the latest release at https://github.com/iterative/dvc/releases/latest.

dberenbaum avatar May 16 '22 19:05 dberenbaum

I'm not seeing the reported regression

need to investigate, but what differs between your setup and in the benchmark is the link type: reflink vs copy.

skshetry avatar May 17 '22 02:05 skshetry

Thanks. Using copy link type, I'm seeing a slight regression, but also it's way faster than it was for reflink. Is that expected?

2.6.3:

$ dvc config cache.type copy; echo foo >> test2014/foo; dvc add test2014; time dvc add test2014; dvc doctor
100% Adding...|████████████████████████████████████████████████|1/1 [00:26, 26.46s/file]

To track the changes with git, run:

        git add test2014.dvc
100% Adding...|████████████████████████████████████████████████|1/1 [00:04,  4.62s/file]

To track the changes with git, run:

        git add test2014.dvc
dvc add test2014  3.76s user 1.42s system 97% cpu 5.322 total
DVC version: 2.6.3
---------------------------------
Platform: Python 3.10.2 on macOS-12.3.1-arm64-arm-64bit
Supports:
        azure (adlfs = 2021.7.1, knack = 0.9.0, azure-identity = 1.7.1),
        gdrive (pydrive2 = 1.10.1),
        gs (gcsfs = 2021.7.0),
        hdfs (pyarrow = 7.0.0),
        webhdfs (hdfs = 2.5.8),
        http (requests = 2.27.1),
        https (requests = 2.27.1),
        s3 (s3fs = 2021.7.0, boto3 = 1.21.21),
        ssh (sshfs = 2021.7.1),
        oss (ossfs = 2021.7.5),
        webdav (webdav4 = 0.9.4),
        webdavs (webdav4 = 0.9.4)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

You are using dvc version 2.6.3; however, version 2.10.2 is available.
Find the latest release at https://github.com/iterative/dvc/releases/latest.

2.10.0:

$ dvc config cache.type copy; echo foo >> test2014/foo; dvc add test2014; time dvc add test2014; dvc doctor
100% Adding...|████████████████████████████████████████████████|1/1 [00:25, 25.22s/file]

To track the changes with git, run:

    git add test2014.dvc

To enable auto staging, run:

        dvc config core.autostage true
100% Adding...|████████████████████████████████████████████████|1/1 [00:07,  7.97s/file]

To track the changes with git, run:

    git add test2014.dvc

To enable auto staging, run:

        dvc config core.autostage true
dvc add test2014  4.70s user 3.15s system 91% cpu 8.582 total
DVC version: 2.10.0
---------------------------------
Platform: Python 3.10.2 on macOS-12.3.1-arm64-arm-64bit
Supports:
        azure (adlfs = 2021.7.1, knack = 0.9.0, azure-identity = 1.7.1),
        gdrive (pydrive2 = 1.10.1),
        gs (gcsfs = 2021.7.0),
        hdfs (fsspec = 2021.7.0, pyarrow = 7.0.0),
        webhdfs (fsspec = 2021.7.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.5),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.5),
        s3 (s3fs = 2021.7.0, boto3 = 1.21.21),
        ssh (sshfs = 2021.7.1),
        oss (ossfs = 2021.7.5),
        webdav (webdav4 = 0.9.4),
        webdavs (webdav4 = 0.9.4)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

You are using dvc version 2.10.0; however, version 2.10.2 is available.
Find the latest release at https://github.com/iterative/dvc/releases/latest.

dberenbaum avatar May 17 '22 15:05 dberenbaum

Thanks. Using copy link type, I'm seeing a slight regression, but also it's way faster than it was for reflink. Is that expected?

Even in copy linktype, dvc still tries to reflink first and fallbacks to copy.

skshetry avatar May 17 '22 17:05 skshetry

Even in copy linktype, dvc still tries to reflink first and fallbacks to copy.

Any idea why the operation was so much faster?

dberenbaum avatar May 17 '22 17:05 dberenbaum