`dvc data status` takes ≈1min, even when nothing has changed

Bug Report

Description

hudcostreets/ctbk.dev contains 2,105 .parquet.dvc files (roughly: ≈15 datasets, each partitioned into ≈150 monthly files since 2013) totaling 62G.

dvc data status takes ≈1min (even on repeated incantations, when nothing has changed), apparently using just 1 thread. I also experimented with simple DVC pipelines, and some of those dvc commands also seemed to index the repo for ≈1min before doing anything.

Is this expected? I'd think DVC could observe file mtimes and skip files quickly on repeat invocations, at a minimum.

Reproduce

On an m6a.8xlarge instance with:

  • Ubuntu 24.04 amd64 (ami-0731becbf832f281e)
  • 384G EBS gp3 volume with ext4 FS
  • Python 3.12.9
# Clone / Install
git clone https://github.com/hudcostreets/ctbk.dev
cd ctbk.dev
git checkout b787b184
pip install -e .

# Before `pull`ing:
time dvc data status
# real	1m10.727s
# user	1m1.733s
# sys	0m7.713s
time dvc data status
# real	1m10.522s
# user	1m1.988s
# sys	0m7.733s

dvc pull  # ≈20mins, 62G downloaded to .dvc/cache + 62G applied in worktree

# After `pull`ing:
time dvc data status
# real	0m59.191s
# user	0m51.454s
# sys	0m7.089s
time dvc data status
# real	0m58.269s
# user	0m50.611s
# sys	0m7.263s

Expected

  • dvc data status should be ≈instantaneous when invoked a 2nd time.
  • If I've only changed one file, it should be correspondingly fast.
  • Indexing(?) the first time should be multi-threaded (see the sketch after this list for what I mean)
    • user is ≈90% of real in the timings above, implying mostly single-threaded processing; monitoring in htop corroborates this.
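
Here's a rough sketch of the kind of parallel hashing I have in mind (illustration only, not DVC's internals; md5 and the 16-thread pool are arbitrary choices):

import hashlib
from concurrent.futures import ThreadPoolExecutor

def md5_of(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return path, h.hexdigest()

def hash_all(paths, workers=16):
    # Hash files concurrently; hashlib releases the GIL on large buffers,
    # so this keeps more than one core busy across thousands of .parquet files.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(md5_of, paths))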

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 3.60.0 (pip)
-------------------------
Platform: Python 3.12.9 on Linux-6.8.0-1029-aws-x86_64-with-glibc2.39
Subprojects:
	dvc_data = 3.16.10
	dvc_objects = 5.1.1
	dvc_render = 1.0.2
	dvc_task = 0.40.2
	scmrepo = 3.3.11
Supports:
	http (aiohttp = 3.12.9, aiohttp-retry = 2.9.1),
	https (aiohttp = 3.12.9, aiohttp-retry = 2.9.1),
	s3 (s3fs = 2025.5.1, boto3 = 1.37.3),
	ssh (sshfs = 2025.2.0)
Config:
	Global: /home/ubuntu/.config/dvc
	System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p1
Caches: local
Remotes: s3, ssh
Workspace directory: ext4 on /dev/nvme0n1p1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/c6ba21dd4999cafbf0b94985971d7666

Additional Information (if any):

I'm working on an ext4 EBS volume:

cat /etc/fstab | column -t -s$'\t'
# LABEL=cloudimg-rootfs  /          ext4   discard,commit=30,errors=remount-ro  0 1
# LABEL=BOOT             /boot      ext4   defaults                             0 2
# LABEL=UEFI             /boot/efi  vfat   umask=0077                           0 1

I also see similar performance on an M1 MacBook's APFS data volume.

xref https://github.com/iterative/dvc/issues/9428

ryan-williams avatar Jun 09 '25 02:06 ryan-williams

Is this expected? I'd think DVC could observe file mtimes and skip files quickly on repeat invocations, at a minimum.

That's not expected. DVC does cache hashes based on mtimes as you suggested.
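
(Conceptually, the caching works like this simplified sketch: check a file's mtime and size first, and only rehash when they change. The in-memory dict here just stands in for the on-disk state database DVC actually uses.)

import hashlib
import os

hash_cache = {}  # path -> (mtime_ns, size, md5); stand-in for DVC's state db

def cached_md5(path):
    st = os.stat(path)
    key = (st.st_mtime_ns, st.st_size)
    entry = hash_cache.get(path)
    if entry is not None and entry[:2] == key:
        return entry[2]  # mtime and size unchanged: reuse the stored hash
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    hash_cache[path] = (*key, digest)
    return digest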

Can you share the profiling result for data status? You can generate it using the --cprofile-dump= option with any DVC command (see Generating cProfile data).

dvc data status --cprofile-dump status.prof
dvc data status --cprofile-dump status-noop.prof

Please zip these files and drop them in the GitHub comments. I'll take a look.

If you are interested, you can also take a look at it with snakeviz.

pip install snakeviz
snakeviz status.prof
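
Or, staying in the terminal, the standard-library pstats module can summarize the same dump (a quick sketch; the filename matches the --cprofile-dump argument above):

import pstats

# Load the dump written by `dvc data status --cprofile-dump status.prof`
# and print the 20 most expensive call sites by cumulative time.
stats = pstats.Stats("status.prof")
stats.sort_stats("cumulative").print_stats(20)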

We are also tracking this with a benchmark named test_data_status in https://bench.dvc.org/latest.html, which uses 70k files and takes 2.3206s in the repeated case.

skshetry avatar Jun 09 '25 03:06 skshetry

profs.zip contains 4 .profs:

  • status{1,2}.prof: generated in a fresh clone of ctbk.dev, without running dvc pull:
    dvc data status --cprofile-dump status1.prof
    dvc data status --cprofile-dump status2.prof
    
  • status{1,2}-pulled.prof: generated after running dvc pull:
    dvc data status --cprofile-dump status1-pulled.prof
    dvc data status --cprofile-dump status2-pulled.prof
    

All 4 profiles look pretty similar in snakeviz:

status1.prof:

(snakeviz screenshot of status1.prof)

Thank you for looking into this; I'm happy to hear it's supposed to be fast!

It's weird that I see comparable slowness on [EC2 AMD Ubuntu] and [M1 ARM Macbook].

ryan-williams avatar Jun 15 '25 01:06 ryan-williams

Looks like you have ~290 import stages in your repository, each of which instantiates S3FileSystem eagerly (one filesystem per stage). That accounts for 48s of the total runtime.

Could you please try this patch and let me know what difference that makes?

pip install "dvc-s3 @ git+https://github.com/iterative/dvc-s3@no-instantiate-s3fs"

We should not be creating filesystems eagerly like this, and even if we did, we are supposed to cache them. I don't see any other issues in the profiling data, so file hashes are not being recomputed and are being reused from the cache.
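
The idea is roughly the following sketch: construct the filesystem lazily and memoize it by configuration. This is illustrative only, not the actual dvc-s3 patch; lru_cache and the profile parameter are just stand-ins.

from functools import lru_cache

import s3fs

@lru_cache(maxsize=None)
def get_s3_filesystem(profile=None):
    # Built lazily on first use; later calls with the same config reuse one
    # instance instead of constructing a new S3FileSystem per import stage.
    return s3fs.S3FileSystem(profile=profile)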

skshetry avatar Jun 15 '25 02:06 skshetry

Thanks, I ran 3x with that patch:

$ time dvc data status --cprofile-dump patch1.prof
No changes.

real	0m29.538s
user	0m25.206s
sys	0m1.425s
$ time dvc data status --cprofile-dump patch2.prof
No changes.

real	0m25.712s
user	0m24.057s
sys	0m1.174s
$ time dvc data status --cprofile-dump patch3.prof
No changes.

real	0m25.298s
user	0m23.649s
sys	0m1.163s

patch-profs.zip contains the 3 .profs. They all look roughly like:

(snakeviz screenshot)

So it seems like your patch sped things up ≈2-3x, but it's still taking 25-30s. Any other ideas?

ryan-williams avatar Jun 18 '25 16:06 ryan-williams

Thanks for the profiling data. At a quick glance, nothing stands out (except for YAML parsing); everything just seems generally slower.

Please make sure ruamel-yaml-clib is installed, as yaml parsing appears to be slower (~6s).
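
As a quick check that the C-accelerated parser is importable (a sketch; to my knowledge the ruamel.yaml.clib package provides the _ruamel_yaml extension module):

import importlib.util

# ruamel.yaml falls back to a pure-Python parser when its C extension is missing;
# this prints False if the accelerated module can't be found.
print("C extension available:", importlib.util.find_spec("_ruamel_yaml") is not None)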

I'll take a closer look at the profile later today.

skshetry avatar Jun 19 '25 02:06 skshetry

I have released a new version of dvc-s3==3.2.2.

skshetry avatar Jun 19 '25 12:06 skshetry

Thanks, confirmed I have latest ruamel.yaml.clib (0.2.12) installed.

On my M1 macbook (with dvc-s3==3.2.2), I saw:

$ time dvc data status --cprofile-dump m1-1.prof
...
real	0m17.951s
user	0m15.129s
sys	0m0.945s
$ time dvc data status --cprofile-dump m1-2.prof
...
real	0m16.887s
user	0m15.093s
sys	0m0.868s

I then tried removing the "import stages" (see branch prof), but it didn't seem to matter much:

$ git rm s3/tripdata/*.dvc
$ time dvc data status --cprofile-dump m1-no-imports1.prof
real	0m15.198s
user	0m13.369s
sys	0m0.785s
$ time dvc data status --cprofile-dump m1-no-imports2.prof
real	0m15.371s
user	0m13.394s
sys	0m0.802s

You mentioned:

Looks like you have ~290 import stages in your repository

There are 145 s3/tripdata/*.dvc files (half of 290), each generated by dvc import-url. Not sure why each counts as 2; just noting.

You should be able to reproduce any of the above, without dvc pulling:

git clone -b prof https://github.com/hudcostreets/ctbk.dev
cd ctbk.dev
pip install -e .
time dvc data status --cprofile-dump no-pull.prof

I'm seeing similar slowness on M1 macbook and EC2 AMD Ubuntu, so it seems relatively common/universal.

Here's a .zip with 8 profiles:

$ unzip -l profs2.zip
Archive:  profs2.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
   624565  06-19-2025 09:55   m1-1.prof
   624833  06-19-2025 09:55   m1-2.prof
   623204  06-19-2025 10:42   m1-no-imports1.prof
   622960  06-19-2025 10:42   m1-no-imports2.prof
   596595  06-19-2025 10:53   ec2-no-imports-no-pull1.prof
   594175  06-19-2025 10:53   ec2-no-imports-no-pull2.prof
   600119  06-19-2025 10:52   ec2-no-imports1.prof
   597880  06-19-2025 10:53   ec2-no-imports2.prof
---------                     -------
  4884331                     8 files

Thanks for your continued help here! The DVC data/versioning model seems awesome, I'd love to be able to use it at scale in my projects.

ryan-williams avatar Jun 19 '25 14:06 ryan-williams

There are 145 s3/tripdata/*.dvc files (half of 290), each generated by dvc import-url. Not sure why each counts as 2; just noting.

Right, DVC reads .dvc/dvc.yaml/dvc.lock files from both your current workspace and Git HEAD, which is why each one counts twice.

skshetry avatar Jun 19 '25 15:06 skshetry

@ryan-williams, how much time does it take without --cprofile-dump? I am wondering what that overhead is.

skshetry avatar Jun 20 '25 02:06 skshetry

I tested with the following commands:

git clone https://github.com/hudcostreets/ctbk.dev /tmp/ctbk.dev
time dvc --cd /tmp/ctbk.dev data status

And, I get a much better result with new dvc-s3 installed on my machine (M4):

________________________________________________________
Executed in    4.82 secs    fish           external
   usr time    4.06 secs    0.10 millis    4.06 secs
   sys time    0.42 secs    1.34 millis    0.42 secs

That’s from the first run. The second run is slightly faster, around 4.38s. On the prof branch, dvc data status is slightly faster.

However, when using --cprofile-dump, the runtime nearly doubles to about 9 seconds. I've attached the profiling data here:

profiling.zip

skshetry avatar Jun 20 '25 03:06 skshetry

What kind of machine are you running that on?

On M1 macbook, I'm now seeing 6-7s for time dvc data status, and on EC2 AMD Ubuntu 24.04 (ext4 FS) 13-15s.

Both cases corroborate that --cprofile-dump was halving speed. Still seems like there's an order of magnitude that should be possible to shave off. git status in the same repo takes .01-.1s, for comparison (I know it's doing less work, but still, "status" taking >>1s feels bad / interrupts flow).

ryan-williams avatar Jun 20 '25 14:06 ryan-williams

I'm seeing something similar running dvc status on a folder of ~200k Avro files; with multiple pipeline stages depending on this folder, the calculation is redone each time in dvc status. I don't want to depend on the archived version of the folder, since someone might mutate the unzipped one after the fact. Any workarounds or ideas for improving the algorithm?

petebachant avatar Jun 23 '25 12:06 petebachant

What kind of machine are you running that on?

On M1 macbook, I'm now seeing 6-7s for time dvc data status, and on EC2 AMD Ubuntu 24.04 (ext4 FS) 13-15s.

Both cases corroborate that --cprofile-dump was halving speed. Still seems like there's an order of magnitude that should be possible to shave off. git status in the same repo takes .01-.1s, for comparison (I know it's doing less work, but still, "status" taking >>1s feels bad / interrupts flow).

6–7s is a significant improvement over the ≈1-minute runtime from when the issue was created. While it doesn't meet the sub-second target you were hoping for, the profiling doesn't point to any immediate optimizations within the constraints of our architecture, SQLite, and Python performance.

I am closing this as completed for now.

skshetry avatar Jul 24 '25 11:07 skshetry

Thanks for your help here @skshetry!

ryan-williams avatar Jul 24 '25 14:07 ryan-williams