`dvc data status` takes ≈1min, even when nothing has changed
Bug Report
Description
hudcostreets/ctbk.dev contains 2,105 .parquet.dvc files (roughly: ≈15 datasets, each partitioned into ≈150 monthly files since 2013) totaling 62G.
dvc data status takes ≈1min (even on repeated incantations, when nothing has changed), apparently using just 1 thread. I also experimented with simple DVC pipelines, and some of those dvc commands also seemed to index the repo for ≈1min before doing anything.
Is this expected? I'd think DVC could observe file mtimes and skip files quickly on repeat invocations, at a minimum.
Reproduce
On an m6a.8xlarge instance with:
- Ubuntu 24.04 amd64 (ami-0731becbf832f281e)
- 384G EBS gp3 volume with ext4 FS
- Python 3.12.9
# Clone / Install
git clone https://github.com/hudcostreets/ctbk.dev
cd ctbk.dev
git checkout b787b184
pip install -e .
# Before `pull`ing:
time dvc data status
# real 1m10.727s
# user 1m1.733s
# sys 0m7.713s
time dvc data status
# real 1m10.522s
# user 1m1.988s
# sys 0m7.733s
dvc pull # ≈20mins, 62G downloaded to .dvc/cache + 62G applied in worktree
# After `pull`ing:
time dvc data status
# real 0m59.191s
# user 0m51.454s
# sys 0m7.089s
time dvc data status
# real 0m58.269s
# user 0m50.611s
# sys 0m7.263s
Expected
- `dvc data status` should be ≈instantaneous when invoked a 2nd time.
- If I've only changed one file, it should be correspondingly fast.
- Indexing(?) the first time should be multi-threaded (see the sketch below): `user` is ≈90% of `real` in the timings above, implying mostly single-threaded processing; monitoring in `htop` corroborates this.
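For illustration, hashing this many files is embarrassingly parallel. A generic sketch of what multi-threaded hashing could look like (my assumption, not DVC's actual code):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def md5_file(path: str) -> tuple[str, str]:
    """Hash one file in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return path, h.hexdigest()

def hash_all(paths: list[str], workers: int = 16) -> dict[str, str]:
    # File reads and hashlib.update() both release the GIL, so threads
    # provide real parallelism for this I/O-plus-hashing workload.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(md5_file, paths))
```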
Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 3.60.0 (pip)
-------------------------
Platform: Python 3.12.9 on Linux-6.8.0-1029-aws-x86_64-with-glibc2.39
Subprojects:
dvc_data = 3.16.10
dvc_objects = 5.1.1
dvc_render = 1.0.2
dvc_task = 0.40.2
scmrepo = 3.3.11
Supports:
http (aiohttp = 3.12.9, aiohttp-retry = 2.9.1),
https (aiohttp = 3.12.9, aiohttp-retry = 2.9.1),
s3 (s3fs = 2025.5.1, boto3 = 1.37.3),
ssh (sshfs = 2025.2.0)
Config:
Global: /home/ubuntu/.config/dvc
System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p1
Caches: local
Remotes: s3, ssh
Workspace directory: ext4 on /dev/nvme0n1p1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/c6ba21dd4999cafbf0b94985971d7666
Additional Information (if any):
I'm working on an EXT4 EBS volume:
cat /etc/fstab | column -t -s$'\t'
# LABEL=cloudimg-rootfs / ext4 discard,commit=30,errors=remount-ro 0 1
# LABEL=BOOT /boot ext4 defaults 0 2
# LABEL=UEFI /boot/efi vfat umask=0077 0 1
I also see similar performance on an M1 macbook APFS data volume.
xref https://github.com/iterative/dvc/issues/9428
> Is this expected? I'd think DVC could observe file mtimes and skip files quickly on repeat invocations, at a minimum.

That's not expected. DVC does cache hashes based on mtimes, as you suggested.
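Conceptually, the cache keys each file's hash by its (mtime, size), so an unchanged file is never re-read. A simplified sketch of the idea (not DVC's actual implementation):

```python
import hashlib
import os

def cached_md5(path: str, cache: dict) -> str:
    """Return path's md5, re-hashing only if (mtime, size) changed."""
    st = os.stat(path)
    key = (st.st_mtime_ns, st.st_size)
    entry = cache.get(path)
    if entry is not None and entry["key"] == key:
        return entry["md5"]  # unchanged since last run: no file read at all
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    cache[path] = {"key": key, "md5": h.hexdigest()}
    return cache[path]["md5"]
```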
Can you share the profiling results for `data status`? You can generate them using the `--cprofile-dump=` option with any DVC command (see Generating cProfile data).
dvc data status --cprofile-dump status.prof
dvc data status --cprofile-dump status-noop.prof
Please zip these files and drop it in the GitHub comments. I'll take a look.
If you are interested, you can also take a look at it with snakeviz.
pip install snakeviz
snakeviz status.prof
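If you'd rather stay in the terminal, the same dumps can be inspected with the standard-library pstats module:

```python
import pstats

# Print the 20 most expensive calls by cumulative time, the same
# view that snakeviz renders graphically.
stats = pstats.Stats("status.prof")
stats.sort_stats("cumulative").print_stats(20)
```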
We are also tracking this with a benchmark named test_data_status in https://bench.dvc.org/latest.html; with 70k files, it takes 2.3206s in the repeated case.
profs.zip contains 4 .profs:
- `status{1,2}.prof`: generated in a fresh clone of ctbk.dev, without running `dvc pull`:
  dvc data status --cprofile-dump status1.prof
  dvc data status --cprofile-dump status2.prof
- `status{1,2}-pulled.prof`: generated after running `dvc pull`:
  dvc data status --cprofile-dump status1-pulled.prof
  dvc data status --cprofile-dump status2-pulled.prof
All 4 profiles look pretty similar in snakeviz:
- ≈70s in `dvc.repo.data.Status.status`
- ≈30s in `dvc.repo.data._diff_index_to_wtree`
status1.prof: [snakeviz screenshot]
Thank you for looking into this, I'm happy to hear it's supposed to be fast!
It's weird that I see comparable slowness on [EC2 AMD Ubuntu] and [M1 ARM Macbook].
Looks like you have ~290 import stages in your repository, which instantiate S3FileSystem eagerly (one filesystem for each stage). That accounts for 48s of the total runtime.
Could you please try this patch and let me know what difference that makes?
pip install "dvc-s3 @ git+https://github.com/iterative/dvc-s3@no-instantiate-s3fs"
Here, we should not be creating filesystems eagerly like this, and even if we did, we are supposed to cache them. I don't see any other issues in the profiling data, so file hashes are not being recomputed; they are being reused from the cache.
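The pattern we want is roughly the following (an illustrative sketch with a hypothetical helper name, not dvc-s3's actual code):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_s3_filesystem(profile: str | None = None):
    # Build the expensive S3FileSystem only on first use, and share one
    # instance per distinct profile, instead of eagerly constructing one
    # filesystem per import stage.
    import s3fs
    return s3fs.S3FileSystem(profile=profile)
```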
Thanks, I ran 3x with that patch:
$ time dvc data status --cprofile-dump patch1.prof
No changes.
real 0m29.538s
user 0m25.206s
sys 0m1.425s
$ time dvc data status --cprofile-dump patch2.prof
No changes.
real 0m25.712s
user 0m24.057s
sys 0m1.174s
$ time dvc data status --cprofile-dump patch3.prof
No changes.
real 0m25.298s
user 0m23.649s
sys 0m1.163s
patch-profs.zip contains the 3 .profs. They all look roughly similar in snakeviz: [screenshot]
So it seems like your patch sped things up ≈2-3x, but it's still taking 25-30s. Any other ideas?
Thanks for the profiling data. At a quick glance, nothing stands out (except for YAML parsing); everything just seems generally slower.
Please make sure ruamel-yaml-clib is installed, as YAML parsing appears to be slower than expected (~6s).
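One quick way to verify the C extension is actually in use (assuming the clib package exposes the _ruamel_yaml module, as current versions do):

```python
# ruamel.yaml's fast path imports CParser/CEmitter from the
# _ruamel_yaml C extension shipped by ruamel-yaml-clib.
try:
    from _ruamel_yaml import CParser, CEmitter  # noqa: F401
    print("C extension available: fast YAML parsing")
except ImportError:
    print("C extension missing: falling back to slow pure-Python parsing")
```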
I'll take a closer look at the profile later today.
I have released a new version of dvc-s3==3.2.2.
Thanks, confirmed I have the latest ruamel.yaml.clib (0.2.12) installed.
On my M1 macbook (with dvc-s3==3.2.2), I saw:
$ time dvc data status --cprofile-dump m1-1.prof
...
real 0m17.951s
user 0m15.129s
sys 0m0.945s
$ time dvc data status --cprofile-dump m1-2.prof
...
real 0m16.887s
user 0m15.093s
sys 0m0.868s
I then tried removing the "import stages" (see branch prof), but it didn't seem to matter much:
$ git rm s3/tripdata/*.dvc
$ time dvc data status --cprofile-dump m1-no-imports1.prof
real 0m15.198s
user 0m13.369s
sys 0m0.785s
$ time dvc data status --cprofile-dump m1-no-imports2.prof
real 0m15.371s
user 0m13.394s
sys 0m0.802s
You mentioned:

> Looks like you have ~290 import stages in your repository

There are 145 `s3/tripdata/*.dvc` files (half of 290), each generated by `dvc import-url`. Not sure why each counts as 2, just noting.
You should be able to reproduce any of the above, without dvc pulling:
git clone -b prof https://github.com/hudcostreets/ctbk.dev
cd ctbk.dev
pip install -e .
time dvc data status --cprofile-dump no-pull.prof
I'm seeing similar slowness on M1 macbook and EC2 AMD Ubuntu, so it seems relatively common/universal.
Here's a .zip with 8 profiles:
$ unzip -l profs2.zip
Archive: profs2.zip
Length Date Time Name
--------- ---------- ----- ----
624565 06-19-2025 09:55 m1-1.prof
624833 06-19-2025 09:55 m1-2.prof
623204 06-19-2025 10:42 m1-no-imports1.prof
622960 06-19-2025 10:42 m1-no-imports2.prof
596595 06-19-2025 10:53 ec2-no-imports-no-pull1.prof
594175 06-19-2025 10:53 ec2-no-imports-no-pull2.prof
600119 06-19-2025 10:52 ec2-no-imports1.prof
597880 06-19-2025 10:53 ec2-no-imports2.prof
--------- -------
4884331 8 files
Thanks for your continued help here! The DVC data/versioning model seems awesome, I'd love to be able to use it at scale in my projects.
> There are 145 `s3/tripdata/*.dvc` files (half of 290), each generated by `dvc import-url`. Not sure why each counts as 2, just noting.
Right, DVC reads .dvc/dvc.yaml/dvc.lock files from your current workspace and from Git HEAD.
@ryan-williams, how much time does it take without --cprofile-dump? I am wondering what that overhead is.
I tested with the following commands:
git clone https://github.com/hudcostreets/ctbk.dev /tmp/ctbk.dev
time dvc --cd /tmp/ctbk.dev data status
And I get a much better result with the new dvc-s3 installed on my machine (M4):
________________________________________________________
Executed in 4.82 secs fish external
usr time 4.06 secs 0.10 millis 4.06 secs
sys time 0.42 secs 1.34 millis 0.42 secs
That’s from the first run. The second run is slightly faster, around 4.38s.
On the prof branch, dvc data status is slightly faster.
However, when using --cprofile-dump, the runtime nearly doubles to about 9 seconds. I've attached the profiling data here:
What kind of machine are you running that on?
On M1 macbook, I'm now seeing 6-7s for time dvc data status, and on EC2 AMD Ubuntu 24.04 (ext4 FS) 13-15s.
Both cases corroborate that --cprofile-dump was halving speed. Still seems like there's an order of magnitude that should be possible to shave off. git status in the same repo takes .01-.1s, for comparison (I know it's doing less work, but still, "status" taking >>1s feels bad / interrupts flow).
I'm seeing something similar running dvc status with a folder of ~200k Avro files. With multiple pipeline stages depending on this folder, the calculation is redone each time in dvc status. I don't want to depend on the archived version of the folder, since someone might mutate the unzipped one after the fact. Any workarounds or ideas for improving the algorithm?
> What kind of machine are you running that on?
>
> On M1 macbook, I'm now seeing 6-7s for `time dvc data status`, and on EC2 AMD Ubuntu 24.04 (ext4 FS) 13-15s.
>
> Both cases corroborate that `--cprofile-dump` was halving speed. Still seems like there's an order of magnitude that should be possible to shave off. `git status` in the same repo takes .01-.1s, for comparison (I know it's doing less work, but still, "status" taking >>1s feels bad / interrupts flow).
6–7s is a significant improvement over the original 1-minute runtime when the issue was created. While it doesn’t meet the sub-second target you were hoping for, profiling doesn’t show any immediate optimizations given the constraints of our architecture, SQLite, and Python performance.
I am closing this as completed for now.
Thanks for your help here @skshetry!