DVC does not cache pipeline outputs properly
Bug Report
repro: doesn't cache outputs properly with a reflink setup.
Description
I have 4 pipelines that transform the same input dataset for different tasks. The images are processed the same way, and `cache.type` is set to `reflink`. So, according to the documentation, there should be only one copy of the output images. But this is not the case: none of the pipeline outputs were reflinked to the cached files.
If I run `dvc checkout -R --relink` after the pipelines have executed, then the disk usage behaves normally.
The output of `btrfs fi du -s .` right after `dvc repro`:

```
Total     Exclusive  Set shared  Filename
90.50GiB  31.21GiB   29.64GiB    .
```

The output of `btrfs fi du -s .` right after `dvc checkout -R --relink`:

```
Total     Exclusive  Set shared  Filename
90.50GiB  1.07GiB    29.83GiB    .
```
Reproduce
Expected
Environment information
Output of `dvc doctor`:

```
$ dvc doctor
DVC version: 3.55.2 (deb)
-------------------------
Platform: Python 3.10.8 on Linux-6.7.10-060710-generic-x86_64-with-glibc2.39
Subprojects:
Supports:
	azure (adlfs = 2024.7.0, knack = 0.12.0, azure-identity = 1.17.1),
	gdrive (pydrive2 = 1.20.0),
	gs (gcsfs = 2024.6.1),
	hdfs (fsspec = 2024.6.1, pyarrow = 17.0.0),
	http (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
	oss (ossfs = 2023.12.0),
	s3 (s3fs = 2024.6.1, boto3 = 1.35.7),
	ssh (sshfs = 2024.6.0),
	webdav (webdav4 = 0.10.0),
	webdavs (webdav4 = 0.10.0),
	webhdfs (fsspec = 2024.6.1)
Config:
	Global: /home/fkwong/.config/dvc
	System: /etc/xdg/xdg-ubuntu/dvc
Cache types: reflink, hardlink, symlink
Cache directory: btrfs on /dev/sda1
Caches: local
Remotes: None
Workspace directory: btrfs on /dev/sda1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/055b5579042ae6f272efc40fc232cbdd
```
Additional Information (if any):
Can you try removing `hardlink` and `symlink` from the `cache.type` config? You can also remove the `cache.type` config entirely, as `reflink,copy` is the default.
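For reference, this is what the suggested config would look like in `.dvc/config`; a minimal sketch, assuming you want to pin the default explicitly rather than delete the section (leaving the section out entirely has the same effect):

```ini
; .dvc/config -- keep only reflink with copy as the fallback;
; omitting cache.type altogether also gives this default behavior
[cache]
    type = "reflink,copy"
```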
It'd be great if you could debug and see why it's not being reflinked, by adding a breakpoint here in dvc-objects:
https://github.com/iterative/dvc-objects/blob/716dba66f1687162f12ec85b08959196709111e0/src/dvc_objects/fs/generic.py#L337
You can also try the following snippet and see if the files get reflinked:

```python
from dvc.fs import LocalFileSystem

fs = LocalFileSystem()
fs.reflink("existing-file", "cloned-file")
```
@skshetry I've almost found the reason. When multiple pipelines create the same output, only the first one gets reflinked.
Here is a screenshot of 5 pipelines operating on a one-image dataset. I used `filefrag` to check the status of the output images: you can see that only the first file is marked as shared.
I was able to create a minimal reproduction project.
- A clean project:

  ```
  Total    Exclusive  Set shared  Filename
  1.83MiB  1.83MiB    0.00B       .
  ```

- Add the raw image with `dvc add data/raw/input/testing.jpg`:

  ```
  Total    Exclusive  Set shared  Filename
  3.64MiB  16.00KiB   1.81MiB     .
  ```

- Run the pipelines for the first time with `dvc repro`:

  ```
  Total     Exclusive  Set shared  Filename
  12.70MiB  9.08MiB    1.81MiB     .
  ```

- Check with `filefrag`:

  ```
  Filesystem type is: 9123683e
  File size of data/prepared/myrepo-one/testing.jpg is 1897127 (464 blocks of 4096 bytes)
   ext:     logical_offset:        physical_offset: length:   expected: flags:
     0:        0..     463:  640195273.. 640195736:    464:             last,eof
  data/prepared/myrepo-one/testing.jpg: 1 extent found
  Filesystem type is: 9123683e
  File size of data/prepared/myrepo-two/testing.jpg is 1897127 (464 blocks of 4096 bytes)
   ext:     logical_offset:        physical_offset: length:   expected: flags:
     0:        0..     463:  640317096.. 640317559:    464:             last,eof
  data/prepared/myrepo-two/testing.jpg: 1 extent found
  Filesystem type is: 9123683e
  File size of data/prepared/myrepo-three/testing.jpg is 1897127 (464 blocks of 4096 bytes)
   ext:     logical_offset:        physical_offset: length:   expected: flags:
     0:        0..     463:  640213630.. 640214093:    464:             last,eof
  data/prepared/myrepo-three/testing.jpg: 1 extent found
  Filesystem type is: 9123683e
  File size of data/prepared/myrepo-four/testing.jpg is 1897127 (464 blocks of 4096 bytes)
   ext:     logical_offset:        physical_offset: length:   expected: flags:
     0:        0..     463:  640287558.. 640288021:    464:             last,eof
  data/prepared/myrepo-four/testing.jpg: 1 extent found
  Filesystem type is: 9123683e
  File size of data/prepared/myrepo-five/testing.jpg is 1897127 (464 blocks of 4096 bytes)
   ext:     logical_offset:        physical_offset: length:   expected: flags:
     0:        0..     463:  640195737.. 640196200:    464:             last,eof
  data/prepared/myrepo-five/testing.jpg: 1 extent found
  ```

- Run `dvc checkout -R --relink`:

  ```
  Total     Exclusive  Set shared  Filename
  12.70MiB  16.00KiB   1.81MiB     .
  ```
The reason the first file in the screenshot is shared while all the files in the reproduction project are not is that the pipeline in the screenshot changed the input image. So we can conclude that if an output file is already in the DVC cache, DVC won't create a reflink from the cached version to the workspace version.
I think this is due to a relink optimization that I did recently for checkout (which is used during repro): https://github.com/iterative/dvc-data/pull/548.
DVC looks at the file in the workspace and tries to determine whether it needs to relink based on the cache types. So, for example, if a file is not a symlink and you have `cache_type = symlink` set, it will have to relink via symlink.
But DVC does not have a way to determine whether a file is reflinked or not. So it leaves it as-is in the workspace, which saves us from doing a checkout, which can be expensive.
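To illustrate this asymmetry, here is a small sketch (not DVC's actual code) of why symlinks and hardlinks are detectable from the workspace with ordinary `stat()` calls, while a reflink is indistinguishable from a plain copy at that level:

```python
import os
import shutil
import tempfile


def link_kind(workspace_path: str, cache_path: str) -> str:
    """Best-effort guess at how a workspace file relates to its cache copy.

    Symlinks and hardlinks leave a trace visible to lstat()/stat();
    a reflink shares extents inside the filesystem and looks exactly
    like an independent copy here.
    """
    if os.path.islink(workspace_path):
        return "symlink"
    ws = os.stat(workspace_path)
    ca = os.stat(cache_path)
    if (ws.st_ino, ws.st_dev) == (ca.st_ino, ca.st_dev):
        return "hardlink"
    return "copy-or-reflink"  # cannot tell these apart without e.g. FIEMAP


# Demo: one "cache object" and three kinds of workspace files.
d = tempfile.mkdtemp()
cache = os.path.join(d, "cache-obj")
with open(cache, "w") as f:
    f.write("data")

hard = os.path.join(d, "hard")
os.link(cache, hard)
sym = os.path.join(d, "sym")
os.symlink(cache, sym)
copy = os.path.join(d, "copy")
shutil.copy(cache, copy)

print(link_kind(hard, cache))  # hardlink
print(link_kind(sym, cache))   # symlink
print(link_kind(copy, cache))  # copy-or-reflink
```

This is why the relink optimization can safely skip files under `symlink`/`hardlink` cache types but has to guess for `reflink`.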
If you are worried about storage, I think `dvc checkout --relink` is the correct fix.
`filefrag` is able to check whether a file is reflinked or not.
I have a proposed solution. When `cache.type = reflink`, DVC could perform a checkout-with-touch when the workspace version equals the cached version: that is, make a reflink from the cache to the workspace and update the timestamp of the cache. In that case, the timestamp of the cached version should always be newer than the workspace one. The pseudocode would be:
```
if cache.type == reflink:
    if md5sum(cached file) == md5sum(workspace file):
        if timestamp(cached file) older than timestamp(workspace file):
            create reflink from cache to workspace, then touch the cache to update its timestamp
        else:
            do nothing
    else:
        create reflink from cache to workspace, then touch the cache to update its timestamp
```
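A runnable sketch of this proposal follows. The names `do_reflink` and `checkout_with_touch` are mine, not DVC's, and the reflink itself is stood in by a plain copy so the example runs on any filesystem (a real implementation would use a `FICLONE` ioctl on btrfs/XFS):

```python
import hashlib
import os
import shutil


def file_md5(path: str) -> str:
    """Hash a file in chunks, like a content-addressed cache would."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def do_reflink(src: str, dst: str) -> None:
    # Stand-in for a real reflink (FICLONE ioctl); a plain copy keeps
    # this sketch runnable everywhere.
    shutil.copy(src, dst)


def checkout_with_touch(cached: str, workspace: str) -> bool:
    """Relink the workspace file from the cache unless the cache's newer
    timestamp indicates a relink already happened.

    Returns True if a relink was performed, False if it was skipped.
    """
    same_content = file_md5(cached) == file_md5(workspace)
    cache_is_newest = os.path.getmtime(cached) >= os.path.getmtime(workspace)
    if same_content and cache_is_newest:
        return False  # cache was touched after the last relink: nothing to do
    do_reflink(cached, workspace)
    os.utime(cached)  # "touch" the cache so its mtime is now the newest
    return True
```

The invariant is that after every relink the cache object's mtime is bumped, so a cache mtime older than the workspace mtime signals that the workspace copy was (re)created independently and needs relinking.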
Besides, I don't think this is an issue that can be ignored. Even if no multiple pipelines generate the same output, if a user updates an existing pipeline to generate a new output where most of the files are the same as those in the cache, all those files will be duplicated between the cache and the workspace.
I may be open to some config to force-relink. Any thoughts @dberenbaum, @shcheklein?
Just to clarify and better understand things first, folks, a few questions:

> `filefrag` is able to check whether a file is reflinked or not.

Do we know how it does this? Is it FS-specific, or is there a general syscall that can do it? Is it expensive or not?
@skshetry if we had an `isreflink` call, would that help? (I assume it would, right?)
> So, it leaves it as-is in the workspace, which saves us from doing checkout which can be expensive.

Could you clarify a bit: is it expensive because we would do a full output checkout (all files), since we can't detect the difference?
We still traverse and check the link type, right? Would it be the same or less expensive, in the case of reflinks specifically, to force a relink right away without doing those checks?
FYI, https://github.com/tytso/e2fsprogs/blob/950a0d69c82b585aba30118f01bf80151deffe8c/misc/filefrag.c#L269 is the line where `filefrag` reads the file's extent flags.
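As an interim stand-in for an `isreflink` call, one could shell out to `filefrag -v` and look for the `shared` flag in its extent list. A sketch of the parsing side; the helper name is mine, and the format assumed is the `filefrag -v` output shown earlier in this thread:

```python
def has_shared_extents(filefrag_output: str) -> bool:
    """Return True if any extent row in `filefrag -v` output carries the
    'shared' flag, i.e. the file shares (reflinks) extents with another
    file on a filesystem such as btrfs.

    In practice the text would come from something like:
        subprocess.run(["filefrag", "-v", path],
                       capture_output=True, text=True).stdout
    """
    for line in filefrag_output.splitlines():
        parts = line.split()
        # Extent rows start with an extent index such as "0:".
        if parts and parts[0].rstrip(":").isdigit():
            flags = line.rsplit(":", 1)[-1]  # text after the last column
            if "shared" in flags:
                return True
    return False
```

Note that the flag is only reported on filesystems whose FIEMAP implementation sets `FIEMAP_EXTENT_SHARED`, so this check is FS-specific in practice.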