dvc: QA local/nfs/cifs/overlay/etc
When working with local repos, dvc uses a lot of `exists()` and `stat()` calls, which is fine on normal filesystems but can be extremely slow on filesystems like nfs/cifs/etc, where `stat()` can easily be 100x slower (a rough timing sketch follows the link below). We should look into conserving such calls, similar to what we do with remotes, where we know that an API call to s3/gdrive/etc is pretty slow, so we try to make as few calls as possible.
- https://discord.com/channels/485586884165107732/728693131557732403/818168922393673738
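To make the magnitude concrete, here is a rough, hypothetical illustration (the paths and file names are made up, and real numbers depend heavily on mount options and network latency): time the same number of `stat()` calls on a local filesystem and on an NFS mount, where each call may be a network round trip.

```bash
# Hypothetical paths; assumes file_1..file_1000 exist in both locations.
# On the NFS mount each stat() may require a network round trip, so the
# second loop can easily be one or two orders of magnitude slower.
time for i in $(seq 1 1000); do stat "/local/data/file_$i" > /dev/null; done
time for i in $(seq 1 1000); do stat "/mnt/nfs/data/file_$i" > /dev/null; done
```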
Another issue with nfs/cifs/overlay is problems with the sqlite databases that we use for local optimization, e.g. freezing locks: https://github.com/iterative/dvc/issues/4420. Dvc itself is also often affected by this, which is why we provide an option to use `flufl.lock` as the repo lock instead of the flock-based lock. For sqlite that is not an option, so a solution might be to place it somewhere on a normal filesystem (e.g. /tmp is usually fine).
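For reference, the flock-free repo lock mentioned above is enabled via the `core.hardlink_lock` config option; a minimal sketch:

```bash
# Switch the repo lock to the hardlink-based (flufl.lock) implementation,
# which avoids flock and its known problems on NFS/CIFS mounts.
dvc config core.hardlink_lock true
```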
Idk if this is the right issue or if we need another one but users are reporting DVC 2.0 to be very slow with basic operations. Sounds like something to prioritize perhaps?
Examples (both using APFS):
- https://discord.com/channels/485586884165107732/563406153334128681/819272964926865451
- https://discord.com/channels/485586884165107732/563406153334128681/819291352244289548
Also @dberenbaum @efiop @shcheklein WDYT about establishing some non-functional requirements such as basic operation speeds and designing some way to test them at least for major and minor releases (no patches)? Thanks
Sorry, that should definitely be a separate issue.
@jorgeorpinel Just for the record: Both links are not related to this issue.
> Also @dberenbaum @efiop @shcheklein WDYT about establishing some non-functional requirements such as basic operation speeds and designing some way to test them at least for major and minor releases (no patches)? Thanks
We have dvc-bench that tests scenarios end-to-end, we are filling it with stuff from time to time.
Looking forward to this feature...
https://discord.com/channels/485586884165107732/563406153334128681/821510730565025823
Any update on this issue? I see it's been delayed a few times.
Hey @brbarkley , thanks for the interest! We are currently actively working on improving the internals in order for them to accommodate the optimizations that will be the result of this QA. So far ETA for the active phase of this ticket is around the beginning of June.
For the record, from Discord:
> We have a shared DVC cache on a NAS (NFS mount). We’re running into a problem where DVC operations like checkout, commit, etc. are taking a very long time (in some cases days), seemingly due to hashing. I remember reading in some ticket that DVC cache ops are not optimized for network mounts. What are your current plans regarding this situation? DVC cache is not feasible to store outside of NAS due to its size.
@efiop Do you have an updated ETA? I was having another issue described here that turned into this issue after upgrading to 2.3.0 on Friday. My team is doing a large amount of restructuring of storage for our projects in 2 weeks and I'd like to have a plan or workaround when we start. Is it likely that a version of DVC that works in this type of environment will be released in that time frame?
Hi @jmblackmer . Thank you for your interest! As you've noticed, thanks to @agurtovoy we have #6111 :pray: that solves those particular issues. I can't give an ETA on it yet, as there are a few questions that we need to discuss there first. Just to make it clear, #6111 solves the issues, but we just want to make sure that we are providing the most seamless solution or have a clear plan to improve it in the future.
Hi @efiop , thanks for the great tool you guys are building. I have a question: from my recent tests, it works fine if I put the cache or remote storage in a mounted NFS volume, but if the workspace folder is in an NFS volume, `dvc add` will be super slow, almost frozen, and if I terminate it, some data will be deleted and lost. I just want to confirm here that this is expected behavior up to version 2.5.0? If so, is there any hack to mitigate this issue right now? Thanks.
Hi @ZhengRui That sounds correct, yes. Using nfs only for the cache (or as a local remote) is the workaround (sketched below). The seeming data loss is just dvc not transferring everything completely; the files should be intact in `.dvc/cache` even if they are no longer present in the workspace. We'll be improving the rollback mechanism in the near future, so stay tuned :slightly_smiling_face:
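A minimal sketch of that workaround (the mount paths are hypothetical): keep the workspace on a local filesystem and point only the cache, or a local remote, at the NFS share.

```bash
# Option A: shared DVC cache lives on the NFS mount, workspace stays local.
dvc cache dir /mnt/nfs/dvc-cache

# Option B: use the NFS mount as a "local" remote instead of a shared cache.
dvc remote add -d nas /mnt/nfs/dvc-storage
```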
Thanks @efiop , looking forward to the new features :) keep up the great work !
Seeking clarification on status of this issue. From above, @efiop writes:
> As you've noticed, thanks to @agurtovoy we have #6111 🙏 that solves those particular issues. I can't give an ETA on it yet, as there are a few questions that we need to discuss there first. Just to make it clear, #6111 solves the issues, but we just want to make sure that we are providing the most seamless solution or have a clear plan to improve it in the future.
I see PR #6111 was closed in favor of PR #6419, which was merged on 9/18/21. Does that mean this issue #5562 is resolved in the v2.8.0 release on 10/11/21?
Thanks!
@brbarkley Correct. This particular issue is about other things too, like optimizing, which is on our todo list.
Thanks @efiop. So, is having the shared cache and working directory both on NFS now possible with v2.8?
@brbarkley It is, but it is not great in terms of UI/performance. You need to use these config options: `core.hardlink_lock`, `state.dir` (should point to a normal filesystem, e.g. `/tmp/something`) and `index.dir` (same requirements as for `state.dir`). We plan on improving this in future versions.
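A short sketch of that configuration (the `/tmp/...` paths are just examples; any directory on a normal, non-network filesystem will do):

```bash
# Repo lock that tolerates NFS/CIFS:
dvc config core.hardlink_lock true

# Keep the sqlite state and index databases off the network mount.
# Example paths only -- pick any location on a local filesystem:
dvc config state.dir /tmp/dvc-state
dvc config index.dir /tmp/dvc-index
```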
ok, thanks for those tips. I will test it out and see whether it's worth bumping to v2.8 or staying at v1.11...
@efiop looks like it's best to stay at v1.11 if the shared cache and workspace are both on nfs/cifs. At v2.8, it took about 5 minutes to compute the hash of a directory with 100 files, whereas it took a few seconds on v1.11.
Thanks again for your continued work! Hope I can move to v2.X soon...
That's what I meant :( We are working hard on pre-requisites, stay tuned.
Two related issues on discord:
- https://discord.com/channels/485586884165107732/485596304961962003/932563413777924096
- https://discord.com/channels/485586884165107732/485596304961962003/932663444329615420
Happy to see that there has been some progress on this. This unfortunately makes dvc pretty much unusable for me at the moment, except for small-data projects.
Also @brbarkley , 5 mins sounds great... I have been waiting over an hour for dvc to compute hashes on a directory with 1000+ files....
Upgrading to DVC 2.9.x and adding the config options suggested above seems to resolve this.
Hi! What is the latest news on this issue? Is it fixed? Updating and using the suggested config options fixed the `dvc init` error but it is still very slow.
> Hi! What is the latest news on this issue? Is it fixed? Updating and using the suggested config options fixed the `dvc init` error but it is still very slow.
We have made improvements across several versions, but it is still not fully solved for now.
Could you please provide some more details? E.g. the info from `dvc doctor`.
@karajan1001 of course:
```
dvc doctor
DVC version: 2.9.3 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.13.0-39-generic-x86_64-with-glibc2.29
Supports:
        hdfs (fsspec = 2022.1.0, pyarrow = 7.0.0),
        webhdfs (fsspec = 2022.1.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.1.0, boto3 = 1.20.24)
```
Maybe I have unrealistic time expectations, but how long should I expect `dvc add` to take for a dataset of 139 GB with 1100 files?
@EyescannerJE Could you please run `dvc doctor` inside of a dvc repository? It will have additional info about your setup.
Regarding the timing, it highly depends on the particular setup. We don't yet have benchmarks for NFS (or other similar setups), so I can't even give you an approximate estimate. We currently only have regular benchmarks for a ~20K-image dataset on a regular local filesystem; the results can be seen here: https://docs.iterative.ai/dvc-bench/ We will be introducing bigger datasets to our benchmarks in the future: https://github.com/iterative/dvc-bench/issues/306
@efiop I see, here is the output of `dvc doctor` from inside the dvc repository:
```
DVC version: 2.9.3 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.13.0-39-generic-x86_64-with-glibc2.29
Supports:
        hdfs (fsspec = 2022.1.0, pyarrow = 7.0.0),
        webhdfs (fsspec = 2022.1.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.1.0, boto3 = 1.20.24)
Cache types: hardlink
Cache directory: cifs on //192.168.50.122/data
Caches: local
Remotes: local
Workspace directory: cifs on //192.168.50.122/data
Repo: dvc, git
```
Some benchmarks for NAS setups would be very much appreciated! Adding the 139 GB of data took 5 hours for me
I was trying out the `state.dir` and `index.dir` options and this does indeed solve the performance issue. However, I was wondering about another issue... does this mean that I now have to do all my `dvc checkout` operations from the same machine (as /tmp is not shared across a cluster)? Is there a workaround for this?
Thanks!
> I was trying out the `state.dir` and `index.dir` options and this does indeed solve the performance issue. However, I was wondering about another issue... does this mean that I now have to do all my `dvc checkout` operations from the same machine (as /tmp is not shared across a cluster)? Is there a workaround for this? Thanks!
Hi, @hhoeflin , these files are only local database caches (they store some metadata), not the file cache, so they will not affect your `checkout` operations on other machines.
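If it helps for cluster setups, one possible arrangement (my assumption, not an official recommendation) is to keep these paths in DVC's machine-local config, so each node points at its own scratch directory:

```bash
# Hypothetical per-machine paths; --local writes to .dvc/config.local,
# which is not tracked by git, so each machine can use its own directory.
dvc config --local state.dir /tmp/$USER/dvc-state
dvc config --local index.dir /tmp/$USER/dvc-index
```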
@karajan1001 Thanks for your answer. However, https://dvc.org/doc/user-guide/project-structure/internal-files says:
> `.dvc/tmp/links`: This directory is used to clean up your workspace when calling [dvc checkout](https://dvc.org/doc/command-reference/checkout). It contains a SQLite state database that stores a list of file links created by DVC (from cache to workspace).
This would indicate that when doing a checkout from another machine, the cleanup would not work correctly, no?
And similarly on the same machine, if /tmp gets deleted after a while, the cleanup of checkout would be affected as well?