dvc: QA local/nfs/cifs/overlay/etc
When working with local repos, dvc uses a lot of `exists()` and `stat()` calls, which is fine on normal filesystems but can be extremely slow on filesystems like nfs/cifs/etc, where `stat()` can easily be 100x slower (a rough timing sketch follows the link below). We should look into conserving such calls, similar to what we do with remotes, where we know that an API call to s3/gdrive/etc is pretty slow, so we try to make as few calls as possible.
- https://discord.com/channels/485586884165107732/728693131557732403/818168922393673738
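To make the magnitude concrete, here is a rough, hypothetical illustration (the paths and file names are made up, and real numbers depend heavily on mount options and network latency): time the same number of `stat()` calls on a local filesystem and on an NFS mount, where each call may be a network round trip.

```bash
# Hypothetical paths; assumes file_1..file_1000 exist in both locations.
# On the NFS mount each stat() may require a network round trip, so the
# second loop can easily be one or two orders of magnitude slower.
time for i in $(seq 1 1000); do stat "/local/data/file_$i" > /dev/null; done
time for i in $(seq 1 1000); do stat "/mnt/nfs/data/file_$i" > /dev/null; done
```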
Another issue with nfs/cifs/overlay is problems with the sqlite databases that we use for local optimization, e.g. freezing locks: https://github.com/iterative/dvc/issues/4420. Dvc itself is also often affected by this, which is why we provide an option to use `flufl.lock` as the repo lock instead of the flock-based lock. For sqlite that is not an option, so a solution might be to place it somewhere on a normal filesystem (e.g. /tmp is usually fine).
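For reference, the flock-free repo lock mentioned above is enabled via the `core.hardlink_lock` config option; a minimal sketch:

```bash
# Switch the repo lock to the hardlink-based (flufl.lock) implementation,
# which avoids flock and its known problems on NFS/CIFS mounts.
dvc config core.hardlink_lock true
```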
Idk if this is the right issue or if we need another one but users are reporting DVC 2.0 to be very slow with basic operations. Sounds like something to prioritize perhaps?
Examples (both using APFS):
- https://discord.com/channels/485586884165107732/563406153334128681/819272964926865451
- https://discord.com/channels/485586884165107732/563406153334128681/819291352244289548
Also @dberenbaum @efiop @shcheklein WDYT about establishing some non-functional requirements such as basic operation speeds and designing some way to test them at least for major and minor releases (no patches)? Thanks
Sorry, that should definitely be a separate issue.
@jorgeorpinel Just for the record: Both links are not related to this issue.
> Also @dberenbaum @efiop @shcheklein WDYT about establishing some non-functional requirements such as basic operation speeds and designing some way to test them at least for major and minor releases (no patches)? Thanks
We have dvc-bench that tests scenarios end-to-end, we are filling it with stuff from time to time.
Looking forward to this feature...
https://discord.com/channels/485586884165107732/563406153334128681/821510730565025823
Any update on this issue? I see it's been delayed a few times.
Hey @brbarkley , thanks for the interest! We are currently actively working on improving the internals in order for them to accommodate the optimizations that will be the result of this QA. So far ETA for the active phase of this ticket is around the beginning of June.
For the record, from Discord:
> We have a shared DVC cache on a NAS (NFS mount). We’re running into a problem where DVC operations like checkout, commit, etc. are taking a very long time (in some cases days), seemingly due to hashing. I remember reading in some ticket that DVC cache ops are not optimized for network mounts. What are your current plans regarding this situation? DVC cache is not feasible to store outside of NAS due to its size.
@efiop Do you have an updated ETA? I was having another issue described here that turned into this issue after upgrading to 2.3.0 on Friday. My team is doing a large amount of restructuring of storage for our projects in 2 weeks and I'd like to have a plan or workaround when we start. Is it likely that a version of DVC that works in this type of environment will be released in that time frame?
Hi @jmblackmer . Thank you for your interest! As you've noticed, thanks to @agurtovoy we have #6111 :pray: that solves those particular issues. I can't give an ETA on it yet, as there are a few questions that we need to discuss there first. Just to make it clear, #6111 solves the issues, but we just want to make sure that we are providing the most seamless solution or have a clear plan to improve it in the future.
Hi @efiop , thanks for the great tool you guys are building. I have a question: from my recent tests, it works fine if I put the cache or remote storage in a mounted NFS volume, but if the workspace folder is in an NFS volume, `dvc add` will be super slow, almost frozen, and if I terminate it, some data will be deleted and lost. I just want to confirm here that this is expected behavior up to version 2.5.0? If so, is there any hack to mitigate this issue right now? Thanks.
Hi @ZhengRui That sounds correct, yes. Using nfs only for the cache (or as a local remote) is the workaround (sketched below). The seeming data loss is just dvc not transferring everything completely; the files should be intact in `.dvc/cache` even if they are no longer present in the workspace. We'll be improving the rollback mechanism in the near future, so stay tuned :slightly_smiling_face:
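A minimal sketch of that workaround (the mount paths are hypothetical): keep the workspace on a local filesystem and point only the cache, or a local remote, at the NFS share.

```bash
# Option A: shared DVC cache lives on the NFS mount, workspace stays local.
dvc cache dir /mnt/nfs/dvc-cache

# Option B: use the NFS mount as a "local" remote instead of a shared cache.
dvc remote add -d nas /mnt/nfs/dvc-storage
```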
Thanks @efiop , looking forward to the new features :) keep up the great work !
Seeking clarification on status of this issue. From above, @efiop writes:
> As you've noticed, thanks to @agurtovoy we have #6111 🙏 that solves those particular issues. I can't give an ETA on it yet, as there are a few questions that we need to discuss there first. Just to make it clear, #6111 solves the issues, but we just want to make sure that we are providing the most seamless solution or have a clear plan to improve it in the future.
I see PR #6111 was closed in favor of PR #6419, which was merged on 9/18/21. Does that mean this issue #5562 is resolved in the v2.8.0 release on 10/11/21?
Thanks!
@brbarkley Correct. This particular issue is about other things too, like optimizing, which is on our todo list.
Thanks @efiop. So, is having the shared cache and working directory both on NFS now possible with v2.8?
@brbarkley It is, but it is not great in terms of UI/performance. You need to use these config options: `core.hardlink_lock`, `state.dir` (should point to a normal filesystem, e.g. `/tmp/something`) and `index.dir` (same requirements as for `state.dir`). We plan on improving this in future versions.
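A short sketch of that configuration (the `/tmp/...` paths are just examples; any directory on a normal, non-network filesystem will do):

```bash
# Repo lock that tolerates NFS/CIFS:
dvc config core.hardlink_lock true

# Keep the sqlite state and index databases off the network mount.
# Example paths only -- pick any location on a local filesystem:
dvc config state.dir /tmp/dvc-state
dvc config index.dir /tmp/dvc-index
```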
ok, thanks for those tips. I will test it out and see whether it's worth bumping to v2.8 or staying at v1.11...
@efiop looks like it's best to stay at v1.11 if the shared cache and workspace are both on nfs/cifs. At v2.8, it took about 5 minutes to compute the hash of a directory with 100 files, whereas it took a few seconds on v1.11.
Thanks again for your continued work! Hope I can move to v2.X soon...
That's what I meant :( We are working hard on pre-requisites, stay tuned.
Two related issues on discord:
- https://discord.com/channels/485586884165107732/485596304961962003/932563413777924096
- https://discord.com/channels/485586884165107732/485596304961962003/932663444329615420
Happy to see that there has been some progress on this. This unfortunately makes dvc pretty much unusable for me at the moment, except for small-data projects.
Also @brbarkley , 5 mins sounds great... I have been waiting over an hour for dvc to compute hashes on a directory with 1000+ files....
Upgrading to DVC 2.9.x and adding the config options suggested above seems to resolve this.
Hi! What is the latest news on this issue? Is it fixed? Updating and using the suggested config options fixed the `dvc init` error but it is still very slow.
> Hi! What is the latest news on this issue? Is it fixed? Updating and using the suggested config options fixed the `dvc init` error but it is still very slow.
We have made improvements across several versions, but it is still not fully solved for now.
Could you please provide some more details? E.g. the info from `dvc doctor`.
@karajan1001 of course:
```
dvc doctor
DVC version: 2.9.3 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.13.0-39-generic-x86_64-with-glibc2.29
Supports:
        hdfs (fsspec = 2022.1.0, pyarrow = 7.0.0),
        webhdfs (fsspec = 2022.1.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.1.0, boto3 = 1.20.24)
```
Maybe I have unrealistic time expectations, but how long should I expect `dvc add` to take for a dataset of 139 GB with 1100 files?
@EyescannerJE Could you please run `dvc doctor` inside of a dvc repository? It will have additional info about your setup.
Regarding the timing, it highly depends on the particular setup. We don't yet have benchmarks for NFS (or other similar setups), so I can't even give you an approximate estimate. We currently only have regular benchmarks for a ~20K-image dataset on a regular local filesystem; the results can be seen here: https://docs.iterative.ai/dvc-bench/ We will be introducing bigger datasets to our benchmarks in the future: https://github.com/iterative/dvc-bench/issues/306
@efiop I see, here is the output of `dvc doctor` from inside the dvc repository:
```
DVC version: 2.9.3 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.13.0-39-generic-x86_64-with-glibc2.29
Supports:
        hdfs (fsspec = 2022.1.0, pyarrow = 7.0.0),
        webhdfs (fsspec = 2022.1.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.1.0, boto3 = 1.20.24)
Cache types: hardlink
Cache directory: cifs on //192.168.50.122/data
Caches: local
Remotes: local
Workspace directory: cifs on //192.168.50.122/data
Repo: dvc, git
```
Some benchmarks for NAS setups would be very much appreciated! Adding the 139 GB of data took 5 hours for me
I was trying out the `state.dir` and `index.dir` options and this does indeed solve the performance issue. However, I was wondering about another issue... does this mean that I now have to do all my `dvc checkout` operations from the same machine (as /tmp is not shared across a cluster)? Is there a workaround for this?
Thanks!
> I was trying out the `state.dir` and `index.dir` options and this does indeed solve the performance issue. However, I was wondering about another issue... does this mean that I now have to do all my `dvc checkout` operations from the same machine (as /tmp is not shared across a cluster)? Is there a workaround for this? Thanks!
Hi, @hhoeflin , these files are only local database caches (they store some metadata), not the file cache, so they will not affect your `checkout` operations on other machines.
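If it helps for cluster setups, one possible arrangement (my assumption, not an official recommendation) is to keep these paths in DVC's machine-local config, so each node points at its own scratch directory:

```bash
# Hypothetical per-machine paths; --local writes to .dvc/config.local,
# which is not tracked by git, so each machine can use its own directory.
dvc config --local state.dir /tmp/$USER/dvc-state
dvc config --local index.dir /tmp/$USER/dvc-index
```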
@karajan1001 Thanks for your answer. However, https://dvc.org/doc/user-guide/project-structure/internal-files says:
> `.dvc/tmp/links`: This directory is used to clean up your workspace when calling [dvc checkout](https://dvc.org/doc/command-reference/checkout). It contains a SQLite state database that stores a list of file links created by DVC (from cache to workspace).
This would indicate that when doing a checkout from another machine, the cleanup would not work correctly, no?
And similarly on the same machine, if /tmp gets deleted after a while, the cleanup of checkout would be affected as well?