how: use DVC when data is stored in an external drive

Open dashohoxha opened this issue 5 years ago • 15 comments

E: Check whether #520 was done first... See also #899


This doc should explain the best solution (or a couple of possible solutions) for this situation.

Example: the data is located on a 16TB partition of an external drive, while the DVC project is in /home, on a 320GB partition.

Context: https://discordapp.com/channels/485586884165107732/485596304961962003/611244643685892153

dashohoxha avatar Aug 15 '19 17:08 dashohoxha

@shcheklein can I give this a try?

dashohoxha avatar Aug 15 '19 17:08 dashohoxha

These are related/solve similar problems:

https://github.com/iterative/dvc.org/pull/455 (fixes #103)

https://dvc.org/doc/use-cases/multiple-data-scientists-on-a-single-machine

Keep in mind:

https://github.com/iterative/dvc.org/issues/497

shcheklein avatar Aug 15 '19 17:08 shcheklein

> @shcheklein can I give this a try?

@dashohoxha absolutely! just take a look at those tickets I mentioned above ^^ They potentially overlap, for one of them a PR is almost done.

shcheklein avatar Aug 15 '19 17:08 shcheklein

The solution described by @efiop (tracking a data file that is external (outside the dvc project)) seems to be a different solution. Having a remote DVC cache (same as multiple-users-on-a-single-machine) is another solution. The NFS case seems to have a similar solution to multiple-users-on-a-single-machine.

dashohoxha avatar Aug 15 '19 17:08 dashohoxha

@dashohoxha gotcha. This is a different one indeed. This section - https://dvc.org/doc/user-guide/external-outputs and this one https://dvc.org/doc/user-guide/external-dependencies should be reorganized/taken into account.

Also, keep in mind: my take on this is that there should be a very strong reason to complicate your workflow with external deps/outs/cache in the case of multiple drives. As I mentioned on Discord, I think in most cases the ideal scenario is to use an external cache and symlinks (similar to the NFS and shared cache scenarios).
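
As a rough illustration of why an external cache plus symlinks suits this situation, here is a minimal Python sketch of the mechanism. It is not DVC's actual implementation: the function names (`file_md5`, `cache_and_link`) and the two-level hash layout are assumptions made for the example, and it assumes a POSIX system where symlinks can be created. In DVC itself, the corresponding settings are the cache location (`dvc cache dir`) and the `cache.type` config option.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never read into memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_and_link(data_file: Path, cache_dir: Path) -> Path:
    """Move data_file into cache_dir (keyed by content hash) and leave a symlink."""
    checksum = file_md5(data_file)
    # Hypothetical layout: <cache>/<first 2 hex chars>/<remaining chars>
    cached = cache_dir.resolve() / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if cached.exists():
        data_file.unlink()  # same content is already cached; drop the duplicate
    else:
        shutil.move(str(data_file), str(cached))  # works across filesystems
    data_file.symlink_to(cached)  # the workspace keeps only a link
    return cached


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "project"      # stands in for the small /home partition
        big_drive = Path(tmp) / "big-drive"  # stands in for the large external drive
        project.mkdir()
        sample = project / "data.bin"
        sample.write_bytes(b"example payload")
        target = cache_and_link(sample, big_drive / "dvc-cache")
        print(sample, "->", target)
```

Because the workspace holds only a link, the small partition stores almost nothing while the content itself lives on the large drive.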

shcheklein avatar Aug 15 '19 17:08 shcheklein

> This section - https://dvc.org/doc/user-guide/external-outputs and this one https://dvc.org/doc/user-guide/external-dependencies should be reorganized/taken into account.

They seem accurate to me (unless there is some missing information that I don't know). The problem is that it is difficult for the user to read all the details and intricacies on user guides and manual pages, and find the best solution for his case. Showing him what the best solution would be in a particular case (or a similar case) should be helpful.

dashohoxha avatar Aug 16 '19 15:08 dashohoxha

@dashohoxha your PR looks good. There are some improvements that can be made, which I'll review and let you know about, but first I would like to understand the "use case" itself better: what the possible solutions for it are, how we should improve those sections in the User Guide, and how all this corresponds with the shared machine case (when there is a single cache set up on a separate partition). Without this holistic plan, we are potentially duplicating information, we are not properly communicating the use case, and we are not properly structuring the User Guide.

To give just some concerns:

  1. The title, "Huge data on external local drive". It's a very confusing title for the use case, from the "external local drive" (is it external or local after all?) to the way it's formulated (huge data is probably not the problem; versioning or managing it is). "Huge" is also a very vague term. Some people use a single huge drive for everything.

Some better titles off the top of my head: "Managing data storage on a separate drive", "Versioning data and processing data outside your repo", etc.

  2. No matter how good a name we come up with, there should be some integration with other parts of the docs (user guide, versioning examples). For example, in most cases we assume that the data is part of your workspace. Why don't we clarify somehow that if your data is substantially large there are ways to manage it "externally"?

  3. Back to the use case. It's basically about trying to version files that are located on a second large drive (it can be a second large HDD, it can be some shared NAS, etc. - the point is that it's a second large volume with tons of data and tons of space on it). Using external outs/deps is not the only way to deal with this. It's also not ideal. Should we include different ways of doing this in the use case, like "local external cache" + links? They overlap substantially to my mind.

  4. User Guide part of it. If a use case (especially its title) should be written in a way that immediately matches the user's request (rule of thumb: what words would I use to describe this situation if I needed to ask a question on chat?), then the User Guide is more like a well-structured manual. For example, "Managing External Data" is a good section that should actually combine external deps, external outs, some intro, and an overview of the use cases with links and instructions on how specific cases could be solved.

So let's please discuss and understand the strategy behind this.

@jorgeorpinel would love to hear your opinion on this.

shcheklein avatar Aug 16 '19 22:08 shcheklein

> Without this holistic plan, we are potentially duplicating information, we are not properly communicating the use case, and we are not properly structuring User Guide.

Yes!

It's funny because I've been noticing significant confusion around external X topics so I opened #566 recently. I also feel like we may need to regroup and figure out the connections between all the external data stuff before deciding which docs to change.

That said it's good to have more use cases and I'll review the PR but if we don't figure out the big picture, this doc may only add to the confusion of some users, like Dashamir mentioned in https://github.com/iterative/dvc.org/issues/563#issuecomment-522054626.

jorgeorpinel avatar Aug 17 '19 01:08 jorgeorpinel

Questions about https://github.com/iterative/dvc.org/issues/563#issuecomment-522171263 @shcheklein:

> No matter how good a name we come up with, there should be some integration with other parts of the docs (user guide, versioning examples)... Why don't we clarify somehow that if your data is substantially large there are ways to manage it "externally".

Do you mean to add notes and links in all other documents where it can be useful?

> User Guide part of it...

Similar question. Are you suggesting that Dashamir update the existing user guides accordingly in the same PR (#565)?

jorgeorpinel avatar Aug 17 '19 01:08 jorgeorpinel

@shcheklein I believe I get your point. These discussions certainly help me to think about the problems and to look for solutions.

I am still trying to understand DVC and figure out any problems with the docs, what can be improved etc. I am also trying to follow the discussions (as much as I can), which may help me with understanding the problems. The simple tasks that I try to do are just to get familiar with the workflow, the tools, the community, and with DVC itself and its docs (of course).

So, I don't have any quick answers yet. Are you asking me to finish the hard part of the job without even starting yet? :)

dashohoxha avatar Aug 17 '19 10:08 dashohoxha

@dashohoxha not at all! it was not even a critique of your PR, it was an attempt from my end to systematize my thoughts about the current state of the stuff related to the external data management, external cache, NFS, etc, and come up with some initial strategy.

I'll review the latest changes asap (we are traveling now, so please give us a bit more time).

shcheklein avatar Aug 20 '19 05:08 shcheklein

I think that at this point it's unclear whether a how-to is needed, and most of the content will probably be covered by #520. Can we close this, @shcheklein? Thanks

jorgeorpinel avatar May 17 '21 02:05 jorgeorpinel

Might be. It's not clear to me how #520 will evolve. This one is quite precise, and I would close it when we clearly see that it is addressed (by #520 or whatever else). And this one is important indeed. It might be more important than clarifying --external, for example.

shcheklein avatar May 17 '21 02:05 shcheklein

My take on this in general is that you have 4 routes when working with data from external drives:

  1. Download it (get, import(-url)) -- not useful when the local drive is smaller than the data
  2. Manage it in-place with an external cache. Potentially in a shared cache. See also https://github.com/iterative/dvc.org/issues/520
  3. Transfer it directly to remote storage to use later in an env with a larger drive or appropriate cache setup.
  4. Ad hoc methods like virtually mounting external folders inside the DVC project dir, or other "tricks".

Other than # 4, which we probably don't need to document, we have info about all of this in the docs. We may just need to consolidate it somewhere in the future Data Management guides. I added a bullet there, and with that and #520 I think we should close this as redundant.
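
The caveat on route 1 (downloading is not useful when the local drive is smaller than the data) can be checked before transferring anything. The following is a minimal Python sketch and not part of DVC: `tree_size`, `fits_locally`, and the `headroom` factor are hypothetical names introduced here. The default factor of 2 is an assumption that the data may end up stored twice (once in the cache, once in the workspace) when files are copied rather than linked.

```python
import shutil
from pathlib import Path


def tree_size(root: Path) -> int:
    """Total size in bytes of regular files under root (symlinks are skipped)."""
    return sum(
        p.stat().st_size
        for p in root.rglob("*")
        if p.is_file() and not p.is_symlink()
    )


def fits_locally(data_dir: Path, project_dir: Path, headroom: float = 2.0) -> bool:
    """True if project_dir's drive has room for `headroom` copies of data_dir."""
    free = shutil.disk_usage(project_dir).free
    return tree_size(data_dir) * headroom <= free
```

For example, `fits_locally(Path("/mnt/external/data"), Path.home())` (both paths are placeholders) returning `False` would suggest that route 2 or 3 is the better fit.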

jorgeorpinel avatar Jul 28 '22 22:07 jorgeorpinel

> Back to the use case. It's basically about trying to version files that are located on the second large drive (it can be second large HDD, it can be some shared NAS, etc

Should we repurpose this ticket to focus specifically on managing external data on NAS? @shcheklein

More details in https://discuss.dvc.org/t/setup-dvc-to-work-with-shared-data-on-nas-server/180 (top forum question)

jorgeorpinel avatar Jul 29 '22 02:07 jorgeorpinel

We made updates to the guide about external data as part of the 3.0 release, so closing since I don't see additional actions we can take right now. Feel free to reopen if I missed something.

dberenbaum avatar Oct 17 '23 00:10 dberenbaum