.dvc file specific cache location

Open johan-sightic opened this issue 3 years ago • 7 comments

Scenario

I am working with several datasets, some very large and some quite small, all tracked by DVC. I have set up a shared cache on a NAS to handle the large datasets, but there is no need to cache the smaller ones on the NAS. And since I don't always have access to the NAS, I would like to cache the smaller datasets locally.

Possible solution

Add an option to specify a cache location for a specific .dvc file that differs from the project-wide cache. For example, this could be another option on output entries, or the cache option could be modified to accept a path.
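
To make the second variant concrete, here is a minimal sketch of how a path-accepting cache option could be resolved per output. This is not DVC code: the current cache option on an output is a boolean, and the string form, the resolve_cache_dir helper, and the example paths are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional, Union

# Hypothetical: today the `cache` field on an output is a boolean.
# This sketch extends it to also accept a path string, as proposed above.
CacheOption = Union[bool, str, None]


def resolve_cache_dir(out_entry: dict, project_cache: str) -> Optional[Path]:
    """Return the cache directory for one output entry, or None if uncached."""
    option: CacheOption = out_entry.get("cache", True)
    if option is False:
        # Caching is disabled for this output.
        return None
    if option is True or option is None:
        # Fall back to the project-wide cache (the shared NAS cache here).
        return Path(project_cache)
    # A string is treated as an output-specific cache location.
    return Path(option)


# Illustrative output entries; the paths are made up.
outs = [
    {"path": "large_dataset", "cache": True},
    {"path": "small_dataset", "cache": ".dvc/local-cache"},
    {"path": "metrics.json", "cache": False},
]
resolved = {out["path"]: resolve_cache_dir(out, "/mnt/nas/dvc-cache") for out in outs}
```

With this shape, existing .dvc files that use a boolean would keep their meaning, and only outputs that name a path would leave the shared cache.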

johan-sightic avatar Apr 25 '22 05:04 johan-sightic

A similar feature has already been implemented for remotes: https://github.com/iterative/dvc/pull/6486

johan-sightic avatar Apr 25 '22 06:04 johan-sightic

Any updates on this?

johan-sightic avatar Feb 03 '23 15:02 johan-sightic

@johan-sightic No updates, unfortunately. We do not plan on implementing this ourselves any time soon.

efiop avatar Feb 03 '23 19:02 efiop

@efiop I just found dvc-in-subdirectories which I think I can use instead

johan-sightic avatar Feb 07 '23 10:02 johan-sightic

https://discord.com/channels/485586884165107732/1093474648932491274/1093511063720448092

johan-sightic avatar Apr 06 '23 13:04 johan-sightic

I guess you still have no plans to implement this? It is still a big pain for us, and the subdir solution and the other hacks we are trying are far from optimal.

I have been trying to implement this myself, but it is not easy to get into the code at this point. Could I get some pointers on what steps to take and which files to modify?

johan-sightic avatar Jan 05 '24 14:01 johan-sightic

@johan-sightic Sorry, no plans from our side :( We are a small team and have our own priorities that we are trying to keep up with at this moment.

Regarding pointers, I'm sorry, I can't break it down to the level of particular files, but I would recommend looking into how something simple like dvc checkout works and going from there. The key pieces these days are really Index and DataIndex, which we build as part of it. See them and _load_storage_from_outs in dvc/repo/index.py. Also note that we already support a remote-per-output feature (see remote in https://dvc.org/doc/user-guide/project-structure/dvc-files), which should be fairly similar to the cache per output that you are trying to achieve.
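
As a rough illustration of the remote-per-output idea, the sketch below groups output entries by their optional remote field and falls back to a default when none is set. It is standalone Python and does not reproduce DVC's internals: the group_outs_by_remote helper and the remote names "nas" and "origin" are assumptions. A cache-per-output feature would presumably need the same kind of per-entry lookup.

```python
from collections import defaultdict
from typing import Dict, List


def group_outs_by_remote(outs: List[dict], default_remote: str) -> Dict[str, List[str]]:
    """Map each remote name to the output paths that should be stored there."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for out in outs:
        # An output without its own `remote` uses the project default.
        remote = out.get("remote") or default_remote
        groups[remote].append(out["path"])
    return dict(groups)


# Illustrative entries shaped like the `outs` list of a .dvc file.
dvc_file_outs = [
    {"path": "large_dataset", "remote": "nas"},
    {"path": "small_dataset"},
]
grouped = group_outs_by_remote(dvc_file_outs, default_remote="origin")
```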

efiop avatar Jan 05 '24 14:01 efiop