dvc
.dvc file specific cache location
Scenario
I am working with several datasets, some very large and some quite small, all tracked by DVC. I have set up a shared cache on a NAS to handle the large datasets, but there is no need to cache the smaller ones there. And since I don't always have access to the NAS, I would like to cache the smaller datasets locally.
Possible solution
Add an option to specify a cache location for a specific .dvc file that differs from the project-wide cache. For example, this could be another option on output entries, or the existing cache option could be modified to accept a path.
A similar feature has already been implemented for remotes: https://github.com/iterative/dvc/pull/6486
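To make the request concrete, here is a minimal sketch of how a per-output cache location could be resolved. The `cache_dir` field, the `resolve_cache_dir` helper, and the example paths are all hypothetical; nothing like this exists in DVC today, and the parsed output entry is modelled as a plain dict.

```python
from pathlib import PurePosixPath

# Project-wide cache, e.g. the shared cache on the NAS (example path only).
PROJECT_CACHE = PurePosixPath("/mnt/nas/dvc-cache")


def resolve_cache_dir(
    out_entry: dict,
    dvcfile_dir: PurePosixPath,
    project_cache: PurePosixPath = PROJECT_CACHE,
) -> PurePosixPath:
    """Return the cache directory to use for one output entry of a .dvc file.

    `cache_dir` is a HYPOTHETICAL per-output field, not part of DVC.
    """
    override = out_entry.get("cache_dir")
    if override is None:
        # No override: fall back to the project-wide cache.
        return project_cache
    path = PurePosixPath(override)
    # Assumption: relative overrides are resolved against the .dvc file's directory.
    return path if path.is_absolute() else dvcfile_dir / path
```

With this sketch, an entry such as `{"path": "small.csv", "cache_dir": ".local-cache"}` in `/repo/data` would resolve to `/repo/data/.local-cache`, while an entry without the field would keep using the NAS cache.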
Any updates on this?
@johan-sightic No updates, unfortunately. We do not plan on implementing this ourselves any time soon.
@efiop I just found dvc-in-subdirectories which I think I can use instead
https://discord.com/channels/485586884165107732/1093474648932491274/1093511063720448092
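For anyone reading along, a rough sketch of that workaround, as I understand it: initialize a separate DVC project inside a subdirectory that holds the small datasets, so it gets its own cache while the root project keeps the NAS cache. The helper names and the paths are made up for illustration. `dvc init --subdir` and `dvc cache dir --local` are existing commands, but check `dvc init --help` and `dvc cache dir --help` for your DVC version before relying on this.

```python
import subprocess
from pathlib import Path


def subdir_setup_commands(local_cache: str) -> list[list[str]]:
    """Commands to run inside the subdirectory that holds the small datasets."""
    return [
        # Create a nested DVC project inside an existing Git repository.
        ["dvc", "init", "--subdir"],
        # Optional: point this project's cache at a local directory.
        # --local writes to .dvc/config.local, which is not committed to Git.
        ["dvc", "cache", "dir", "--local", local_cache],
    ]


def run_setup(subdir: Path, local_cache: str) -> None:
    """Run the setup commands in `subdir`; requires the `dvc` CLI on PATH."""
    for cmd in subdir_setup_commands(local_cache):
        subprocess.run(cmd, cwd=subdir, check=True)
```

Calling `run_setup(Path("data/small"), "/home/me/.cache/dvc-small")` would set up the nested project. The second command is only needed if the default cache inside the subdirectory's `.dvc` folder is not where you want the data.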
I guess you still have no plans to implement this? It is still a big pain for us, and the subdir solution and the other hacks we are trying are far from optimal.
I have been trying to implement this myself but it is not easy to get into the code at this point. Could I have some pointers on what steps to take and which files to modify?
@johan-sightic Sorry, no plans from our side :( We are a small team and have our own priorities that we are trying to keep up with at this moment.
Regarding pointers, I'm sorry, I can't break it down to the level of particular files, but I would recommend looking into how something simple like dvc checkout works and going from there. The key pieces these days are Index and DataIndex, which we build as part of it. See them and _load_storage_from_outs in dvc/repo/index.py. Also notice that we support a remote per output (see remote in https://dvc.org/doc/user-guide/project-structure/dvc-files), which should be fairly similar to the cache per output that you are trying to achieve.
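As a way to think about the idea, here is a toy model of a storage map keyed by output path, where each output can carry its own cache and remote and everything else falls back to a default. It does not use or reproduce DVC's Index, DataIndex, or _load_storage_from_outs; the `Storage` and `StorageMap` classes and all paths are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# A path split into parts, e.g. ("data", "small", "a.csv").
Key = tuple[str, ...]


@dataclass(frozen=True)
class Storage:
    cache: str                    # cache location used for this output
    remote: Optional[str] = None  # remote name, if the output pins one


class StorageMap:
    """Maps output paths to storage, with a project-wide default."""

    def __init__(self, default: Storage) -> None:
        self._default = default
        self._by_prefix: dict[Key, Storage] = {}

    def add(self, out_key: Key, storage: Storage) -> None:
        """Register storage for an output (a file or a whole directory)."""
        self._by_prefix[out_key] = storage

    def lookup(self, key: Key) -> Storage:
        """Return the storage of the longest registered prefix of `key`."""
        for end in range(len(key), 0, -1):
            found = self._by_prefix.get(key[:end])
            if found is not None:
                return found
        return self._default
```

For example, after registering `("data", "small")` with a local cache, a lookup for `("data", "small", "a.csv")` returns that local storage, while `("data", "large", "b.bin")` falls back to the default.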