DrWatson.jl
Flexible paths to the datasets
In my experience, the requirement to have all project-related data in a specific folder is a serious constraint. In computational biology we often use one dataset for multiple projects, and datasets can take dozens of gigabytes. So it's impractical to copy each dataset for each new project. What do you think about extending the datadir function to allow paths outside of the project?
Some time ago I made an R package pursuing the same goal (dataorganizer), and there I used a data_mapping.yml file to allow adjusting the paths to folders. It also has a function DatasetPath, which can be a shortcut to some subfolders/files inside the data folder, or to some completely different files. I understand that having data scattered around could make the code less reproducible. But it's not the default, and in the use case I described, the data has to be properly downloaded from scattered resources anyway, as it can't be stored inside the git repo.
Hi there,
I may not have understood exactly what the "problem" is.
the requirement to have all project-related data in a specific folder is a serious constraint
Indeed, that is a serious constraint, which is why DrWatson doesn't enforce anything like that. No one says that you should absolutely have your data in the data folder. Where did you get this idea from?
What do you think about extending the datadir function to allow paths outside of the project?
I have difficulty imagining what this means. If you want a folder outside the project directory why not just.... you know, give the path to that folder in your load function...?
Seems to me that the easiest way to solve your "problem" (not a problem in my eyes, but something reasonable that happens in several of my personal projects) is to make a symlink in your data folder. I don't do that by the way, I just define a constant directory in my source folder, like CERES_DIR = "whatever/directory/to/large/local/file" and use it.
Having read a bit of the repo you linked, it seems like you are suggesting some kind of "folder mapping" functionality, so that some folders (like data) can be made to actually mean some other folder somewhere else, and each user configures what "somewhere else" means?
Hi @Datseris, thanks for the answer!
No one says that you should absolutely have your data in the data folder. Where did you get this idea from?
I mean, there is no suggested solution for dealing with such paths in a way that preserves the reproducibility of the notebooks. Putting these paths into the notebooks is definitely not an option if one wants to share the code. Please just point me to the relevant part of the documentation if I missed it.
I just define a constant directory in my source folder, like CERES_DIR = "whatever/directory/to/large/local/file" and use it.
Agreed, it's possible to store it in the src folder, I didn't think of it. Thanks! Though for multiple datasets in different locations it would require creating different global variables or a dictionary of paths (e.g. DATASETS), and accessing it would require calling joinpath(DATASETS["dataset_name"], "subfolder", "file.ext"). So it would probably make sense for a user to define a function datasetdir(dataset_name::AbstractString, args...) = joinpath(DATASETS[dataset_name], args...). And if I want to support users who just downloaded all data to subfolders of "data", I'd also need to support dataset paths relative to datadir(), which complicates the function. Not a big deal, of course, but in my opinion, addressing this problem on the side of DrWatson would be more elegant. And so would storing paths in a standardized file instead of some random place in the code. :)
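For illustration, the helper described above could look roughly like the following. This is only a minimal sketch, not something DrWatson provides: `project_datadir` is a hypothetical stand-in for `datadir()` so that the snippet runs without an active project, and the dataset names and paths in `DATASETS` are made up.

```julia
# Hypothetical stand-in for DrWatson's `datadir()`, so the sketch runs without
# an active project; in a real project one would call `datadir` instead.
project_datadir(args...) = joinpath(pwd(), "data", args...)

# Hypothetical per-user mapping from dataset names to locations.
# Absolute paths may point anywhere on disk; relative ones are taken
# to live under the project's data folder.
const DATASETS = Dict(
    "atlas" => joinpath(homedir(), "shared", "atlas"),  # outside the project
    "pilot" => "pilot_run",                             # inside data/
)

function datasetdir(dataset_name::AbstractString, args...)
    haskey(DATASETS, dataset_name) ||
        error("Unknown dataset \"$dataset_name\"; add its path to DATASETS")
    root = DATASETS[dataset_name]
    # Relative entries are resolved against the data folder.
    base = isabspath(root) ? root : project_datadir(root)
    return joinpath(base, args...)
end
```

With this, `datasetdir("atlas", "subfolder", "file.ext")` resolves to a path outside the project, while `datasetdir("pilot", "file.ext")` stays under the data folder.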
Seems to me that the easiest way to solve your "problem" is to make a symlink in your data folder
Explaining this to a user who just wants to reproduce a part of your research seems like an unnecessary complication to me. Instead of providing a template file with the names of all datasets I used in the code and a comment "PUT THE PATH TO THE DATA HERE" I'll have to write a full readme "If you store the dataset X in a different folder, please put a symlink to the directory named ...".
some folders (like data) can be made to actually mean some other folder somewhere else, and each user configures what "somewhere else" means?
Yep. Perhaps I wouldn't bother having this file just to reconfigure the data folder, but having it in addition to the individual paths for datasets is a nice option.
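To make the idea concrete, here is a rough sketch of such a mapping file. It swaps YAML for TOML purely because a TOML parser ships with Julia's standard library. The file layout, the key names (`data`, `datasets`), and the function `load_dataset_paths` are all hypothetical; nothing like this exists in DrWatson.

```julia
using TOML  # standard library since Julia 1.6

# Hypothetical contents of a per-user mapping file: an optional override
# for the data folder plus individual dataset locations.
example_config = """
data = "/mnt/storage/project_data"

[datasets]
atlas = "/mnt/shared/atlas"
pilot = "pilot_run"
"""

# Hypothetical loader: returns the data root and a name => path dictionary.
# Relative dataset paths are resolved against the (possibly overridden) data root.
function load_dataset_paths(config::AbstractString, default_data::AbstractString)
    parsed = TOML.parse(config)
    data_root = get(parsed, "data", default_data)
    paths = Dict{String,String}()
    for (name, p) in get(parsed, "datasets", Dict{String,Any}())
        paths[name] = isabspath(p) ? p : joinpath(data_root, p)
    end
    return data_root, paths
end

data_root, paths = load_dataset_paths(example_config, joinpath(pwd(), "data"))
```

A repository could then ship a template of this file with empty values, and each user would fill in their own locations.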
That's cool, but I don't see how your suggestion is more elegant, more concise, or simpler to explain than just putting raw paths into src. Can you provide justification?
You need as many "PUT THE PATH TO THE DATA HERE" entries as you have different paths. Whether you put these paths in the source or in your .yaml configuration, you still need to write the same number of paths.
In fact, the .yaml configuration approach is strictly more complex, as it requires the user to write code in one more programming language.
Perhaps it is easier for me to understand if you provide a concrete plan of how this should be done from the side of DrWatson.
Just an idea, haven't tested it. You could use https://github.com/helgee/RemoteFiles.jl and symlink in your internal version of the project. Don't know if RemoteFiles.jl checks the linked file or the link itself. If it checks the actual linked file, the download command would not re-download the file so the path just points to your version of the file outside the DrWatson project.
What you get in addition is that your project is now reproducible, because for others who don't have the file in their datadir, RemoteFiles.jl actually downloads the file directly to data.
Btw. I am constantly dealing with TB and PB of data outside of my projects, scattered all over the world across different computing centres. I think that managing these goes way beyond the scope of DrWatson.jl. In our collaboration, we have specific tools to access these files (and wrappers for Python and Julia), so the files can be used in mass processing scripts which run on the Grid. If your data is not scattered that much, you can even consider having some initialisation scripts which set up NFS mounts or tape access (whatever you use, iRODS, xrootd etc.) or simple symbolic links in the data folder. To me, data should remain in data.
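As a minimal sketch of the symbolic-link variant of such an initialisation script, using only functions from Julia's Base: the helper name `link_dataset` and its arguments are hypothetical, and on some systems creating symlinks may need extra privileges.

```julia
# Hypothetical helper: expose an external dataset inside the project's
# data folder through a symbolic link, so code can keep reading from data/.
function link_dataset(target::AbstractString, data_folder::AbstractString,
                      name::AbstractString)
    mkpath(data_folder)                 # create data/ if it does not exist yet
    link = joinpath(data_folder, name)
    # Leave an existing entry (including a dangling link) untouched.
    if islink(link) || ispath(link)
        return link
    end
    symlink(target, link)
    return link
end
```

Calling `link_dataset("/path/to/large/dataset", "data", "dataset_name")` once per machine would then let the rest of the project refer to the dataset by its path under the data folder.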
I opened a new issue #255 that might be relevant. If DVC were incorporated, it wouldn't technically matter where data was stored (though keeping it all in data/ is probably still a good standard to aim for). It could be versioned with git without bloating. I think @sebastianpech's idea of using RemoteFiles.jl might integrate well with this too!