DrWatson.jl
Incorporate DVC as an option during setup
Is your feature request related to a problem? Please describe.
In a new DrWatson project, .gitignore ignores all directories in a project tree labeled data/, plots/, and videos/ by default. This helps avoid bloating a repository, because git doesn't handle large files well. But models and data are tightly knit together, and simply excluding all of this from the repository complicates replicating a project's environment with code, dependencies, data, and visualizations.
Describe the solution you'd like
Incorporating DVC would tractably extend git version control to large files. The git remote (e.g. GitHub) does not need to handle the data. Rather, the data would live in a dvc remote (e.g. Google Drive or via SSH/SFTP). DVC adds a lightweight metafile to the git repo to be tracked and versioned, which references the location of the data file itself at its remote. One metafile per data file.
Integration with DrWatson could mean specifying at project initialization whether the project should be set up with or without dvc. For a dvc project, a corresponding .dvc/ would be generated. A no-dvc project would be configured with the current data/-, plots/-, and videos/-ignoring .gitignore (and .gitattributes, see #254), and a dvc project would have a more inclusive .gitignore. Then whenever a file is added to dvc tracking, its newly created *.dvc metafile and the .gitignore file containing the name of the actual data file would be tracked with git.
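To make that last step concrete, here is a minimal sketch in Julia of which files git would end up tracking once a data file is added to dvc. It assumes DVC's default behavior of writing the metafile next to the data file as <name>.dvc and listing the data file in a .gitignore in the same directory. The helper name git_tracked_after_dvc_add and the example path are hypothetical and not part of DrWatson or DVC.

```julia
# Hypothetical helper (not part of DrWatson or DVC): given the path of a data
# file that has been added to dvc tracking, return the two files that git
# should track in its place.
function git_tracked_after_dvc_add(datafile::AbstractString)
    # Lightweight metafile, one per data file, written next to the data.
    metafile = datafile * ".dvc"
    # .gitignore in the same directory, listing the data file so git skips it.
    ignorefile = joinpath(dirname(datafile), ".gitignore")
    return (metafile = metafile, ignorefile = ignorefile)
end

# Illustrative path only; any file under a DrWatson data directory would do.
tracked = git_tracked_after_dvc_add(joinpath("data", "exp_raw", "measurements.csv"))
println(tracked.metafile)
println(tracked.ignorefile)
```

The data file itself never enters the git history; only these two small text files do.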
As a scientific project assistant, versioning data would be a huge help!
Describe alternatives you've considered
DVC's comparison to and integration with other tools and methods. I think git-LFS and git-annex would be the two things closest to DVC, but I've only read some of DVC's material so far. Really, I'm just starting this discussion to see how data tracking might make its way into DrWatson by any means.
Note: no affiliation with DVC or Iterative. I just like their documentation and instructional videos.
This is interesting and I think it is not a bad idea all in all. What is the licensing of DVC?
Apache 2.0
Oh, that's bad. This means that anything that uses this has to be external; we don't want to "pollute" the MIT License.
I'm not too familiar with navigating how well licenses play together, but this wouldn't be derivative of DVC any more than it is of git. User installs DVC, DrWatson calls a few dvc functions to help set up the project, and maybe DrWatson adds dvc stuff into some macros. I don't think dvc source would need to be copied or edited, so I think it's not an issue? 🤔
Oh okay, your suggestion is then OK for me. Can you sketch some kind of API / functions in this issue, so that we have a more concrete idea of how this would look?
Great! I'll think it through. Still just a recruit with Julia, DrWatson, and DVC, so it might take a minute.
I have been using DrWatson + DVC for the past couple months or so to great success. I have a few thoughts that may be helpful:
- We provision dvc through an Artifact to make something like dvc_jll -- shouldn't be too difficult, as I have looked into this before, and it keeps licensing issues separate from DrWatson entirely.
- With thanks to Kristoffer, we now have formal support for weak dependencies (i.e. conditional dependencies) that could allow DrWatson to require dvc_jll as a weak dependency and only download the tool if the user needs it.
- Some functions for an API could be:
  - dvc_init(; remote_dir = datadir("exp_raw")) -- guided process that helps a user set up their dvc instance by setting up a remote, pointing to where the dvc remote should be mounted (in my experience, I recommend the whole of exp_raw, but could see a subdirectory of exp_raw also making sense), and making the call to the native dvc init function
  - dvc_add(dvc_remote_path = datadir("exp_raw"), new_data = "path/to/data") -- process to add data to the dvc remote and update the .dvcignore file
  - dvc_tag() -- this would be unique to DrWatson, but I could see it being useful for experimenters to say what version their experiments use; could be hosted perhaps in the .dvcignore or README or CHANGELOG?
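As a rough illustration of the first two functions, the sketch below only builds the external commands rather than running them, so it works without dvc installed. The function names dvc_init_cmds and dvc_add_cmds, the remote_name and remote_url parameters, and the example paths are all assumptions made for illustration; nothing here exists in DrWatson. In a real DrWatson project the paths would presumably come from datadir("exp_raw").

```julia
# Hypothetical sketch of the proposed API; none of these functions exist in DrWatson.
# Each function only builds the external commands, so nothing is executed
# until the caller passes them to `run`.

# Commands for initializing dvc and registering a default remote.
# `remote_name` and `remote_url` are illustrative placeholders.
function dvc_init_cmds(; remote_name = "storage", remote_url = "/tmp/dvc-storage")
    return [`dvc init`, `dvc remote add -d $remote_name $remote_url`]
end

# Commands for tracking a new data file with dvc and staging the resulting
# metafile and .gitignore with git.
function dvc_add_cmds(new_data::AbstractString)
    metafile = new_data * ".dvc"
    ignorefile = joinpath(dirname(new_data), ".gitignore")
    return [`dvc add $new_data`, `git add $metafile $ignorefile`]
end

cmds = vcat(dvc_init_cmds(), dvc_add_cmds(joinpath("data", "exp_raw", "run1.csv")))
foreach(println, cmds)

# To execute for real (requires dvc and git on the PATH, inside a git repository):
# foreach(run, cmds)
```

Returning Cmd objects keeps the sketch testable and leaves open whether the dvc binary comes from the system or from something like the dvc_jll idea above.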
Just some ideas and not sure if they are at all useful. Happy to keep the discussion going or think more. :smile:
This is a great suggestion! Unfortunately I am unfamiliar with both DVC and the Artifacts system... So someone else would have to contribute this PR, such as @TheCedarPrince, who already has working experience on the matter! I will of course review thoroughly and test the new features!