DrWatson.jl

Incorporate DVC as an option during setup

Open thompsonmj opened this issue 3 years ago • 8 comments

Is your feature request related to a problem? Please describe.

In a new DrWatson project, the default .gitignore ignores every directory in the project tree named data/, plots/, or videos/. This keeps the repository from bloating, since git doesn't handle large files well. But models and data are tightly knit together, and excluding all of this from the repository makes it hard to replicate a project's full environment: code, dependencies, data, and visualizations.

Describe the solution you'd like

Incorporating DVC would extend git version control to large files in a tractable way. The git remote (e.g. GitHub) does not need to handle the data. Instead, the data would live in a DVC remote (e.g. Google Drive, or a server reached via SSH/SFTP). DVC adds a lightweight metafile to the git repo, one per data file, which is tracked and versioned and which references the location of the data file itself at its remote.

One way to integrate this with DrWatson would be to specify at project initialization whether the project should be set up with or without DVC. With DVC, a corresponding .dvc/ directory would be generated. A project without DVC would keep the current .gitignore that ignores data/, plots/, and videos/ (and .gitattributes, see #254), while a DVC project would get a more inclusive .gitignore. Then, whenever a file is added to DVC tracking, its newly created *.dvc metafile and the .gitignore file containing the name of the actual data file would be tracked with git.
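To make the last point concrete, here is a minimal Julia sketch of which two files git would need to pick up after a `dvc add`. It relies on DVC's documented behaviour of writing `<datafile>.dvc` next to the data file and listing the data file in a `.gitignore` in the same directory. The function name `git_files_after_dvc_add` is hypothetical and exists in neither DrWatson nor DVC.

```julia
# Hypothetical helper for illustration only; not part of DrWatson or DVC.
# It does not run DVC. It only computes the paths that `dvc add <datafile>`
# is expected to create or update, so that they can be staged with git.
function git_files_after_dvc_add(datafile::AbstractString)
    # DVC writes the metafile next to the data file, with ".dvc" appended.
    metafile = datafile * ".dvc"
    # DVC lists the data file in a .gitignore located in the same directory.
    gitignore = joinpath(dirname(datafile), ".gitignore")
    return (metafile, gitignore)
end

metafile, gitignore = git_files_after_dvc_add(joinpath("data", "exp_raw", "run1.csv"))
println("git add ", metafile, " ", gitignore)
```

The data file itself stays out of git; only the small metafile and the `.gitignore` entry are versioned there.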

As a scientific project assistant, versioning data would be a huge help!

Describe alternatives you've considered

DVC's comparison to and integration with other tools and methods. I think git-LFS and git-annex would be the two things closest to DVC, but I've only read some of DVC's material so far. Really, I'm just starting this discussion to see how data tracking might make its way into DrWatson by any means.

Note: no affiliation with DVC or Iterative. I just like their documentation and instructional videos.

thompsonmj avatar May 24 '21 15:05 thompsonmj

This is interesting and I think it is not a bad idea all in all. What is the licensing of DVC?

Datseris avatar May 24 '21 21:05 Datseris

Apache 2.0

thompsonmj avatar May 24 '21 21:05 thompsonmj

Oh, that's bad. This means that anything that uses this has to be external; we don't want to "pollute" the MIT License.

Datseris avatar May 25 '21 07:05 Datseris

I'm not too familiar with navigating how well licenses play together, but this wouldn't be derivative of DVC any more than it is of git. User installs DVC, DrWatson calls a few dvc functions to help set up the project, and maybe DrWatson adds dvc stuff into some macros. I don't think dvc source would need to be copied or edited, so I think it's not an issue? 🤔

thompsonmj avatar May 25 '21 12:05 thompsonmj

Oh okay, your suggestion is then OK for me. Can you sketch some kind of API / functions in this issue, so that we have a more concrete idea of how this would look?

Datseris avatar May 25 '21 12:05 Datseris

Great! I'll think it through. Still just a recruit with Julia, DrWatson, and DVC, so it might take a minute.

thompsonmj avatar May 25 '21 13:05 thompsonmj

I have been using DrWatson + DVC for the past couple of months or so with great success. I have a few thoughts that may be helpful:

  1. We could provision dvc through an Artifact to make something like dvc_jll -- this shouldn't be too difficult, as I have looked into it before, and it keeps licensing issues separate from DrWatson entirely.
  2. With thanks to Kristoffer, we now have formal support for weak dependencies (i.e. conditional dependencies), which could allow DrWatson to declare dvc_jll as a weak dependency and only download the tool if the user needs it.
  3. Some functions for an API could be:
    • dvc_init(; remote_dir = datadir("exp_raw")) -- a guided process that helps a user set up their dvc instance: setting up a remote, pointing to where the dvc remote should be mounted (in my experience, I recommend the whole of exp_raw, but I could see a subdirectory of exp_raw also making sense), and making the call to the native dvc init function
    • dvc_add(dvc_remote_path = datadir("exp_raw"), new_data = "path/to/data") -- a process to add data to the dvc remote and update the .dvcignore file
    • dvc_tag() -- this would be unique to DrWatson, but I could see it being useful for experimenters to say what version their experiments are; it could perhaps be hosted in the .dvcignore, README, or CHANGELOG?
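As a rough illustration of the first two functions, thin wrappers could start out as plain command builders. This is only a sketch under assumptions: the names `dvc_init_cmd` and `dvc_add_cmd` are hypothetical, nothing like them exists in DrWatson, and the guided remote setup described above is left out. The only DVC features used are the `dvc init` and `dvc add` subcommands and the `--subdir` flag of `dvc init`.

```julia
# Hypothetical helpers for illustration; neither exists in DrWatson.
# Both only *build* a command object. Pass the result to `run` to execute it,
# which requires the `dvc` executable to be available on PATH.

"Build the `dvc init` command; `subdir = true` adds `--subdir` for projects nested inside a larger git repository."
dvc_init_cmd(; subdir::Bool = false) = subdir ? `dvc init --subdir` : `dvc init`

"Build the `dvc add` command for one or more data files or directories."
function dvc_add_cmd(paths::AbstractString...)
    isempty(paths) && throw(ArgumentError("at least one path is required"))
    # Interpolating a vector into a command expands it into separate arguments.
    return `dvc add $(collect(paths))`
end

println(dvc_init_cmd())
println(dvc_add_cmd("data/exp_raw"))
```

Inside a DrWatson project, one would presumably call something like `run(dvc_add_cmd(datadir("exp_raw")))`. Building the command separately from running it keeps the wrappers easy to test without DVC installed.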

Just some ideas and not sure if they are at all useful. Happy to keep the discussion going or think more. :smile:

TheCedarPrince avatar Dec 16 '22 16:12 TheCedarPrince

This is a great suggestion! Unfortunately, I am unfamiliar with both DVC and the Artifacts system... So someone else would have to contribute this PR, such as @TheCedarPrince, who already has working experience on the matter! I will of course review it thoroughly and test the new features!

Datseris avatar Dec 20 '22 15:12 Datseris