Data versioning
Given how complicated our data picture is becoming, it's probably worth being more formal about how we track data alongside the versions of the repo that created it. As far as I can tell, we essentially have 5 groups of data artifacts:
- Training background segment
- Training glitches
- Training waveforms (separate because we don't do SNR rejection)
- Testing background segments
- Testing waveforms
There's an interesting tool out there called DVC which is built for something like this (and I'm sure there are several others). There's a basic tutorial here, and there's a Python API that we could integrate into our code.
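To make that concrete, here's a minimal sketch of what reading a tracked artifact through DVC's Python API could look like; the path, repo URL, and tag below are placeholders, not our actual layout:

```python
# Minimal sketch using DVC's Python API; the path, repo URL, and revision
# are all hypothetical placeholders.
import dvc.api

# Open a tracked file at the revision of the repo that produced it,
# so the data we load is tied to a specific version of the code
with dvc.api.open(
    "data/train/waveforms.h5",                    # hypothetical tracked artifact
    repo="https://github.com/our-org/our-repo",   # repo that tracks the data
    rev="v0.1.0",                                 # git tag or commit for that data
    mode="rb",
) as f:
    raw_bytes = f.read()
```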
I'm still figuring out how these tools work, but I think a potential setup could look like:
- On each cluster, create a shared data cache that the code points to, maybe via environment variables (ideally we would have one cache shared between all clusters, but I'm not sure how feasible that is)
- The `datagen` project is moved out of `sandbox` and into its own pipeline that gets run as part of CI/CD, and that data gets pushed to the relevant caches. This is considered the "production" data. We can start seeding the data generation so that e.g. sampled waveforms won't change if the code for generating them hasn't changed.
- At experiment run time, we perform some check to ensure that the data being loaded is consistent with the current version of the repo (see the sketch after this list)
- Users can host their own local caches for experimenting with new data generation mechanisms
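A rough sketch of what the cache-plus-consistency-check piece could look like, assuming the datagen pipeline records the commit it ran from next to the data it produces (`DATA_CACHE` and `datagen_commit.txt` are made-up names for illustration):

```python
import os
import subprocess
from pathlib import Path


def current_commit() -> str:
    # Commit of the repo the experiment is currently running from
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()


def validate_data_cache() -> Path:
    # Shared per-cluster cache, pointed to by an environment variable
    cache = Path(os.environ["DATA_CACHE"])

    # Commit the datagen pipeline recorded when it produced this data
    recorded = (cache / "datagen_commit.txt").read_text().strip()

    if recorded != current_commit():
        raise RuntimeError(
            f"Data in {cache} was generated at commit {recorded[:8]}, "
            f"but the repo is at {current_commit()[:8]}: regenerate the data "
            "or check out the matching version of the code."
        )
    return cache
```

In practice DVC's own `.dvc` files and git metadata could probably stand in for the hand-rolled commit file, but the basic check at load time would look the same.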
If something like this works, we can even think about tracking experiment artifacts this way as well, but that's obviously further down the line.