Data versioning
Given how complicated our data picture is becoming, it's probably worth being more formal about how we track data alongside the versions of the repo that created it. As far as I can tell, we essentially have 5 groups of data artifacts:
- Training background segment
- Training glitches
- Training waveforms (separate because we don't do SNR rejection)
- Testing background segments
- Testing waveforms
There's an interesting tool out there called DVC which is built for something like this (and I'm sure there are several others). There's a basic tutorial here, and there's a Python API that we could integrate into our code.
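To make that concrete, here's a minimal sketch of what reading a tracked artifact through DVC's Python API could look like; the path, repo URL, and tag below are placeholders, not our actual layout:

```python
# Minimal sketch using DVC's Python API; the path, repo URL, and revision
# are all hypothetical placeholders.
import dvc.api

# Open a tracked file at the revision of the repo that produced it,
# so the data we load is tied to a specific version of the code
with dvc.api.open(
    "data/train/waveforms.h5",                    # hypothetical tracked artifact
    repo="https://github.com/our-org/our-repo",   # repo that tracks the data
    rev="v0.1.0",                                 # git tag or commit for that data
    mode="rb",
) as f:
    raw_bytes = f.read()
```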
I'm still figuring out how these tools work, but I think a potential setup could look like:
- On each cluster, create a shared data cache that the code points to, maybe via environment variables (ideally we would have one cache shared between all clusters, but I'm not sure how feasible that is)
- The `datagen` project is moved out of `sandbox` and into its own pipeline that gets run as part of CI/CD, and that data gets pushed to the relevant caches. This is considered the "production" data. We can start seeding the data generation so that e.g. sampled waveforms won't change if the code for generating them hasn't changed.
- At experiment run time, we perform some check to ensure that the data being loaded is consistent with the current version of the repo (see the sketch after this list)
- Users can host their own local caches for experimenting with new data generation mechanisms
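A rough sketch of what the cache-plus-consistency-check piece could look like, assuming the datagen pipeline records the commit it ran from next to the data it produces (`DATA_CACHE` and `datagen_commit.txt` are made-up names for illustration):

```python
import os
import subprocess
from pathlib import Path


def current_commit() -> str:
    # Commit of the repo the experiment is currently running from
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()


def validate_data_cache() -> Path:
    # Shared per-cluster cache, pointed to by an environment variable
    cache = Path(os.environ["DATA_CACHE"])

    # Commit the datagen pipeline recorded when it produced this data
    recorded = (cache / "datagen_commit.txt").read_text().strip()

    if recorded != current_commit():
        raise RuntimeError(
            f"Data in {cache} was generated at commit {recorded[:8]}, "
            f"but the repo is at {current_commit()[:8]}: regenerate the data "
            "or check out the matching version of the code."
        )
    return cache
```

In practice DVC's own `.dvc` files and git metadata could probably stand in for the hand-rolled commit file, but the basic check at load time would look the same.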
If something like this works, we can even think about tracking experiment artifacts this way as well, but that's obviously further down the line.