
Explore DVC and CML for demonstrating continuous data-driven integration

Open aornugent opened this issue 3 years ago • 2 comments

Reproducible research requires publishing all code and data necessary to derive the original results. Independently replicating results reinforces our belief that the analysis is valid and correct. However, data are often poorly supported by existing version control systems.

Data Version Control (DVC) and its companion software, Continuous Machine Learning (CML), provide a framework to continuously reproduce analyses using GitHub Actions. I'd like to demonstrate this functionality using the FLINTcloud images so that other users can incorporate it into their implementations. It is unlikely, however, that we will want this feature to be active on the FLINTcloud repo; rather, the results of this issue should be published as documentation or a case study.

The first step is to develop a 'run-gcbm' stage, where the GCBM Demo Run is executed within a single Python script. This stage, the input data and the output results are added to DVC, creating a dvc.yaml file and a dvc.lock file. When the DVC configuration is pushed to a repository, the checksums of the model inputs are committed to version control, allowing us to track when they change. Additional pre- and post-processing stages can be added to develop a complete DVC pipeline.
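To illustrate why committing checksums is enough to detect changed inputs, here is a minimal sketch in plain Python. It is not DVC's own implementation: the function names (`file_md5`, `snapshot`, `changed_inputs`) are made up for this example, and the use of MD5 only mirrors what I understand DVC to record in dvc.lock by default.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(input_dir):
    """Map every file under input_dir (relative POSIX path) to its checksum."""
    root = Path(input_dir)
    return {
        path.relative_to(root).as_posix(): file_md5(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_inputs(old, new):
    """List paths that were added, removed, or modified between two snapshots."""
    return sorted(key for key in old.keys() | new.keys() if old.get(key) != new.get(key))
```

Comparing a snapshot stored in version control with one taken from the working directory tells us which inputs changed without storing the data itself in Git, which is the role dvc.lock would play in the pipeline.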

In a specific implementation (e.g. moja.belize), users may choose to store their input data on remote storage such as Google Drive, an S3 bucket or Azure Blob. This allows the data to be accessed remotely by anyone who pulls the repository, enabling them to easily reproduce the results. Further, a CML GitHub Action can be configured to reproduce the results on every pull request, allowing us to track changes in data inputs and model results over time.

Implementing this second phase will require that the FLINTcloud Docker images be hosted on a central repository, so that they can be used by the GitHub Actions runner. Actions for setup-cml and setup-dvc are available and can be installed as part of the workflow, meaning that we don't need to add CML or DVC as a dependency of our project.

In summary, we want to:

  • [x] Publish the FLINTcloud images on a container registry
  • [ ] Develop a small run-gcbm.py script that executes the GCBM demo run
  • [ ] Add the run-gcbm stage to DVC
  • [ ] Host the input files on Google Drive
  • [ ] Demonstrate the CML GitHub Action
  • [ ] Document the workflow to be used in other implementations
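
As a starting point for the run-gcbm.py item, a rough sketch of the wrapper could look like the following. The actual GCBM demo command is not specified in this issue, so `command` is a placeholder to be filled in, and the function name `run_gcbm` and the "at least one output file" check are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def run_gcbm(command, output_dir):
    """Run the model command and return the files it wrote to output_dir.

    `command` is a placeholder list of arguments; the real GCBM demo
    invocation is not given in this issue.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError on a non-zero exit code, so a
    # failed model run also fails whatever stage calls this script.
    subprocess.run(command, check=True)
    results = sorted(path for path in out.rglob("*") if path.is_file())
    if not results:
        raise RuntimeError(f"no output files found in {out}")
    return results
```

Keeping the whole run behind one function means the DVC stage only needs a single command, with the input and output directories declared as its dependencies and outputs.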

aornugent avatar Jan 01 '22 23:01 aornugent

Hi, I am Khushi Chaudhary, a contributor to OpenForce 2022. I would like to work on this issue. I will open a PR as soon as I have resolved it. Thank you.

KhushiChaudhary744 avatar Mar 08 '22 17:03 KhushiChaudhary744

Hey, @KhushiChaudhary744!

You can have a look at the recent messages on the #cloud channel of our Slack. People have discussed their findings so far over there, and the community would be happy to help you catch up!

shloka-gupta avatar Mar 08 '22 18:03 shloka-gupta