FLINT.Cloud
Explore DVC and CML for demonstrating continuous, data-driven integration
Reproducible research requires publishing all code and data necessary to derive the original results. Independently replicating results reinforces our belief that the analysis is valid and correct. However, data are often not well supported by existing version control systems.
Data Version Control (DVC) and its companion software, Continuous Machine Learning (CML), provide a framework to continuously reproduce analyses using GitHub Actions. I'd like to demonstrate this functionality using the FLINTcloud images so that other users can incorporate it in their implementations. It is unlikely, however, that we will want this feature to be active on the FLINTcloud repo; rather, the results of this issue should be published as documentation or a case study.
The first step is to develop a `run-gcbm` stage, where the GCBM Demo Run is executed within a single Python script. This stage, together with the input data and the output results, is added to DVC, creating a `dvc.yaml` file and a `dvc.lock` file. When the DVC configuration is pushed to a repository, the checksums of the model inputs are committed to version control, allowing us to track when they change. Additional pre- and post-processing stages can be added to develop a complete DVC pipeline.
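As a rough starting point, the single-script stage could look something like the sketch below. It is only an illustration: the function name `run_gcbm`, the `input` and `output` directory names, and the idea of passing the simulation command in as a list are all assumptions, and the actual command that launches the GCBM Demo Run inside the FLINTcloud image still needs to be filled in.

```python
"""Hypothetical run-gcbm.py helper: wraps one model run so DVC can call it as a stage."""
import subprocess
from pathlib import Path


def run_gcbm(command, input_dir, output_dir):
    """Run `command` and return the sorted list of files found under `output_dir`.

    `command` is a placeholder for whatever launches the GCBM Demo Run;
    it is NOT the real invocation, which is not specified in this issue.
    """
    input_dir = Path(input_dir)
    output_dir = Path(output_dir)

    # Fail early if the tracked inputs are missing (e.g. `dvc pull` was not run).
    if not input_dir.is_dir():
        raise FileNotFoundError(f"input directory not found: {input_dir}")

    # Make sure the directory DVC will track as an output exists.
    output_dir.mkdir(parents=True, exist_ok=True)

    # check=True raises CalledProcessError on a non-zero exit code,
    # so a failed simulation also fails the DVC stage.
    subprocess.run(command, check=True)

    return sorted(p for p in output_dir.rglob("*") if p.is_file())
```

Assuming the script ends by calling `run_gcbm(...)` with the real command, the stage could then be registered with something like `dvc stage add -n run-gcbm -d run-gcbm.py -d input -o output python run-gcbm.py`, which writes the stage to `dvc.yaml`; running `dvc repro` afterwards records the checksums in `dvc.lock`.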
In a specific implementation (e.g. moja.belize), users may choose to store their input data on remote storage such as Google Drive, an S3 bucket or Azure Blob Storage. This allows the data to be accessed remotely by anyone who pulls the repository, enabling them to easily reproduce the results. Further, a CML GitHub Action can be configured to reproduce the results on every pull request, allowing us to track changes in data inputs and model results over time.
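To make the idea of tracking input changes through checksums concrete, here is a minimal sketch of the principle. It is not DVC's implementation: the helper names `file_md5`, `snapshot` and `changed_files` are made up for illustration, and DVC's own hashing (for example of whole directories) differs in detail. It only shows how recording one MD5 per file is enough to tell which inputs changed between two runs.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(directory):
    """Map each file's path (relative to `directory`) to its checksum."""
    directory = Path(directory)
    return {
        p.relative_to(directory).as_posix(): file_md5(p)
        for p in sorted(directory.rglob("*"))
        if p.is_file()
    }


def changed_files(old, new):
    """List files that were added, removed or modified between two snapshots."""
    return sorted(name for name in old.keys() | new.keys() if old.get(name) != new.get(name))
```

Committing such a snapshot alongside the code is, in spirit, what the lock file does: a pull request that touches the inputs shows up as a change in the recorded checksums, which is the signal a CI job can use to decide that results need to be reproduced.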
Implementing this second phase will require that the FLINTcloud Docker images be hosted centrally on a container registry, so that they can be pulled by the GitHub Actions runner. Actions for `setup-cml` and `setup-dvc` are available and can be installed as part of the workflow, meaning that we don't need to add CML or DVC as a dependency of our project.
In summary, we want to:
- [x] Publish the FLINTcloud images on a container registry
- [ ] Develop a small `run-gcbm.py` script that executes the GCBM demo run
- [ ] Add the `run-gcbm` stage to DVC
- [ ] Host the input files on Google Drive
- [ ] Demonstrate the CML GitHub Action
- [ ] Document the workflow to be used in other implementations
Hi, I am Khushi Chaudhary, a contributor to OpenForce 2022. I would like to work on this issue, and I will open a PR as soon as I have resolved it. Thank you.
Hey, @KhushiChaudhary744!
You can have a look at the recent messages on the #cloud channel of our Slack. People have discussed their findings so far there, and the community would be happy to help you catch up!