Continuous deployment with DVC
Scope
We need to make sure that we know when the changes in our source code influence our models / datasets. Without any manual procedures!
Current problems
- We have multiple Dockerfiles that have a version tag of bbsearch in them
- Self-referential
- One needs to build them, run them, and run dvc repro manually
- The tag is bumped up at the discretion of the developer
Proposed solution
GitHub action triggered on each push (a rough sketch follows the list below)
- connect to a container / build a new one on Blue Brain's ML server
- git checkout the given commit
- run dvc repro (or other)
- (dvc metrics diff)
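A rough bash sketch of what the triggered process on the ML server could look like. The repository path, the GIT_COMMIT variable, and the assumption that a DVC remote is already configured are hypothetical details, not part of the proposal above.
set -e                          # abort on the first failing command
cd /opt/bbsearch                # hypothetical working copy on the ML server
git fetch origin
git checkout "$GIT_COMMIT"      # the commit that triggered the action
pip install -r requirements.txt
dvc pull                        # fetch the data/models tracked for this revision
dvc repro                       # re-run the pipelines
dvc metrics diff                # compare metrics against the previous revision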
Notes
The most attainable/reasonable setup would be to use/replicate https://github.com/iterative/cml and just trigger some process on our server with pushes to a branch.
So it turns out that using "Self hosted runners" is not recommended for public repositories. https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners
We recommend that you only use self-hosted runners with private repositories. This is because forks of your repository can potentially run dangerous code on your self-hosted runner machine by creating a pull request that executes the code in a workflow.
I am not sure if we want to use github servers to automatically train or evaluate our models.
See below a script that could be turned into a github action.
What is the goal?
Replace the manual process that we need to go through when reviewing PRs (heavily inspired by #265). Namely
- Check whether all relevant assets (listed in `dvc.lock` files) are available on remote
- Running `dvc repro` does not introduce any difference (`dvc diff` is empty)

In a way, it is like a unit test that makes sure that all potential changes to our models and data have been correctly tracked.
What are the challenges
- We would want this action to be triggered manually somehow (e.g. when a comment on a PR contains a specific substring)
- All the `dvc`-related things would be run on GitHub servers - we need to provide SSH login details for the remote via GitHub secrets (see the sketch after this list)
- We need to be really careful about permissions
  - This action can only be launched by an authorized person
  - Make sure external people (e.g. who forked our repo) cannot trigger the action or see the SSH login details
- It might be really slow (e.g. `dvc pull` will need to download multiple GBs of data and models)
- Potential reproducibility + environment issues (we do not want to run this inside of a docker container)
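A minimal sketch of how the SSH credentials could be wired in from GitHub secrets before any `dvc` command runs. The remote name `storage` and the `DVC_SSH_USER` / `DVC_SSH_KEY` variable names are assumptions, not something defined in this repo.
mkdir -p ~/.ssh
echo "$DVC_SSH_KEY" > ~/.ssh/dvc_remote_key    # private key exposed as a GitHub secret
chmod 600 ~/.ssh/dvc_remote_key
dvc remote modify --local storage user "$DVC_SSH_USER"           # assumed SSH remote named "storage"
dvc remote modify --local storage keyfile ~/.ssh/dvc_remote_key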
What are the benefits
- Big time saver
- We could drop the self-referential retagging process that we currently have - if this action passes we know all model- and data-related changes have been correctly tracked
- If the action fails we can go to the logs and right away identify what pipeline introduced some changes or what files are missing on the remote
Suggested script (WIP)
Before we run the script, a deterministic git revision will be checked out (e.g. the most recent commit of the branch from which we triggered the action).
set -e # if any command exits with a nonzero code the entire script exits too
set -x # print each command before executing it
pip install -r requirements.txt
dvc pull # also checks that everything listed in dvc.lock is on remote
# NER
pushd data_and_models/pipelines/ner/
dvc repro
test -z "$(dvc diff)" # exits with nonzero code if there are any changes
popd
# Sentence embeddings
pushd data_and_models/pipelines/sentence_embedding/
dvc repro
test -z "$(dvc diff)" # exits with nonzero code if there are any changes
popd
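If more pipelines get added, one possible refinement of the script above (just a sketch, not part of the current proposal) would be to loop over the pipeline directories instead of repeating the block:
for pipeline in ner sentence_embedding; do
    pushd "data_and_models/pipelines/$pipeline/"
    dvc repro
    test -z "$(dvc diff)"   # fail if the pipeline introduced any changes
    popd
done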
This is a must-have!
One comment:
Potential reproducibility + environment issues (we do not want to run this inside of a docker container)
Why wouldn't we want this to run inside a Docker container?
Indeed, not running inside a Docker container is:
- the opposite of what was chosen to be done at the moment,
- the opposite of the best practice to ensure reproducibility of environments, as far as I know.
In my opinion, GitHub actions are already run inside of a "container" of some sort. So IMO there is no need to introduce yet another level of nesting.
I also agree that this kind of test needs to be automated. Among all points you mentioned above I'm worried about the following two:
- Can we safely SSH to the DVC remote from GitHub? Is this compliant with the BBP policy?
- Doing a 5GB pull is pretty heavy (see the note after this list).
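One way to soften the second point - not part of the proposal above - could be to check asset availability without downloading anything, and only pull when a full `dvc repro` is actually needed:
dvc status --cloud    # compare the local cache against the remote without transferring data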
While dealing with the latest DVC tests I had the following issues / annoyances:
- Re-building the docker containers takes really long, we have to re-download and re-install all BBS dependencies every time
- Doing a 5GB DVC pull
- When doing repro on sentence embedding, a model had to be downloaded multiple times (transformers?)
- I can't run the container with my own username (errors out)
- `dvc pull` doesn't work out of the box, one needs to manually re-configure.
- I had a huge `git diff` output with files not related to DVC (tests, docs, notebooks, ...)
If what @jankrepl suggests above turns out to be infeasible, then we can think about writing something automated on our servers.
@jankrepl
In my opinion, GitHub actions are already run inside of a "container" of some sort. So IMO there is no need to introduce yet another level of nesting.
The GitHub container might also just change, and the reproduction could then fail because of that change. Maybe one could pin the GitHub container version or similar to ensure reproducibility.
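Independently of pinning the runner image, a small sketch of what the job could record so that environment drift is at least visible when a run starts failing (the snapshot file name is arbitrary):
python --version
pip freeze > environment-snapshot.txt   # exact package versions used in this run
dvc version                             # DVC, Python and platform information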
Concerning the use of GitHub Actions: we cannot have GitHub servers (1) set up a VPN connection with BBP, or (2) pull/push data from BBP servers.
But we will wait for GitLab actions to become available so that we can do this on BBP premises.