
Continuous deployment with DVC

Open FrancescoCasalegno opened this issue 4 years ago • 8 comments

Scope

We need to make sure that we know when changes in our source code influence our models / datasets. Without any manual procedures!

 

Current problems

  • We have multiple Dockerfiles that have a version tag of bbsearch in them
    • Self-referential
    • One needs to build them, run them, and run dvc repro manually
    • The tag is bumped up at the discretion of the developer

 

Proposed solution

Github action triggered on each push

  • connect to a container / build a new one on Blue Brain's ML server
  • git checkout the given commit
  • run dvc repro (or other)  
  • (dvc metrics diff)
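The steps above could be sketched as a single shell function run on the server (a sketch, not an existing workflow; the commit SHA is assumed to be supplied by the trigger, and the requirements file and DVC setup to exist on the checked-out revision):

```shell
#!/bin/bash
# Hypothetical job body for the proposed action (sketch only).
# The commit SHA is assumed to be passed in by the trigger.
run_check() {
    local commit="$1"
    git fetch origin && git checkout "$commit"
    pip install -r requirements.txt
    dvc repro           # re-run the pipelines
    dvc metrics diff    # optionally report how the metrics changed
}
```

The runner side would then just invoke `run_check "$COMMIT_SHA"` inside the chosen container.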

 

Notes

The most attainable/reasonable setup would be to use/replicate https://github.com/iterative/cml and just trigger some process on our server with pushes to a branch.

FrancescoCasalegno avatar Mar 12 '21 13:03 FrancescoCasalegno

So it turns out that using "self-hosted runners" is not recommended for public repositories. https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners

We recommend that you only use self-hosted runners with private repositories. This is because forks of your repository can potentially run dangerous code on your self-hosted runner machine by creating a pull request that executes the code in a workflow.

I am not sure if we want to use GitHub servers to automatically train or evaluate our models.

jankrepl avatar Mar 16 '21 23:03 jankrepl

See below a script that could be turned into a github action.

What is the goal?

Replace the manual process that we need to go through when reviewing PRs (heavily inspired by #265). Namely:

  • Check whether all relevant assets (listed in dvc.lock files) are available on the remote
  • Check that running dvc repro does not introduce any differences (dvc diff is empty)

In a way, it is like a unit test that makes sure that all potential changes to our models and data have been correctly tracked.

What are the challenges?

  • We would want this action to be triggered manually somehow (e.g. when a comment on a PR contains a specific substring)
  • All the dvc-related steps would run on GitHub servers, so we need to provide SSH login details for the remote via GitHub secrets
  • We need to be really careful about permissions
    • This action can only be launched by an authorized person
    • Make sure external people (e.g. who forked our repo) cannot trigger the action or see the SSH login details
  • It might be really slow (e.g. dvc pull will need to download multiple GBs of data and models)
  • Potential reproducibility + environment issues (we do not want to run this inside of a docker container)
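For the permissions point, one option would be a guard step using the GitHub CLI; `check_permission` and the way the commenter's login reaches the script are assumptions, not an existing setup:

```shell
# Sketch: proceed only if the triggering user has write or admin
# permission on the repository. GITHUB_REPOSITORY ("owner/repo") is set
# automatically inside GitHub Actions; the user name would come from the
# triggering event.
check_permission() {
    local user="$1" perm
    perm="$(gh api "repos/$GITHUB_REPOSITORY/collaborators/$user/permission" \
                --jq .permission)"
    [ "$perm" = "admin" ] || [ "$perm" = "write" ]
}
```

A workflow step could run `check_permission` on the commenter and exit early before any step that touches the SSH secrets.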

What are the benefits?

  • Big time saver
  • We could drop the self-referential retagging process that we currently have: if this action passes, we know all model- and data-related changes have been correctly tracked
  • If the action fails, we can go to the logs and immediately identify which pipeline introduced changes or which files are missing on the remote

Suggested script (WIP)

Before we run the script, a deterministic git revision will be checked out (e.g. the most recent commit of the branch from which we triggered the action).

set -e  # exit immediately if any command exits with a nonzero code
set -x  # print each command before executing it

pip install -r requirements.txt
dvc pull  # also checks that everything listed in dvc.lock is on remote 

# NER
pushd data_and_models/pipelines/ner/
dvc repro
test -z "$(dvc diff)"  # exits with nonzero code if there are any changes
popd

# Sentence embeddings
pushd data_and_models/pipelines/sentence_embedding/
dvc repro
test -z "$(dvc diff)"  # exits with nonzero code if there are any changes
popd
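The two per-pipeline blocks above could also be factored into a helper, which keeps the script short as more pipelines are added (a sketch; the pipeline list would have to be kept in sync with the repo layout):

```shell
# verify_pipeline DIR: reproduce the pipeline in DIR and fail if dvc
# reports any resulting changes.
verify_pipeline() {
    pushd "$1" > /dev/null || return 1
    dvc repro && test -z "$(dvc diff)"
    local status=$?
    popd > /dev/null
    return $status
}

# Usage:
# for pipeline in ner sentence_embedding; do
#     verify_pipeline "data_and_models/pipelines/$pipeline"
# done
```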

jankrepl avatar Mar 26 '21 12:03 jankrepl

This is a must-have!

One comment:

Potential reproducibility + environment issues (we do not want to run this inside of a docker container)

Why wouldn't we want this to run inside a Docker container?

Indeed, not running inside a Docker container is:

  1. the opposite of what was chosen to be done at the moment,
  2. the opposite of the best practice to ensure reproducibility of environments, as far as I know.

pafonta avatar Mar 26 '21 13:03 pafonta

Why wouldn't we want this to run inside a Docker container?

In my opinion, GitHub Actions already run inside a "container" of some sort, so there is no need to introduce yet another level of nesting.

jankrepl avatar Mar 26 '21 14:03 jankrepl

See below a script that could be turned into a github action. [...]

I also agree that this kind of test needs to be automated. Among all the points you mentioned above, I'm worried about the following two:

  • Can we safely SSH to the DVC remote from GitHub? Is this compliant with the BBP policy?
  • Doing a 5GB pull is pretty heavy.

Stannislav avatar Mar 26 '21 14:03 Stannislav

While dealing with the latest DVC tests I had the following issues / annoyances:

  1. Re-building the docker containers takes really long; we have to re-download and re-install all BBS dependencies every time
  2. Doing a 5 GB DVC pull
  3. When doing repro on sentence embedding, a model had to be downloaded multiple times (transformers?)
  4. I can't run the container with my own username (it errors out)
  5. dvc pull doesn't work out of the box; one needs to re-configure it manually
  6. I had a huge git diff output with files not related to DVC (tests, docs, notebooks, ...)

If what @jankrepl suggests above turns out to be infeasible, then we can think about writing something automated on our servers.

Stannislav avatar Mar 26 '21 14:03 Stannislav

@jankrepl

In my opinion, GitHub Actions already run inside a "container" of some sort, so there is no need to introduce yet another level of nesting.

The GitHub container might also just change, and the reproduction could fail because of that change. Maybe one could pin the GitHub runner image version or similar to ensure reproducibility.

pafonta avatar Mar 26 '21 15:03 pafonta

Concerning using GitHub Actions: we cannot have GitHub servers (1) set up a VPN connection with BBP, or (2) pull/push data from BBP servers.

But we can wait for GitLab CI to become available to do that on BBP premises.

FrancescoCasalegno avatar Mar 30 '21 14:03 FrancescoCasalegno