When should we update dvc.lock and dvc push in GitHub Actions?
Discussed in https://github.com/iterative/dvc/discussions/6542
Originally posted by Hongbo-Miao on September 6, 2021
Currently, my GitHub Actions workflow looks like this: when I open a pull request (changing some model code / params), CML creates an AWS EC2 instance, and DVC pulls the data.
Here is my current GitHub Actions workflow:
cml-cloud-set-up-cloud:
  name: CML (Cloud) - Set up cloud
  runs-on: ubuntu-20.04
  steps:
    - name: Cancel previous runs
      uses: styfle/[email protected]
      with:
        access_token: ${{ github.token }}
    - name: Checkout
      uses: actions/checkout@v2
    - name: Set up CML
      uses: iterative/setup-cml@v1
    - name: Set up cloud
      shell: bash
      env:
        REPO_TOKEN: ${{ secrets.CML_ACCESS_TOKEN }}
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      run: |
        cml-runner \
          --cloud=aws \
          --cloud-region=us-west-2 \
          --cloud-type=t2.small \
          --labels=cml-runner
cml-cloud-train:
  name: CML (Cloud) - Train
  needs: cml-cloud-set-up-cloud
  runs-on: [self-hosted, cml-runner]
  # container: docker://iterativeai/cml:0-dvc2-base1-gpu
  container: docker://iterativeai/cml:0-dvc2-base1
  steps:
    - name: Cancel previous runs
      uses: styfle/[email protected]
      with:
        access_token: ${{ github.token }}
    - name: Checkout
      uses: actions/checkout@v2
    - name: Set up Miniconda
      uses: conda-incubator/setup-miniconda@v2
      with:
        miniconda-version: "latest"
        activate-environment: hm-cnn
    - name: Install requirements
      working-directory: convolutional-neural-network
      shell: bash -l {0}
      run: |
        conda install pytorch torchvision torchaudio --channel=pytorch
        conda install pandas
        conda install tabulate
        pip install -r requirements.txt
    - name: Pull Data
      working-directory: convolutional-neural-network
      env:
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      run: |
        dvc pull
    - name: Train model
      working-directory: convolutional-neural-network
      shell: bash -l {0}
      env:
        WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
      run: |
        dvc repro
    - name: Write CML report
      working-directory: convolutional-neural-network
      shell: bash -l {0}
      env:
        REPO_TOKEN: ${{ secrets.CML_ACCESS_TOKEN }}
      run: |
        echo "# CML (Cloud) Report" >> report.md
        echo "## Params" >> report.md
        cat output/reports/params.txt >> report.md
        cml-send-comment report.md
My dvc.yaml looks like this:
stages:
  prepare:
    cmd: tar -xf data/raw/cifar-10-python.tar.gz --dir=data/processed
    deps:
      - data/raw/cifar-10-python.tar.gz
    outs:
      - data/processed/cifar-10-batches-py/
  main:
    cmd: python main.py
    deps:
      - data/processed/cifar-10-batches-py/
      - evaluate.py
      - main.py
      - model/
      - train.py
    params:
      - lr
      - train.epochs
    outs:
      - output/models/model.pt
After training, if I think the change is good because the performance is better based on the reports:
- the dvc.lock needs to be updated, and
- the new model model.pt needs to be uploaded to AWS S3, in my case.
My question is, after dvc repro, am I supposed to add dvc push and then commit? Something like
- name: Train model
  working-directory: convolutional-neural-network
  shell: bash -l {0}
  env:
    WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
  run: |
    dvc repro
    dvc push                               # newly added
    git add .                              # newly added
    git commit -m "update dvc.lock, etc."  # newly added
    git push origin current_pr             # newly added; need to somehow get the current pull request branch name
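For reference, a rough sketch of what such a step could look like, assuming the workflow is triggered by a pull_request event (so GITHUB_HEAD_REF holds the PR's source branch) and that the token persisted by actions/checkout is allowed to push to that branch. The AWS credentials are included only because dvc push needs them here; the bot identity and commit message are placeholders, not a recommendation:

- name: Train model and commit results
  working-directory: convolutional-neural-network
  shell: bash -l {0}
  env:
    WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  run: |
    # actions/checkout leaves a detached merge commit on pull_request events,
    # so switch to the PR's source branch before reproducing the pipeline
    git fetch origin "$GITHUB_HEAD_REF"
    git checkout "$GITHUB_HEAD_REF"
    dvc repro
    dvc push                                         # upload new cache objects (e.g. model.pt) to the S3 remote
    git config user.name "github-actions[bot]"       # placeholder CI identity
    git config user.email "github-actions[bot]@users.noreply.github.com"
    git add dvc.lock
    git commit -m "Update dvc.lock after dvc repro" || echo "Nothing to commit"
    git push origin "$GITHUB_HEAD_REF"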
The method above applies while the pull request is open. However, I feel the best moment to push would be when I decide to merge, because that is when I know the pull request actually improves the machine learning performance. But by that moment, the EC2 instance has already been destroyed.
Any suggestion? Thanks!
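One way to cover the merge moment, sketched here purely as an illustration: trigger a second workflow on push to the default branch (assumed to be main below), provision a fresh runner there the same way the pull-request workflow does, and run dvc repro / dvc push on the merged code. All names are illustrative:

# Illustrative sketch only. This workflow would also need a provisioning job
# like cml-cloud-set-up-cloud above (omitted here) so that a fresh EC2 runner
# exists at merge time.
name: train-on-merge
on:
  push:
    branches: [main]              # assumed default branch name
jobs:
  retrain-and-push:
    runs-on: [self-hosted, cml-runner]
    container: docker://iterativeai/cml:0-dvc2-base1
    steps:
      - uses: actions/checkout@v2
      - name: Reproduce pipeline and push outputs
        working-directory: convolutional-neural-network
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull          # fetch the existing data and cache from S3
          dvc repro         # re-run the pipeline on the merged code
          dvc push          # upload the resulting outputs (e.g. model.pt) to S3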
See previous discussion comments:
https://github.com/iterative/dvc/discussions/6542#discussioncomment-1287600
Just copying another comment here:
I am more curious about the recommended way for this demo https://github.com/iterative/cml_cloud_case/blob/master/.github/workflows/cml.yaml, as it is using an AWS EC2 instance. However, its current GitHub Actions workflow does not do something like
dvc push
cml-pr .
In the experiment pull request, it also does not update dvc.lock and related files, which is why I came up with this question. (If we use an AWS EC2 instance all the time based on this approach, dvc.lock will never get updated.)
Would be great to have some recommendations! 😃
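For concreteness, a rough sketch of what adding those two commands to a training step could look like. The step name is illustrative, and REPO_TOKEN is assumed to be the credential cml-pr uses to open the pull request:

- name: Train model and open results PR
  working-directory: convolutional-neural-network
  shell: bash -l {0}
  env:
    WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    REPO_TOKEN: ${{ secrets.CML_ACCESS_TOKEN }}
  run: |
    dvc repro
    dvc push     # upload new cache objects (e.g. model.pt) to the S3 remote
    cml-pr .     # commit the changed files (dvc.lock, etc.) on a new branch and open a pull request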
To be more specific, it would be great to cover these scenarios without messing up DVC:
- Open a pull request, the experiment result looks good, then merge.
- Open a pull request, the experiment result does not look good, so force-push an update. The second time, the experiment looks good, then merge.
- Open a pull request, the experiment result does not look good, so add another commit to update it. The second time, the experiment looks good, then squash merge.
Another option that I was taught is to use DVC's run-cache.
In the CML runner:
dvc repro
dvc push --run-cache
On the local machine:
dvc pull --run-cache
dvc repro --pull
# any further checks or analysis of results
git add .
git commit -m "commit experiment"
This is less automated than the new cml-pr, but it has the benefit that the developers are the ones making the commits, if that is important.
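As an illustration of how that maps onto the workflow above, only the training step would change: the AWS credentials are added because dvc push now runs from this step, and --run-cache uploads the run-cache entries alongside the regular outputs so a local dvc repro --pull can reuse them:

- name: Train model
  working-directory: convolutional-neural-network
  shell: bash -l {0}
  env:
    WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  run: |
    dvc repro
    dvc push --run-cache   # upload outputs plus run-cache entries to the S3 remote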