
Allow nightly/stable/ad-hoc ETL builds

jdangerx opened this issue on Jan 03, 2024

Problem

Right now, our nightly builds:

  • do not send logs to Cloud Logging in a timely or complete way, which makes debugging anything beyond the ETL logs we happen to have saved very difficult. See the container/image issues we ran into in #3183 and #3195.
  • compete for one of two VMs, limiting concurrency - if a tag build is already running, the next tag build will trample all over it.

We want to be able to run the ETL, validate its outputs, publish artifacts, and update the nightly/stable branches.

We'd also like to be able to correlate the build artifacts with the code version that generated them.

Finally, we'd like to be able to change some behavior based on whether it's a nightly, stable, or ad-hoc run:

Nightly:

  • publish build artifacts to an internal cache on GCS, AWS, and Datasette
  • update nightly branch
  • publish new PUDL data version on Zenodo sandbox

Stable:

  • publish build artifacts to GCS and AWS, but not Datasette
  • update stable branch
  • publish new PUDL data version on Zenodo production

Ad-hoc:

  • only publish build artifacts to GCS
  • potentially run with different ETL configuration files
  • do not update any Git branches
  • do not publish PUDL data version on Zenodo

Success Criteria

  • [ ] nightly builds should kick off automatically each night, and be re-runnable manually
  • [ ] stable builds should kick off automatically whenever a v20XX.XX.XX tag is pushed, and be re-runnable manually
  • [ ] ad-hoc builds should be triggered manually, targeting an arbitrary branch/tag
  • [ ] nightly/stable/ad-hoc builds do the right publication behavior as defined above

Technical Design

We'll still kick off the build process with a GitHub Actions workflow, which will configure/submit a Google Batch job. The Batch job will run a build script within our existing Docker container.

GHA workflow

  • build docker image
  • using the GHA context (GitHub ref, triggering event, workflow dispatch inputs), set specific settings in the Google Batch job description as env vars (sketched after this list):
    • ETL configuration
      • path to configuration YML file - workflow dispatch input
    • publication settings
      • GCS namespace (nightly/stable/ad-hoc) - chosen based on the triggering event and tag name
      • AWS namespace (nightly/stable/none)
      • do/don't publish to Datasette
    • Git settings
      • git branch to update (nightly/stable/none)
      • current git ref
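
For concreteness, here's a minimal sketch of the dispatch logic that workflow step could run. The function name, setting names, and the exact trigger-to-build-type mapping are illustrative assumptions, not settled design (GITHUB_REF and GITHUB_EVENT_NAME are standard GHA-provided env vars):

import os

def build_settings() -> dict:
    """Map GHA context env vars to Batch job settings (names are hypothetical)."""
    ref = os.environ.get("GITHUB_REF", "")           # e.g., refs/tags/v2024.1.1
    event = os.environ.get("GITHUB_EVENT_NAME", "")  # e.g., schedule, push, workflow_dispatch
    if event == "schedule":
        build_type = "nightly"
    elif ref.startswith("refs/tags/v"):
        build_type = "stable"
    else:
        build_type = "ad-hoc"
    return {
        "BUILD_TYPE": build_type,
        "GCS_NAMESPACE": build_type,
        "AWS_NAMESPACE": build_type if build_type in ("nightly", "stable") else "none",
        "PUBLISH_DATASETTE": str(build_type == "nightly"),
        "GIT_TARGET_BRANCH": build_type if build_type in ("nightly", "stable") else "none",
        "CURRENT_GIT_REF": ref,
    }

These values would then be templated into the Batch job description below.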

Google Batch job description

This will have to be generated dynamically by a Python script that renders the various settings from the GHA context into a JSON file; a sketch follows the example description below.

The secrets will be kept in Google Secret Manager so that we don't have to pass them around.

The non-secret settings will be passed into the main script as CLI args via the commands array.

{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "container": {
              "imageUri": "docker.io/catalystcoop/pudl-etl:<TAG>",
              "commands": [
                "micromamba",
                ...
              ]
            },
            "environment": {
              "secretVariables": {
                "PUDL_BOT_PAT": "projects/PROJECT_ID/secrets/SECRET_NAME/versions/VERSION",
                ...
              }
            }
          }
        ]
      }
    }
  ],
  "allocationPolicy": {
    "service_account": {
      "email": "some-special-service-account"
    }
  },
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
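
A minimal sketch of that generator, assuming the settings arrive as env vars and that the non-secret values become CLI args in the commands array. The script structure, env var names, and the exact micromamba invocation are all illustrative, not final:

import json
import os

def make_job_description(settings: dict) -> dict:
    """Render GHA-derived settings into a Batch job description dict."""
    return {
        "taskGroups": [{
            "taskSpec": {
                "runnables": [{
                    "container": {
                        "imageUri": f"docker.io/catalystcoop/pudl-etl:{settings['IMAGE_TAG']}",
                        # Non-secret settings travel as CLI args.
                        "commands": [
                            "micromamba", "run", "-n", "pudl-dev",
                            "./run_the_dang_build.py",
                            "--gcs-dest", settings["GCS_NAMESPACE"],
                            "--git-target-branch", settings["GIT_TARGET_BRANCH"],
                            "--current-git-ref", settings["CURRENT_GIT_REF"],
                        ],
                    },
                    "environment": {
                        # Batch resolves these from Secret Manager at runtime; the
                        # value is the secret's full resource name.
                        "secretVariables": {
                            "PUDL_BOT_PAT": settings["PUDL_BOT_PAT_SECRET"],
                        }
                    },
                }]
            }
        }],
        "allocationPolicy": {"serviceAccount": {"email": "some-special-service-account"}},
        "logsPolicy": {"destination": "CLOUD_LOGGING"},
    }

if __name__ == "__main__":
    # The GHA workflow exports the settings as env vars before calling this.
    with open("batch_job.json", "w") as f:
        json.dump(make_job_description(dict(os.environ)), f, indent=2)

The resulting file could then be handed off with something like gcloud batch jobs submit <job-id> --location <region> --config batch_job.json.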

In-container script

This will be a Python script that replicates the functionality of gcp_pudl_etl.sh, except that it takes command-line arguments corresponding to the actual behaviors we want to customize, instead of inferring them from the GH ref, etc. inside the script, as gcp_pudl_etl.sh does.

Example call:

$ ./run_the_dang_build.py --etl-config-file /path/to/etl_fast.yml --gcs-dest gs://foo --aws-dest s3://... --do-publish-datasette --current-git-ref nightly-YYYY-MM-DD --git-target-branch nightly

It will pick up secrets from environment variables.

The script will (see the skeleton after this list):

  1. spin up a local postgres instance to use for Dagster
  2. run the ETL with Dagster
  3. run our unit/integration/validation tests
  4. compress our outputs
  5. publish outputs to whatever locations are configured: GCS internal bucket, GCS/AWS distribution buckets, Datasette, Zenodo
  6. update a git branch, maybe
  7. send a notification to the #pudl-deployments Slack channel
  8. create an issue for debugging nightly/stable build failures
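
A skeleton of what that could look like, with argparse flags mirroring the example call above. The step functions here are empty hypothetical stubs standing in for the real build logic:

#!/usr/bin/env python
"""Hypothetical skeleton for run_the_dang_build.py."""
import argparse
import os

# Stubs standing in for the real build steps.
def start_postgres() -> None: ...
def run_etl(config_file: str) -> None: ...
def run_tests() -> None: ...
def compress_outputs() -> None: ...
def publish_outputs(args: argparse.Namespace) -> None: ...
def update_branch(branch: str, token: str) -> None: ...
def notify_slack(channel: str) -> None: ...

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run a PUDL ETL build.")
    parser.add_argument("--etl-config-file", required=True)
    parser.add_argument("--gcs-dest", required=True)
    parser.add_argument("--aws-dest")
    parser.add_argument("--do-publish-datasette", action="store_true")
    parser.add_argument("--current-git-ref", required=True)
    parser.add_argument(
        "--git-target-branch", choices=["nightly", "stable", "none"], default="none"
    )
    return parser.parse_args()

def main() -> None:
    args = parse_args()
    token = os.environ.get("PUDL_BOT_PAT", "")  # injected by Batch from Secret Manager
    start_postgres()               # 1. local postgres for Dagster
    run_etl(args.etl_config_file)  # 2. run the ETL with Dagster
    run_tests()                    # 3. unit/integration/validation tests
    compress_outputs()             # 4. compress outputs
    publish_outputs(args)          # 5. GCS/AWS/Datasette/Zenodo as configured
    if args.git_target_branch != "none":
        update_branch(args.git_target_branch, token)  # 6. update git branch
    notify_slack("#pudl-deployments")  # 7. build status notification
    # 8. on failure, create an issue for debugging (error handling omitted here)

if __name__ == "__main__":
    main()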

Not in scope

We will not tackle all of the log consolidation in this issue. Using Google Batch will consolidate many logs, but we won't attempt to do more than that incidental improvement here.

Tasks / order of operations

There are a bunch of things that have to happen - here's the order in which we should do them so that we get something useful out fast.

## Subtasks
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3208
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3209
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3210
