# Allow nightly/stable/ad-hoc ETL builds
## Problem
Right now, our nightly builds:
- do not send logs to Cloud Logging in a timely or complete way - this makes debugging anything outside of the ETL logs we save very difficult. See the container/image issues we ran into in #3183 and #3195.
- compete for one of two VMs, limiting concurrency - if someone has already started a tag build, the next tag build will trample all over it.
We want to be able to run the ETL, validate its outputs, publish artifacts, and update the `nightly`/`stable` branches.
We'd also like to be able to correlate the build artifacts with the code version that generated them.
Finally, we'd like to be able to change some behavior based on whether it's a nightly, stable, or ad-hoc run (see the settings sketch after these lists):
Nightly:
- publish build artifacts to an internal cache on GCS, AWS, and datasette
- update `nightly` branch
- publish new PUDL data version on Zenodo sandbox
Stable:
- publish build artifacts to GCS and AWS, but not datasette
- update `stable` branch
- publish new PUDL data version on Zenodo production
Ad-hoc:
- only publish build artifacts to GCS
- potentially run with different ETL configuration files
- do not update any Git branches
- do not publish PUDL data version on Zenodo
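These three behavior profiles amount to a small settings table. Here's a minimal sketch of how they might be encoded, assuming a hypothetical `BuildSettings` dataclass and illustrative bucket names (none of these identifiers are final):

```python
from dataclasses import dataclass

@dataclass
class BuildSettings:
    """Publication behavior for one build type. All names here are illustrative."""
    gcs_dest: str | None           # GCS bucket/prefix for build outputs
    aws_dest: str | None           # S3 bucket/prefix for build outputs
    publish_datasette: bool        # push outputs to Datasette?
    git_target_branch: str | None  # branch to update on success, if any
    zenodo_env: str | None         # "sandbox", "production", or None

BUILD_SETTINGS = {
    "nightly": BuildSettings(
        gcs_dest="gs://internal-build-cache/nightly",  # hypothetical bucket
        aws_dest="s3://pudl-dist/nightly",             # hypothetical bucket
        publish_datasette=True,
        git_target_branch="nightly",
        zenodo_env="sandbox",
    ),
    "stable": BuildSettings(
        gcs_dest="gs://pudl-dist/stable",              # hypothetical bucket
        aws_dest="s3://pudl-dist/stable",              # hypothetical bucket
        publish_datasette=False,
        git_target_branch="stable",
        zenodo_env="production",
    ),
    "ad-hoc": BuildSettings(
        gcs_dest="gs://internal-build-cache/ad-hoc",   # hypothetical bucket
        aws_dest=None,                                 # ad-hoc: GCS only
        publish_datasette=False,
        git_target_branch=None,                        # never move branches
        zenodo_env=None,                               # never touch Zenodo
    ),
}
```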
## Success Criteria
- [ ] nightly builds should kick off automatically each night, and be re-runnable manually
- [ ] stable builds should kick off automatically whenever a `v20XX.XX.XX` tag is pushed, and be re-runnable manually
- [ ] ad-hoc builds should be triggered manually, targeting an arbitrary branch/tag
- [ ] nightly/stable/ad-hoc builds perform the correct publication behavior as defined above
## Technical Design
We'll still kick off the build process with a GitHub Actions workflow, which will configure/submit a Google Batch job. The Batch job will run a build script within our existing Docker container.
### GHA workflow
- build docker image
- using GHA context, such as github ref / triggering event / workflow dispatch inputs, set specific settings in the Google Batch job description as env vars (see the sketch after this list):
  - ETL configuration
    - path to configuration YML file (workflow dispatch input)
  - publication settings
    - GCS namespace (nightly/stable/ad-hoc) - choose `nightly`, `stable`, or `ad-hoc` based on the tag name
    - AWS namespace (nightly/stable/none)
    - do/don't publish to Datasette
  - Git settings
    - git branch to update (nightly/stable/none)
    - current git ref
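A minimal sketch of how the workflow might map the GHA context onto a build type, assuming the ref and event name are passed to a helper (the function name and regex are illustrative):

```python
import re

def choose_build_type(github_ref: str, event_name: str) -> str:
    """Map the GHA context onto nightly/stable/ad-hoc (illustrative logic)."""
    if event_name == "schedule":
        return "nightly"
    # A pushed v20XX.XX.XX tag signals a stable release build.
    if re.fullmatch(r"refs/tags/v20\d{2}\.\d{1,2}\.\d{1,2}", github_ref):
        return "stable"
    # Anything else (e.g. workflow_dispatch on a branch) is ad-hoc.
    return "ad-hoc"
```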
### Google Batch job description
This will have to be generated dynamically by a Python script that passes the various settings from the GHA context into a JSON file.
The secrets will be kept in Google Secret Manager so that we don't have to pass them around.
The non-secret settings will be passed into the main script as CLI args via the `commands` array.
```json
{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "container": {
              "imageUri": "docker.io/catalystcoop/pudl-etl:<TAG>",
              "commands": [
                "micromamba",
                ...
              ]
            },
            "environment": {
              "secretVariables": {
                "PUDL_BOT_PAT": "projects/PROJECT_ID/secrets/SECRET_NAME/versions/VERSION",
                ...
              }
            }
          }
        ]
      }
    }
  ],
  "allocationPolicy": {
    "serviceAccount": {
      "email": "some-special-service-account"
    }
  },
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
```
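As a minimal sketch, the generator script might fill this template from the GHA-derived settings and write it to a file for `gcloud batch jobs submit --config` (all names below are illustrative, not final):

```python
import json

def render_batch_job(image_tag: str, commands: list[str], secret_refs: dict[str, str]) -> dict:
    """Build the Batch job description from GHA-derived settings (illustrative)."""
    return {
        "taskGroups": [
            {
                "taskSpec": {
                    "runnables": [
                        {
                            "container": {
                                "imageUri": f"docker.io/catalystcoop/pudl-etl:{image_tag}",
                                "commands": commands,
                            },
                            "environment": {"secretVariables": secret_refs},
                        }
                    ]
                }
            }
        ],
        "allocationPolicy": {"serviceAccount": {"email": "some-special-service-account"}},
        "logsPolicy": {"destination": "CLOUD_LOGGING"},
    }

if __name__ == "__main__":
    job = render_batch_job(
        image_tag="nightly-2024-01-15",  # hypothetical tag
        commands=["micromamba", "run", "./run_the_dang_build.py", "--gcs-dest", "gs://foo"],
        secret_refs={"PUDL_BOT_PAT": "projects/PROJECT_ID/secrets/SECRET_NAME/versions/VERSION"},
    )
    with open("batch_job.json", "w") as f:
        json.dump(job, f, indent=2)
    # Then: gcloud batch jobs submit JOB_NAME --location=us-central1 --config=batch_job.json
```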
### In-container script
This will be a Python script that replicates the functionality of `gcp_pudl_etl.sh`, except that it takes command-line arguments that correspond to the actual behaviors we want to customize (as opposed to inferring them from the GH ref, etc., as `gcp_pudl_etl.sh` does).
Example call:

```bash
$ ./run_the_dang_build.py --etl-config-file /path/to/etl_fast.yml --gcs-dest gs://foo --aws-dest s3://... --do-publish-datasette --current-git-ref nightly-YYYY-MM-DD --git-target-branch nightly
```
It will pick up secrets from the environment variables.
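A minimal sketch of that CLI surface, mirroring the example call above (the flag names match the example; everything else is illustrative):

```python
import argparse
import os

def parse_args() -> argparse.Namespace:
    """CLI mirroring the example call above; exact flag set is illustrative."""
    parser = argparse.ArgumentParser(description="Run a PUDL ETL build.")
    parser.add_argument("--etl-config-file", required=True, help="Path to ETL settings YML.")
    parser.add_argument("--gcs-dest", help="GCS bucket/prefix for outputs, e.g. gs://foo.")
    parser.add_argument("--aws-dest", help="S3 bucket/prefix for outputs.")
    parser.add_argument("--do-publish-datasette", action="store_true")
    parser.add_argument("--current-git-ref", required=True)
    parser.add_argument("--git-target-branch", help="Branch to update on success, if any.")
    return parser.parse_args()

# Secrets arrive via the environment, injected by Batch's secretVariables.
PUDL_BOT_PAT = os.environ.get("PUDL_BOT_PAT")
```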
The script will (see the sketch after this list):
- spin up a local postgres instance to use for Dagster
- run the ETL with Dagster
- run our unit/integration/validation tests
- compress our outputs
- publish outputs to whatever locations are configured: GCS internal bucket, GCS/AWS distribution buckets, datasette, zenodo
- update a git branch, if one is configured
- send a notification to the #pudl-deployments slack channel
- create an issue for debugging nightly/stable build failures
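Here's a minimal sketch of the top-level flow, building on the `parse_args` sketch above; every helper called here is hypothetical:

```python
def main() -> int:
    """Orchestrate one build; all helper functions below are hypothetical."""
    args = parse_args()
    try:
        start_postgres_for_dagster()       # local postgres backing Dagster
        run_etl(args.etl_config_file)      # materialize assets with Dagster
        run_tests()                        # unit/integration/validation tests
        archive = compress_outputs()
        publish(archive, gcs=args.gcs_dest, aws=args.aws_dest,
                datasette=args.do_publish_datasette)
        if args.git_target_branch:
            update_branch(args.git_target_branch, args.current_git_ref)
    except Exception as err:
        create_debugging_issue(err)        # nightly/stable failures only
        notify_slack("#pudl-deployments", f"build failed: {err}")
        return 1
    notify_slack("#pudl-deployments", "build succeeded")
    return 0
```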
## Not in scope
We will not tackle all of the log consolidation in this issue. Using Google Batch will consolidate many logs, but we won't attempt to do more than that incidental improvement here.
## Tasks / order of operations
There are a bunch of things that have to happen - here's the order in which we should do them so that we get something useful out fast.
## Subtasks
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3208
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3209
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3210