# Allow nightly/stable/ad-hoc ETL builds
## Problem
Right now, our nightly builds:
- do not send logs to Cloud Logging in a timely or complete way - this makes debugging anything outside of the ETL logs we save very difficult. See the container/image issues we ran into in #3183 and #3195.
- compete for one of two VMs, limiting concurrency - if someone has already started a tag build, the next tag build will trample all over it.
We want to be able to run the ETL, validate its outputs, publish artifacts, and update the `nightly`/`stable` branches.
We'd also like to be able to correlate the build artifacts with the code version that generated them.
Finally, we'd like to be able to change some behavior based on whether it's a nightly, stable, or ad-hoc run (see the settings sketch after these lists):
Nightly:
- publish build artifacts to an internal cache on GCS, AWS, and datasette
- update `nightly` branch
- publish new PUDL data version on Zenodo sandbox
Stable:
- publish build artifacts to GCS and AWS, but not datasette
- update `stable` branch
- publish new PUDL data version on Zenodo production
Ad-hoc:
- only publish build artifacts to GCS
- potentially run with different ETL configuration files
- do not update any Git branches
- do not publish PUDL data version on Zenodo
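These three behavior profiles amount to a small settings table. Here's a minimal sketch of how they might be encoded, assuming a hypothetical `BuildSettings` dataclass and illustrative bucket names (none of these identifiers are final):

```python
from dataclasses import dataclass

@dataclass
class BuildSettings:
    """Publication behavior for one build type. All names here are illustrative."""
    gcs_dest: str | None           # GCS bucket/prefix for build outputs
    aws_dest: str | None           # S3 bucket/prefix for build outputs
    publish_datasette: bool        # push outputs to Datasette?
    git_target_branch: str | None  # branch to update on success, if any
    zenodo_env: str | None         # "sandbox", "production", or None

BUILD_SETTINGS = {
    "nightly": BuildSettings(
        gcs_dest="gs://internal-build-cache/nightly",  # hypothetical bucket
        aws_dest="s3://pudl-dist/nightly",             # hypothetical bucket
        publish_datasette=True,
        git_target_branch="nightly",
        zenodo_env="sandbox",
    ),
    "stable": BuildSettings(
        gcs_dest="gs://pudl-dist/stable",              # hypothetical bucket
        aws_dest="s3://pudl-dist/stable",              # hypothetical bucket
        publish_datasette=False,
        git_target_branch="stable",
        zenodo_env="production",
    ),
    "ad-hoc": BuildSettings(
        gcs_dest="gs://internal-build-cache/ad-hoc",   # hypothetical bucket
        aws_dest=None,                                 # ad-hoc: GCS only
        publish_datasette=False,
        git_target_branch=None,                        # never move branches
        zenodo_env=None,                               # never touch Zenodo
    ),
}
```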
## Success Criteria
- [ ] nightly builds should kick off automatically each night, and be re-runnable manually
- [ ] stable builds should kick off automatically whenever a `v20XX.XX.XX` tag is pushed, and be re-runnable manually
- [ ] ad-hoc builds should be triggered manually, targeting an arbitrary branch/tag
- [ ] nightly/stable/ad-hoc builds perform the correct publication behavior as defined above
## Technical Design
We'll still kick off the build process with a GitHub Actions workflow, which will configure/submit a Google Batch job. The Batch job will run a build script within our existing Docker container.
### GHA workflow
- build docker image
- using GHA context, such as github ref / triggering event / workflow dispatch inputs, set specific settings in the Google Batch job description as env vars (see the sketch after this list):
  - ETL configuration
    - path to configuration YML file (workflow dispatch input)
  - publication settings
    - GCS namespace (nightly/stable/ad-hoc) - choose `nightly`, `stable`, or `ad-hoc` based on the tag name
    - AWS namespace (nightly/stable/none)
    - do/don't publish to Datasette
  - Git settings
    - git branch to update (nightly/stable/none)
    - current git ref
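A minimal sketch of how the workflow might map the GHA context onto a build type, assuming the ref and event name are passed to a helper (the function name and regex are illustrative):

```python
import re

def choose_build_type(github_ref: str, event_name: str) -> str:
    """Map the GHA context onto nightly/stable/ad-hoc (illustrative logic)."""
    if event_name == "schedule":
        return "nightly"
    # A pushed v20XX.XX.XX tag signals a stable release build.
    if re.fullmatch(r"refs/tags/v20\d{2}\.\d{1,2}\.\d{1,2}", github_ref):
        return "stable"
    # Anything else (e.g. workflow_dispatch on a branch) is ad-hoc.
    return "ad-hoc"
```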
### Google Batch job description
This will have to be generated dynamically by a Python script that passes the various settings from the GHA context into a JSON file.
The secrets will be kept in Google Secret Manager so that we don't have to pass them around.
The non-secret settings will be passed into the main script as CLI args via the `commands` array.
```json
{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "container": {
              "imageUri": "docker.io/catalystcoop/pudl-etl:<TAG>",
              "commands": [
                "micromamba",
                ...
              ]
            },
            "environment": {
              "secretVariables": {
                "PUDL_BOT_PAT": "projects/PROJECT_ID/secrets/SECRET_NAME/versions/VERSION",
                ...
              }
            }
          }
        ]
      }
    }
  ],
  "allocationPolicy": {
    "serviceAccount": {
      "email": "some-special-service-account"
    }
  },
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
```
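As a minimal sketch, the generator script might fill this template from the GHA-derived settings and write it to a file for `gcloud batch jobs submit --config` (all names below are illustrative, not final):

```python
import json

def render_batch_job(image_tag: str, commands: list[str], secret_refs: dict[str, str]) -> dict:
    """Build the Batch job description from GHA-derived settings (illustrative)."""
    return {
        "taskGroups": [
            {
                "taskSpec": {
                    "runnables": [
                        {
                            "container": {
                                "imageUri": f"docker.io/catalystcoop/pudl-etl:{image_tag}",
                                "commands": commands,
                            },
                            "environment": {"secretVariables": secret_refs},
                        }
                    ]
                }
            }
        ],
        "allocationPolicy": {"serviceAccount": {"email": "some-special-service-account"}},
        "logsPolicy": {"destination": "CLOUD_LOGGING"},
    }

if __name__ == "__main__":
    job = render_batch_job(
        image_tag="nightly-2024-01-15",  # hypothetical tag
        commands=["micromamba", "run", "./run_the_dang_build.py", "--gcs-dest", "gs://foo"],
        secret_refs={"PUDL_BOT_PAT": "projects/PROJECT_ID/secrets/SECRET_NAME/versions/VERSION"},
    )
    with open("batch_job.json", "w") as f:
        json.dump(job, f, indent=2)
    # Then: gcloud batch jobs submit JOB_NAME --location=us-central1 --config=batch_job.json
```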
### In-container script
This will be a Python script that replicates the functionality of `gcp_pudl_etl.sh`, except that it takes command-line arguments that correspond to the actual behaviors we want to customize (as opposed to inferring them from the GH ref, etc., as `gcp_pudl_etl.sh` does).
Example call:

```bash
$ ./run_the_dang_build.py --etl-config-file /path/to/etl_fast.yml --gcs-dest gs://foo --aws-dest s3://... --do-publish-datasette --current-git-ref nightly-YYYY-MM-DD --git-target-branch nightly
```
It will pick up secrets from the environment variables.
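A minimal sketch of that CLI surface, mirroring the example call above (the flag names match the example; everything else is illustrative):

```python
import argparse
import os

def parse_args() -> argparse.Namespace:
    """CLI mirroring the example call above; exact flag set is illustrative."""
    parser = argparse.ArgumentParser(description="Run a PUDL ETL build.")
    parser.add_argument("--etl-config-file", required=True, help="Path to ETL settings YML.")
    parser.add_argument("--gcs-dest", help="GCS bucket/prefix for outputs, e.g. gs://foo.")
    parser.add_argument("--aws-dest", help="S3 bucket/prefix for outputs.")
    parser.add_argument("--do-publish-datasette", action="store_true")
    parser.add_argument("--current-git-ref", required=True)
    parser.add_argument("--git-target-branch", help="Branch to update on success, if any.")
    return parser.parse_args()

# Secrets arrive via the environment, injected by Batch's secretVariables.
PUDL_BOT_PAT = os.environ.get("PUDL_BOT_PAT")
```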
The script will (see the sketch after this list):
- spin up a local postgres instance to use for Dagster
- run the ETL with Dagster
- run our unit/integration/validation tests
- compress our outputs
- publish outputs to whatever locations are configured: GCS internal bucket, GCS/AWS distribution buckets, datasette, zenodo
- update a git branch, if one is configured
- send a notification to the #pudl-deployments slack channel
- create an issue for debugging nightly/stable build failures
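Here's a minimal sketch of the top-level flow, building on the `parse_args` sketch above; every helper called here is hypothetical:

```python
def main() -> int:
    """Orchestrate one build; all helper functions below are hypothetical."""
    args = parse_args()
    try:
        start_postgres_for_dagster()       # local postgres backing Dagster
        run_etl(args.etl_config_file)      # materialize assets with Dagster
        run_tests()                        # unit/integration/validation tests
        archive = compress_outputs()
        publish(archive, gcs=args.gcs_dest, aws=args.aws_dest,
                datasette=args.do_publish_datasette)
        if args.git_target_branch:
            update_branch(args.git_target_branch, args.current_git_ref)
    except Exception as err:
        create_debugging_issue(err)        # nightly/stable failures only
        notify_slack("#pudl-deployments", f"build failed: {err}")
        return 1
    notify_slack("#pudl-deployments", "build succeeded")
    return 0
```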
## Not in scope
We will not tackle all of the log consolidation in this issue. Using Google Batch will consolidate many logs, but we won't attempt to do more than that incidental improvement here.
## Tasks / order of operations
There are a bunch of things that have to happen - here's the order in which we should do them so that we get something useful out fast.
## Subtasks
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3208
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3209
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3210