Run nightly build on feature branches
Our nightly builds are more reliable and allow us to catch errors in the ETL when new code has been merged into `dev`. This is great, but it would be nice to know about the error before code is merged into `dev` so we don't have to chase down the PR responsible for breaking the nightly builds on `dev`.
There are two GitHub Actions triggers that might allow us to do this:

- `workflow_dispatch`: This trigger allows you to run an action manually and specify the branch to use. When someone is ready to merge a PR into `dev` they can manually run a nightly build (or script the dispatch, as sketched below). This isn't ideal because people need to remember to do it!
- `pull_request_review`: A nightly build action could be triggered when a PR is approved, so people don't need to remember to run a full build prior to merging the PR into `dev`.
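For the `workflow_dispatch` route, the "manual" step doesn't have to be someone clicking through the Actions UI; it could be scripted against the GitHub API. A minimal sketch using PyGithub, assuming the nightly build workflow accepts `workflow_dispatch` (the workflow filename and branch name here are placeholders, not our actual names):

```python
# Sketch: trigger a nightly-build workflow run on a feature branch via
# workflow_dispatch. Workflow filename and branch below are placeholders.
import os

from github import Github

gh = Github(os.environ["GITHUB_TOKEN"])  # token needs workflow permissions
repo = gh.get_repo("catalyst-cooperative/pudl")

workflow = repo.get_workflow("nightly-build.yml")  # placeholder filename
ok = workflow.create_dispatch(ref="my-feature-branch")  # branch to build
print("dispatched" if ok else "dispatch failed")
```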
These solutions don't prevent a user from merging a PR into `dev` if the full build fails, because the failures happen on a GCE instance, not a GitHub runner. Ideally, we'd use a self-hosted runner to leverage GCP VMs, but GitHub doesn't recommend this for public repos.
I think option two is best for our purposes. What do you think @zaneselvans?
Actually, we would need to automate the creation and teardown of GCP VMs to support multiple feature branches running builds. I think we could use Terraform for this: #1785
Hmm, that's an interesting option. Is there any reason why allowing both would be bad? For the PR review trigger we'd need to get in the habit of not immediately merging things into `dev` when they're approved, and instead wait for the build to complete (which takes hours, so probably the next day in a lot of cases).
How big of a lift is the terraform stuff?
Yes, there would be a delay between the PR approval and merging it into dev.
I think the Terraform stuff would probably take me a sprint because it's a new tool. Given we may be moving to Dagster Cloud soon, I don't feel like #1785 is worth pursuing right now. I would love to use/learn Terraform for our next big cloud infrastructure project, though. It seems like the standard for managing cloud resources these days.
@jdangerx mentioned a new GCP service called Batch which might solve some of our problems. You can submit batch processing jobs to Google and it will spin up a VM, run the container, shut the VM down, and delete it. There are a couple of benefits to using Batch:
- Batch shuts down the VM when the process is done. This will allow us to remove our janky code that makes an API call from within the container to shut down the VM.
- Batch deletes the VM when the process is done. This will allow us to spin up an arbitrary number of VMs for branches.
- I think you can tell Batch where to store the process's outputs. This will allow us to remove the `gcloud` and `aws cp` commands from the nightly builds script.
Still preliminary research but it seems like a good option.
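For a sense of what this would look like, here's a minimal sketch of submitting a containerized build as a Batch job using the `google-cloud-batch` Python client, adapted from Google's published example; the image URI, machine type, run duration, project, and region are placeholders rather than anything we've settled on:

```python
# Sketch: submit a containerized nightly build as a GCP Batch job.
# Image URI, machine type, duration, project, and region are placeholders.
from google.cloud import batch_v1


def submit_nightly_build(project_id: str, region: str, job_id: str) -> batch_v1.Job:
    client = batch_v1.BatchServiceClient()

    # The container to run; Batch provisions a VM, runs it, then tears the VM down.
    runnable = batch_v1.Runnable()
    runnable.container = batch_v1.Runnable.Container()
    runnable.container.image_uri = "gcr.io/some-project/pudl-etl:feature-branch"

    task = batch_v1.TaskSpec()
    task.runnables = [runnable]
    task.max_retry_count = 0
    task.max_run_duration = "36000s"  # generous ceiling for an ~8 hour build

    group = batch_v1.TaskGroup()
    group.task_count = 1
    group.task_spec = task

    # Pick the VM shape; no manual shutdown or deletion code needed in the container.
    policy = batch_v1.AllocationPolicy.InstancePolicy()
    policy.machine_type = "e2-highmem-8"  # placeholder machine type
    instances = batch_v1.AllocationPolicy.InstancePolicyOrTemplate()
    instances.policy = policy
    allocation = batch_v1.AllocationPolicy()
    allocation.instances = [instances]

    job = batch_v1.Job()
    job.task_groups = [group]
    job.allocation_policy = allocation
    job.logs_policy = batch_v1.LogsPolicy()
    job.logs_policy.destination = batch_v1.LogsPolicy.Destination.CLOUD_LOGGING

    return client.create_job(
        job=job,
        job_id=job_id,
        parent=f"projects/{project_id}/locations/{region}",
    )
```

Since Batch owns VM provisioning and teardown, this is what would let us drop the in-container shutdown API call.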
Oh this sounds like a good option!
Another option is to use Dagster Cloud to run our nightly ETL. We could debug our builds using dagit instead of collecting all of stdout in a file. However, I'm not sure how Dagster Cloud handles Python logs. Also, it's probably more expensive than using something like GCP Batch: $0.04 per minute * 7 hour nightly builds * 60 minutes = $16.80 per build. It might get cheaper based on how much you use the platform; I'm clarifying with the Dagster folks. Our current VM is $0.36156 / hour * 7 hours ≈ $2.50, so way cheaper :/
Another issue would be running tests. We'd have to create a separate job to run our tests, which might be a big change.
If Batch is easy to set up, it might be a better option for now, $-wise.
Dagster updated its pricing model. We could get 10,000 asset materializations for $100 a month and compute time for $0.005 per minute. We have about 260 assets right now (let's call it 300), so we'd get about 33 full ETL builds a month. Full builds take about 8 hours on our beefy VM, so 8 * 60 * $0.005 = $2.40 per run. $100 + (33 * $2.40) ≈ $180, which is about how much we spend on our GCP VMs.
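To make that arithmetic explicit, here's the same back-of-the-envelope estimate as a tiny script; all the rates and counts are just the rough numbers quoted in this thread, not official pricing:

```python
# Back-of-the-envelope Dagster Cloud estimate using the rough numbers above.
assets_per_run = 300                 # ~260 assets, rounded up
included_materializations = 10_000   # included per month at the $100 tier
base_fee = 100.00                    # $/month
compute_rate = 0.005                 # $/minute
run_hours = 8                        # full build on our beefy VM

runs_per_month = included_materializations // assets_per_run  # ~33
compute_per_run = run_hours * 60 * compute_rate               # ~$2.40
monthly = base_fee + runs_per_month * compute_per_run         # ~$179

print(f"{runs_per_month} runs/month, ~${compute_per_run:.2f}/run, ~${monthly:.0f}/month")
```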
Benefits:
- Won't have to think about managing infrastructure overhead as much
- We'll likely need to set up a postgres database soon to handle our logs (see #2713). Dagster cloud would handle this for us.
- Will probably be easier to debug nightly build failures with dagit
Concerns/questions:
- For the serverless option, what size machines do they offer? What happens if we need to use multiple machines one day?
- I'm not sure how branch deployments work. Would we execute the full ETL? This would be expensive!
- Currently, we run a large suite of pytest tests after the ETL to validate the data. How would this work in Dagster Cloud? Would we have to create a job that runs the tests after the main ETL finishes?
- Would be amazing if dagster could rematerialize assets that had code changes.
We could in theory do this now that we can trigger builds on Batch. Doing this automatically sounds expensive, though, and the `workflow_dispatch` option already works. Do you think we can close this @bendnorman @zaneselvans?
Agreed! I say we close it.
Sounds good to me! There are only some big feature branches where it'll make sense to do this, and `workflow_dispatch` is good enough.