Run nightly build on feature branches
Our nightly builds are more reliable and allow us to catch errors in the ETL when new code has been merged into `dev`. This is great, but it would be nice to know about the error before code is merged into `dev` so we don't have to chase down the PR responsible for breaking the nightly builds on `dev`.
There are two GitHub Actions triggers that might allow us to do this:

- `workflow_dispatch`: This trigger allows you to run an action manually and specify the branch to use. When someone is ready to merge a PR into `dev` they can manually run a nightly build (or script the dispatch, as sketched below). This isn't ideal because people need to remember to do it!
- `pull_request_review`: A nightly build action could be triggered when a PR is approved, so people don't need to remember to run a full build prior to merging the PR into `dev`.
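For the `workflow_dispatch` route, the "manual" step doesn't have to be someone clicking through the Actions UI; it could be scripted against the GitHub API. A minimal sketch using PyGithub, assuming the nightly build workflow accepts `workflow_dispatch` (the workflow filename and branch name here are placeholders, not our actual names):

```python
# Sketch: trigger a nightly-build workflow run on a feature branch via
# workflow_dispatch. Workflow filename and branch below are placeholders.
import os

from github import Github

gh = Github(os.environ["GITHUB_TOKEN"])  # token needs workflow permissions
repo = gh.get_repo("catalyst-cooperative/pudl")

workflow = repo.get_workflow("nightly-build.yml")  # placeholder filename
ok = workflow.create_dispatch(ref="my-feature-branch")  # branch to build
print("dispatched" if ok else "dispatch failed")
```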
These solutions don't prevent a user from merging a PR into `dev` if the full build fails, because the failures happen on a GCE instance, not a GitHub runner. Ideally, we'd use a self-hosted runner to leverage GCP VMs, but GitHub doesn't recommend this for public repos.
I think option two is best for our purposes. What do you think @zaneselvans?
Actually, we would need to automate the creation and teardown of GCP VMs to support multiple feature branches running builds. I think we could use Terraform for this: #1785
Hmm, that's an interesting option. Is there any reason why allowing both would be bad? For the PR review trigger we'd need to get in the habit of not immediately merging things into `dev` when they're approved, and instead wait for the build to complete (which takes hours, so probably the next day in a lot of cases).
How big of a lift is the terraform stuff?
Yes, there would be a delay between the PR approval and merging it into dev.
I think the Terraform stuff would probably take me a sprint because it's a new tool. Given we may be moving to Dagster Cloud soon, I don't feel like #1785 is worth pursuing right now. I would love to use/learn Terraform for our next big cloud infrastructure project, though. It seems like the standard for managing cloud resources these days.
@jdangerx mentioned a new GCP service called Batch which might solve some of our problems. You can submit batch processing jobs to Google and it will spin up a VM, run the container, shut the VM down, and delete it. There are a couple of benefits to using Batch:
- Batch shuts down the VM when the process is done. This will allow us to remove our janky code that makes an API call from within the container to shut down the VM.
- Batch deletes the VM when the process is done. This will allow us to spin up an arbitrary number of VMs for branches.
- I think you can tell Batch where to store the process's outputs. This will allow us to remove the `gcloud` and `aws cp` commands from the nightly builds script.
Still preliminary research but it seems like a good option.
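For a sense of what this would look like, here's a minimal sketch of submitting a containerized build as a Batch job using the `google-cloud-batch` Python client, adapted from Google's published example; the image URI, machine type, run duration, project, and region are placeholders rather than anything we've settled on:

```python
# Sketch: submit a containerized nightly build as a GCP Batch job.
# Image URI, machine type, duration, project, and region are placeholders.
from google.cloud import batch_v1


def submit_nightly_build(project_id: str, region: str, job_id: str) -> batch_v1.Job:
    client = batch_v1.BatchServiceClient()

    # The container to run; Batch provisions a VM, runs it, then tears the VM down.
    runnable = batch_v1.Runnable()
    runnable.container = batch_v1.Runnable.Container()
    runnable.container.image_uri = "gcr.io/some-project/pudl-etl:feature-branch"

    task = batch_v1.TaskSpec()
    task.runnables = [runnable]
    task.max_retry_count = 0
    task.max_run_duration = "36000s"  # generous ceiling for an ~8 hour build

    group = batch_v1.TaskGroup()
    group.task_count = 1
    group.task_spec = task

    # Pick the VM shape; no manual shutdown or deletion code needed in the container.
    policy = batch_v1.AllocationPolicy.InstancePolicy()
    policy.machine_type = "e2-highmem-8"  # placeholder machine type
    instances = batch_v1.AllocationPolicy.InstancePolicyOrTemplate()
    instances.policy = policy
    allocation = batch_v1.AllocationPolicy()
    allocation.instances = [instances]

    job = batch_v1.Job()
    job.task_groups = [group]
    job.allocation_policy = allocation
    job.logs_policy = batch_v1.LogsPolicy()
    job.logs_policy.destination = batch_v1.LogsPolicy.Destination.CLOUD_LOGGING

    return client.create_job(
        job=job,
        job_id=job_id,
        parent=f"projects/{project_id}/locations/{region}",
    )
```

Since Batch owns VM provisioning and teardown, this is what would let us drop the in-container shutdown API call.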
Oh this sounds like a good option!
Another option is to use Dagster Cloud to run our nightly ETL. We could debug our builds using dagit instead of collecting all of stdout in a file. However, I'm not sure how Dagster Cloud handles Python logs. Also, it's probably more expensive than using something like GCP Batch: $0.04 per minute * 7 hour nightly builds * 60 minutes = $16.80 per build. It might get cheaper based on how much you use the platform; I'm clarifying with the Dagster folks. Our current VM is $0.36156 / hour * 7 hours ≈ $2.50, so way cheaper :/
Another issue would be running tests. We'd have to create a separate job to run our tests, which might be a big change.
If Batch is easy to set up, it might be a better option for now, $-wise.
Dagster updated its pricing model. We could get 10,000 asset materializations for $100 a month and compute time for $0.005 per minute. We have about 260 assets right now (let's call it 300), so we'd get about 33 full ETL builds a month. Full builds take about 8 hours on our beefy VM, so 8 * 60 * $0.005 = $2.40 per run. $100 + (33 * $2.40) ≈ $180, which is about how much we spend on our GCP VMs.
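To make that arithmetic explicit, here's the same back-of-the-envelope estimate as a tiny script; all the rates and counts are just the rough numbers quoted in this thread, not official pricing:

```python
# Back-of-the-envelope Dagster Cloud estimate using the rough numbers above.
assets_per_run = 300                 # ~260 assets, rounded up
included_materializations = 10_000   # included per month at the $100 tier
base_fee = 100.00                    # $/month
compute_rate = 0.005                 # $/minute
run_hours = 8                        # full build on our beefy VM

runs_per_month = included_materializations // assets_per_run  # ~33
compute_per_run = run_hours * 60 * compute_rate               # ~$2.40
monthly = base_fee + runs_per_month * compute_per_run         # ~$179

print(f"{runs_per_month} runs/month, ~${compute_per_run:.2f}/run, ~${monthly:.0f}/month")
```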
Benefits:
- Won't have to think about managing infrastructure overhead as much
- We'll likely need to set up a postgres database soon to handle our logs (see #2713). Dagster cloud would handle this for us.
- Will probably be easier to debug nightly build failures with dagit
Concerns/questions:
- For the serverless option, what size machines do they offer? What happens if we need to use multiple machines one day?
- I'm not sure how branch deployments work. Would we execute the full ETL? This would be expensive!
- Currently, we run a large suite of pytest tests after the ETL to validate the data. How would this work in Dagster Cloud? Would we have to create a job that runs the tests after the main ETL finishes?
- Would be amazing if dagster could rematerialize assets that had code changes.
We could in theory do this now that we can trigger builds on Batch. Doing this automatically sounds expensive, though, and the `workflow_dispatch` option already works. Do you think we can close this @bendnorman @zaneselvans?
Agreed! I say we close it.
Sounds good to me! There are only some big feature branches where it'll make sense to do this, and `workflow_dispatch` is good enough.