gcp-ingestion
Branches, tags, and deployment strategy
Currently, merging code to `master` will trigger stage deploys of all Beam jobs and then make the code eligible for manual deployment to prod. `ingestion-sink` relies on a user with write privileges creating a tag pointing at current `master` to trigger building and deploying a container to Docker Hub.
We are discussing fully automating deploys to prod after we automate schema deploys (see Bug 1667920), so we want to reevaluate the desired conditions for deployment.
We apply branch protection to `master`, rejecting merges unless they come from a PR with review and various status checks passing. This means that any code that passes through the deployment pipeline to prod has necessarily met these checks. We have no such protection on tagging, so relying on tags to trigger deployments means that any branch can be promoted to prod.
Tagging seems like a convenient way to signal that we want to deploy code, but lacks safeguards.
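To make the gap concrete, here is a sketch in a throwaway local repo (assuming git >= 2.28 for `init -b`) showing that nothing ties a tag to reviewed history: a tag can point at any commit on any branch.

```shell
# Demo in a throwaway repo: branch protection guards merges to master,
# but a tag can point at any commit, including unreviewed ones.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b master .
git -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "reviewed change on master"
git checkout -q -b unreviewed
git -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "unreviewed change"
git tag v0.0.1                       # nothing prevents tagging this commit
git branch --contains tags/v0.0.1    # lists "unreviewed", not "master"
# pushing v0.0.1 would trigger a tag-based deploy of unreviewed code
```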
As long as we don't have concerns about the effect of multiple deploys happening quickly when several PRs are merged, I think I'm in favor of maintaining the status quo: merging code to master makes it eligible for deployment to prod Beam jobs. We will wait for real-world scenarios to come up that might convince us we need a separate signal for stage and prod deploys.
cc @whd
> As long as we don't have concerns about the effect of multiple deploys happening quickly when several PRs are merged, I think I'm in favor of maintaining the status quo: merging code to master makes it eligible for deployment to prod Beam jobs.
For Beam, because there is a template build step on Jenkins that occurs after any merge to master, deploys are spaced at least one build's duration apart, which is generally 5-10 minutes. PRs merged within a five-minute window will also typically be bundled together, since that's the polling interval used by Jenkins. I'm presently not concerned that this will have any impact on production job latency.
> We will wait for real-world scenarios to come up that might convince us we need a separate signal for stage and prod deploys.
I think we're in agreement here for Beam, and what remains is to translate production deployment eligibility into a schedule (that can hopefully be automated).
Can we add something like this to `.circleci/config.yml` to enforce that tags are on `master`?
```bash
# Require that the tagged commit is reachable from master before deploying.
# Note: in CI, a fresh clone may need `git fetch origin` first so that
# remote branches are visible to `git branch -a --contains`.
if git branch -a --contains "tags/${CIRCLE_TAG}" | grep -qE '(^|[ /])master$'; then
  echo "${CIRCLE_TAG} on master"
  # do stuff with tag here
fi
```
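For reference, a check like this only runs at all if the workflow is configured to build on tags, since CircleCI ignores tag pushes by default. A hypothetical sketch (the `deploy` job name and `v`-prefixed tag pattern are assumptions):

```yaml
workflows:
  version: 2
  tagged-deploy:
    jobs:
      - deploy:
          filters:
            tags:
              only: /^v.*/    # run for tags like v1.2.3
            branches:
              ignore: /.*/    # do not run for plain branch pushes
```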
@willkg makes a good point that we could add some additional security around building based on tags via bash logic beyond what's available in CircleCI's config structure directly.
I've still managed to convince myself here that deploying master directly is preferable for its simplicity compared to relying on developers making tags.
@whd What's the reasoning behind only deploying docker images based on tags? Why don't we deploy images on every merge to master?
One thing I like about tags is that they become artifacts of what went out. I have a release script that I use to automate building and pushing the tag in Socorro-land and it produces tags and then the tag information is helpful in investigating later. I <3 artifacts.
Tags:
https://github.com/mozilla-services/socorro/tags
I'd be up for adjusting that to help here if that helps.
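A minimal version of such a release script might look like the following sketch (the date-based tag format is an assumption; the demo operates on a throwaway local repo and stops short of pushing):

```shell
# Hypothetical minimal release script: create an annotated, date-stamped
# tag on the tip of master so each deploy leaves an auditable artifact.
set -e
tmp=$(mktemp -d) && cd "$tmp"           # demo in a throwaway repo
git init -q -b master .                 # assumes git >= 2.28 for -b
git -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "feature"
tag="v$(date -u +%Y.%m.%d)"             # e.g. v2024.01.15
git tag -a "$tag" -m "release $tag"     # annotated tag records tagger + date
git tag --points-at master              # the tag is the deploy artifact
# a real script would finish with: git push origin "$tag"
```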
I do like the idea of having artifacts about what was deployed. Also, the more I think about this, the more I dislike the current mismatch between how we deploy `ingestion-sink` based on tags but Beam jobs based on `master`.
Before moving further here, I'd appreciate more background from @whd about how/why tags are currently used for docker image deploys. I'd also like his take on whether he has any concerns about the current situation where tags don't have the same protections as the `master` branch itself. Does secops have best practices around tag-based deploys?
> @whd What's the reasoning behind only deploying docker images based on tags? Why don't we deploy images on every merge to master?
We generally follow Services SRE SOP for docker based deployments. That means dockerflow and pipelines structured around merges to master and tagging. We deviate from those standards in that we don't currently have CI in place to push changes from master to a dev environment (which I deemed too expensive to maintain for marginal value), and don't have CI to deploy tags to stage and prod. The existing system is designed around a dev-stage-prod development cycle, with some QA between stage and prod, and manual operator involvement in all production deployments.
To work within this system, I generally treat our docker based services as stable and only update them for necessary feature updates. Typically those updates require infrastructure changes anyway (e.g. switching to batch loads from streaming inserts, deploying the edge flush manager). Essentially, since no new features are generally added to ingestion-edge or ingestion-sink without corresponding ops changes, and because the existing system is designed to be relatively high touch, I have opted to avoid deploying docker code updates except when necessary.
> about how/why tags are currently used for docker image deploys. I'd also like his take on whether he has any concerns about the current situation where tags don't have the same protections as the master branch itself. Does secops have best practices around tag-based deploys?
I don't think secops has any concerns with the tag-based approach (it again is the Services SRE standard deployment model), except in the context of fully automated deployments to production. Their recommendation would likely be to require operator input on any code change to prod, whether based on tags or not. What we are chiefly exploring here is whether, absent a nontrivial QA step between stage and prod, there is any value in gating production deployments on operator input. Should a QA step (e.g. burnham) be introduced?
I occasionally get asked about this, so I've documented it on Mana and am closing this as SOP.