Discuss CI/CD logic and optimise the process

Status: Open. wiardvanrij opened this issue 2 years ago • 18 comments

Is your proposal related to a problem?

Currently we use CircleCI for:

  • any update to main, which publishes a new Docker image
  • any update on a tag, which does a crossbuild and publishes the image + tarballs

Both also require make test to pass.
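To make the current split explicit, here is a minimal Go sketch of the trigger logic above as a decision table. It assumes Git-style refs (refs/heads/main, refs/tags/...) and uses hypothetical job names. The authoritative behaviour is whatever the CircleCI config actually does.

```go
package main

import "strings"

// jobsFor models the CircleCI trigger logic described above.
// The job names are hypothetical labels, not the real job names
// in the Thanos CircleCI config.
func jobsFor(ref string) []string {
	switch {
	case ref == "refs/heads/main":
		// Every push to main runs the tests and publishes a Docker image.
		return []string{"test", "publish-image"}
	case strings.HasPrefix(ref, "refs/tags/"):
		// Every tag runs the tests, cross-builds, and publishes image + tarballs.
		return []string{"test", "crossbuild", "publish-image", "publish-tarballs"}
	default:
		// Other refs are not covered by the logic described in this issue.
		return nil
	}
}
```

Writing it down this way also shows how little logic is involved, which is part of the argument for moving it to a single place.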

I would not call it a problem, but since our repo is on GitHub, I think it makes more sense to have the logic above in GitHub Actions as well. Furthermore, we need to support more architectures in our builds.

I've tried to make things different/better via:

  • https://github.com/thanos-io/thanos/issues/4988
  • https://github.com/thanos-io/thanos/issues/4989
  • https://github.com/thanos-io/thanos/pull/4986

but IMO we can go in a different/better direction.

Describe the solution you'd like

  • Move all our CI/CD to GitHub, which should be easier to follow and iterate on than two different environments
  • Remove Promu
  • Use either buildx or even goreleaser
  • Don't build on every commit to main; let's move towards nightly builds instead
  • Have an option to build artifacts/images in a PR with something like /build, so a user can get a freshly built image to test with
  • I believe we have some compute stuff from the CNCF, so use that as runners instead of the free tier at GitHub (hopefully also helps with flaky tests)
  • Implement parallelism where possible (especially if we are going to add more archs)

wiardvanrij, May 09 '22 14:05

+100 to a more environmentally friendly approach.

Few notes:

  • We should only build things that matter on a PR (e.g. if someone only changes docs, why do we rebuild and test the code?)
  • "I believe we have some compute stuff from the CNCF, so use that as runners instead of the free tier at GitHub (hopefully also helps with flaky tests)" is not true. I don't think we have any free tier. What I actually heard is that CircleCI increased their open source allowance, so maybe we want to go in that direction? I was at Birmingham DevOps Days and met them (they had a booth), and their tier sounds quite generous. GH Actions have generally proven harder to use (compared with the CircleCI Orbs that Prometheus reuses) and not that fast (see the flakes we get even on parallel e2e tests). GH Actions also don't work well with e2e tests (they don't focus on the error, so you have to scroll a lot, but that's probably our fault).
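The "only build what matters" rule can be sketched as a path filter. GitHub Actions can express this declaratively with the paths / paths-ignore filters on pull_request triggers. The Go version below just makes the decision explicit. Treating docs/ and *.md files as documentation is an assumption about the repo layout.

```go
package main

import (
	"path"
	"strings"
)

// needsCodeCI reports whether a PR touching the given files should run the
// full build and test pipeline. Anything under docs/ and any Markdown file
// counts as documentation (an assumed rule; adjust to the real layout).
func needsCodeCI(changed []string) bool {
	for _, f := range changed {
		isDoc := strings.HasPrefix(f, "docs/") || path.Ext(f) == ".md"
		if !isDoc {
			return true // at least one non-doc file changed
		}
	}
	return false // only docs changed (or nothing at all)
}
```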

So... I wonder whether moving fully to GH Actions actually helps. Otherwise, the other points make a lot of sense! 💪🏽 Thanks.

bwplotka, May 10 '22 11:05

Should we also work with the Prometheus team to share the same CircleCI Orbs and build mechanisms? 🤔

See https://github.com/prometheus/circleci

bwplotka, May 10 '22 11:05

I'm all for helping each other out, but we are still two different projects. I've spent quite some time checking out Promu (we use a very old version of it (: ), but in the end I came to these conclusions:

  • Why use something that's opinionated towards the Prometheus ecosystem?
  • Why not keep it simple? All we need is a test, a build and a release. E.g. goreleaser covers that out of the box without any issues.

I guess the tl;dr of my point is: there are a lot of tools (buildx, goreleaser, native GH features) that add value without us having to build our own. It "should" be really easy.

So the Orbs are awesome for exporters and everything related to their project, but I wonder whether we even need certain components from them. I took a look at this doc: https://docs.google.com/document/d/1Ql-f_aThl-2eB5v3QdKV_zgBdetLLbdxxChpy-TnWSE/edit#heading=h.24x0bg1hyuak and you can see the problems they want to solve. That makes a lot of sense for their rich ecosystem with different projects. I don't think it makes that much sense for Thanos.

As for GH Actions vs CircleCI, I don't have any hard preference. It's also absolutely not the case that I dislike CircleCI or that we have issues with it. It's merely about keeping things a bit more uniform, in one place. I do think that the 'feedback' we get from GH jobs is easier for users to understand.

p.s. for GH, you should be able to see what runners we have. I'm not an admin/owner of the Thanos org on GH, so I don't have those insights.

wiardvanrij, May 10 '22 11:05

> Why use something that's opinionated towards the Prometheus ecosystem?

We don't. Prometheus is also evaluating goreleaser and will most likely move to it, cc @roidelapluie. It would be smart to reuse the same technique. Why recreate everything from scratch, including a potential Orb or GH Action shared library for this? 🤔

> That makes a lot of sense for their rich ecosystem with different projects. I don't think it makes that much sense for Thanos.

I would not be so sure; we are starting to add more projects to the thanos-community and thanos-io orgs and to sibling orgs like observatorium (:

> p.s. for GH, you should be able to see what runners we have. I'm not an admin/owner of the Thanos org on GH, so I don't have those insights.

You are now; I just changed it (you are an official Thanos maintainer, after all).

> As for GH Actions vs CircleCI, I don't have any hard preference. It's also absolutely not the case that I dislike CircleCI or that we have issues with it. It's merely keeping it a bit more uniform in one place. I do think that the 'feedback' we get from GH jobs is easier to understand for users.

Same here, I don't know the answer. My point is that moving to GH Actions might not be the obvious default after all; we might miss an opportunity, so let's think twice.

Overall, I think we are on the same page here (:

bwplotka, May 11 '22 14:05

Yeah, I will definitely check with Prometheus, and if that's a solid solution we should (re)use it for sure. Thanks!

p.s. I've now checked: we are on the free plan.

wiardvanrij, May 11 '22 15:05

While we are evaluating goreleaser, there are a few blockers here. I am investigating adding more capabilities to promu (like debs and RPMs) in the coming weeks; to me, promu is not dead.

roidelapluie, May 11 '22 15:05

Input from the community hours meeting:

  • Need to check whether we can collaborate with CircleCI
  • What about new users? We need some gate that prevents everything from running, to prevent mining, etc.
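This gate could be combined with the /build idea from the original proposal. Here is a minimal Go sketch. It assumes the CI job receives the PR comment body together with GitHub's author_association field (OWNER, MEMBER, COLLABORATOR, FIRST_TIME_CONTRIBUTOR, ...) from the webhook payload. Which associations count as trusted is a hypothetical policy choice.

```go
package main

import "strings"

// trusted lists the author_association values whose comments may trigger a
// build. This set is an assumed policy for illustration only.
var trusted = map[string]bool{
	"OWNER":        true,
	"MEMBER":       true,
	"COLLABORATOR": true,
}

// shouldBuild reports whether a PR comment contains a "/build" command on a
// line of its own and was written by a trusted user. This means first-time
// contributors cannot trigger expensive jobs on their own.
func shouldBuild(comment, authorAssociation string) bool {
	if !trusted[authorAssociation] {
		return false
	}
	for _, line := range strings.Split(comment, "\n") {
		if strings.TrimSpace(line) == "/build" {
			return true
		}
	}
	return false
}
```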

wiardvanrij, May 12 '22 11:05

Hi folks, I'm a Dev Advocate at CircleCI, and I'm here to hopefully answer any questions you might have and to find out whether we can indeed help your org out! Thank you @bwplotka for sharing this conversation with me.

> What about new users? We need some gate that prevents everything from running, to prevent mining, etc.

I just wrote a blog post about this: you can use CircleCI (CCI) contexts in conjunction with the user groups you have set up in your GH org to do that. https://circleci.com/blog/access-control-cicd/

Re flaky tests:

I'm not sure what the flaky test situation is, but we also have a detection feature that flags flaky tests in the dashboard (in case the bigger resource classes don't fix them from the get-go).
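For context, flaky-test detection generally flags a test when the same commit produces both passing and failing runs of it. Here is a minimal Go sketch of that rule. The input shape (a map from test name to a list of pass/fail outcomes) is hypothetical. This describes the general idea, not how CircleCI implements its feature.

```go
package main

import "sort"

// flakyTests returns, sorted by name, the tests that both passed and failed
// across repeated runs of the same commit. results maps a test name to its
// outcomes (true = pass).
func flakyTests(results map[string][]bool) []string {
	var flaky []string
	for name, runs := range results {
		passed, failed := false, false
		for _, ok := range runs {
			if ok {
				passed = true
			} else {
				failed = true
			}
		}
		if passed && failed {
			flaky = append(flaky, name)
		}
	}
	sort.Strings(flaky) // map iteration order is random; sort for stable output
	return flaky
}
```

A consistently failing test is not reported here. It is a plain failure, not a flake.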

I'll also have a look at the config & hopefully suggest some ideas for improvement over the next week or so :)

zmarkan, May 13 '22 15:05

FWIW, I have had a really good experience with both GitHub Actions and @goreleaser while we worked on @parca-dev. I would be happy to lend a hand with this, @wiardvanrij. Let me know. If we have a chance to use custom runners, I believe we can really speed things up.

We need to configure CI for thanos-io/objstore anyway; I want to try GH Actions there.

kakkoyun, Jul 20 '22 06:07

I am generally happy with goreleaser, except that nightly builds have moved to the Pro/paid tier. So we would still have to set something up for nightly builds. It's an amazing tool with a lot of features already baked in and a great community maintaining it. Super worth it.

I think it's also a great idea to keep using and growing promu. Bonus: with goreleaser locking some features behind a paywall, it would be great to have a completely FOSS alternative that not only Thanos and Prometheus but many other projects can use.

douglascamata, Jul 20 '22 08:07

Interesting to add to the discussion: we now bundle a version of Cortex inside internal/ and also run its unit tests. This makes the build even longer, and Cortex's tests are showing intermittent failures: https://app.circleci.com/pipelines/github/thanos-io/thanos/10225/workflows/34954c6e-2d15-4f2a-ba30-728a9072b1ad/jobs/19790

douglascamata, Aug 04 '22 09:08

> I am in general happy with goreleaser, except that the nightly builds have moved to the Pro/paid tier.

What do you mean by nightly builds? We can build a snapshot version for each PR and attach that to the PR. It works pretty well.

kakkoyun, Aug 04 '22 11:08

> Interesting to add to the discussion: now we bundle a version of Cortex inside internal and also run its unit tests. This is making the build even longer and Cortex's tests are showing intermittent failures: https://app.circleci.com/pipelines/github/thanos-io/thanos/10225/workflows/34954c6e-2d15-4f2a-ba30-728a9072b1ad/jobs/19790

I think what we should add is a test cache, so that on PRs only the relevant tests run. For merges, we can run the whole suite.

Especially considering that we moved the objstore tests out of the repo, I would have expected tests to run faster.
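Go's own test cache may already get us part of the way. In package list mode (e.g. go test ./...), go test reuses cached results for packages whose inputs haven't changed, provided only cacheable flags are used and GOCACHE is preserved between CI runs. For the "only relevant tests on PRs" idea, here is a simplified Go sketch that maps changed files to package patterns. The function name and the merge/PR flag are hypothetical. The sketch deliberately ignores reverse dependencies: a real version would also need to test packages that import a changed package, for example by inverting the import graph reported by go list.

```go
package main

import (
	"path"
	"sort"
)

// testArgs returns arguments for `go test`. On merges the whole suite runs;
// on PRs only the packages that contain changed .go files are tested.
// Simplification: packages importing a changed package are NOT included.
func testArgs(changed []string, isMerge bool) []string {
	if isMerge {
		return []string{"test", "./..."}
	}
	seen := map[string]bool{}
	for _, f := range changed {
		if path.Ext(f) != ".go" {
			continue // docs, configs, etc. don't select any package
		}
		dir := path.Dir(f)
		if dir != "." {
			dir = "./" + dir // make it a relative package pattern
		}
		seen[dir] = true
	}
	if len(seen) == 0 {
		return nil // no Go files changed: nothing to test
	}
	pkgs := make([]string, 0, len(seen))
	for p := range seen {
		pkgs = append(pkgs, p)
	}
	sort.Strings(pkgs) // deterministic order
	return append([]string{"test"}, pkgs...)
}
```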

kakkoyun, Aug 04 '22 11:08

@kakkoyun I think what goreleaser calls nightly builds is what you call snapshot builds:

> Whether if you need beta builds or a rolling-release system, the nightly builds feature will do it for you.

Taken from https://goreleaser.com/customization/nightlies/. Our use case would be the rolling release for commits pushed to main.

> Note that the idea behind GoReleaser's snapshots is for local builds or to validate your build on the CI pipeline. Artifacts won't be uploaded and will only be generated into the dist folder.

Taken from https://goreleaser.com/customization/snapshots/. Doesn't seem to fit our use case. 🤔

douglascamata, Aug 04 '22 12:08

Snapshots are what I've been using for PR branch builds and main-branch merge builds for years now :D They basically skip a bunch of checks, and when you add --skip-publish, that's all you need. So there's no need for the Pro version.

kakkoyun, Aug 04 '22 12:08

@kakkoyun but we do want some level of "publishing", don't we? I.e. the (multi-arch) container images for each commit. Is this also working well for you? If so, I think we've got everything we need.

douglascamata, Aug 04 '22 12:08

@douglascamata Aha, yes, GoReleaser does a lot of stuff, you're right. We don't use GoReleaser to build the containers, though; we use a separate step to build images. Still, we mimic the same behaviour: we build the binaries using GoReleaser and then copy them into the container images, exactly as GoReleaser does.

I don't recommend using GoReleaser for that, because there's no way to produce reproducible builds with docker buildx (which GoReleaser uses to produce multi-arch images). You need podman for that, and podman is only available in the Pro version. I believe reproducible builds and image signing are must-haves for an infrastructure project like Thanos, to ensure supply chain security.

If we decide to use GoReleaser, I'm happy to implement all these for Thanos. I've spent quite some time on these, and I would be happy to help.

kakkoyun, Aug 04 '22 13:08

That's all very useful information, thanks! 🤓

Especially because I was adding multi-arch images to https://github.com/observatorium and used buildx there (fewer things to set up, pretty straightforward), without goreleaser. From what I see, we might eventually want to migrate it to a mix of goreleaser & podman too. 🤔

douglascamata, Aug 04 '22 13:08