
Migrate CI to Github actions

Open castelblanque opened this issue 2 years ago • 18 comments

Description:

Move CI to Github actions

  • Benefit: Be aligned with most of the Tanzu projects using it

Related to #4096

Progress

  • [x] test_go
  • [x] test_dashboard
  • [x] test_pinniped_proxy
  • [x] test_chart_render
  • [x] build_go_images
  • [x] build_dashboard
  • [x] build_pinniped_proxy
  • [x] build_e2e_runner
  • [x] local_e2e_tests
  • [x] sync_chart_from_bitnami
  • [x] GKE_REGULAR_VERSION_MAIN
  • [x] GKE_REGULAR_VERSION_LATEST_RELEASE
  • [x] GKE_STABLE_VERSION_MAIN
  • [x] GKE_STABLE_VERSION_LATEST_RELEASE
  • [x] push_images
  • [ ] report_srp
  • [x] sync_chart_to_bitnami
  • [ ] release
  • [ ] Switch off the CircleCI pipeline.
  • [ ] Update CI documentation
  • [ ] Fix failing workflows triggered by dependabot's PRs.
  • [ ] Unify workflows as much as possible

castelblanque avatar Mar 15 '22 13:03 castelblanque

In the meantime, we could start splitting some logic into some container images, as Rafa suggested (see https://github.com/vmware-tanzu/kubeapps/pull/5177#pullrequestreview-1083477458)

antgamdia avatar Aug 24 '22 09:08 antgamdia

The first iteration of the migration has been completed with the merge of the first version of the new workflow, which includes those jobs that are run for every single push of a branch: test, build images, push images, run e2e tests, and sync from/to Bitnami chart (all but those that belong to the pre-release and release flows, and report_srp).

beni0888 avatar Oct 05 '22 08:10 beni0888

Great work @beni0888 !!!

ppbaena avatar Oct 05 '22 08:10 ppbaena

Awesome @beni0888! Are we storing artifacts from E2E in case they fail? (e.g. screenshots, videos, logs, etc.)

castelblanque avatar Oct 05 '22 09:10 castelblanque

I think so, under the "Artifacts" section of the GitHub Actions run. However, they are stored in a zip, so we have to download and uncompress it to view the files. An extra step, unfortunately, but not a big deal.

antgamdia avatar Oct 05 '22 10:10 antgamdia

Yeah, as Antonio said, in case E2E tests fail, the reports are stored under the Artifacts section in the workflow run.

beni0888 avatar Oct 05 '22 10:10 beni0888

There is no way to download artifacts individually like in CircleCI, only as a zip file. Citing the GH Actions documentation: "There is currently no way to download artifacts after a workflow run finishes in a format other than a zip or to download artifact contents individually." This will make debugging CI errors slower, e.g. a huge file will have to be downloaded just to see one screenshot of a failing E2E test.

castelblanque avatar Oct 05 '22 10:10 castelblanque

Lately, I've been fighting against the issues that were making the jobs sync_from_bitnami and sync_to_bitnami fail. A thorough explanation of the issues faced and the solutions applied can be found in #5524.

beni0888 avatar Oct 20 '22 10:10 beni0888

I've filed a new PR to prevent the production versions of the Docker images generated from GHA from overwriting the ones generated from CircleCI. So far, that separation exists for the development images (those whose names carry the -ci suffix, or -ci-gha for the GHA versions) but not for the production ones.

beni0888 avatar Oct 20 '22 16:10 beni0888

I noticed that dependabot's PRs are failing at push_dev_images. I guess it's because GH treats dependabot as an outsider, so secrets aren't available in the triggered workflow. I need to dig deeper into that issue and find the best approach to fix it. Adding a new task to the list.

beni0888 avatar Oct 21 '22 06:10 beni0888

A fact worth mentioning, just for knowledge-sharing purposes: CircleCI allows defining environment variables both at the project level (through its web app) and in the workflow config file. Env vars defined at the project level are the equivalent of GHA secrets.

beni0888 avatar Oct 21 '22 07:10 beni0888

Latest issues faced:

  • The GHA runner (ubuntu-latest in our case) comes with a set of preinstalled software applications, among which is the GCloud SDK. This raises some problems with the scripts currently used by the CircleCI pipeline:
    • As the script performs its own installation, we end up with multiple installations and cannot be sure which one is used by default (well, the truth is that it is the preinstalled one, unless we explicitly choose the other).
    • The preinstalled version of the GCloud SDK doesn't allow installing plugins through gcloud components install, so I had to find an alternative and do it via apt-get install, which requires first adding the DEB source for Google Cloud and its corresponding keyring.
  • GKE max allowed length for the name of the clusters is 40 characters, but our current scripts don't take that into account (I guess because we only run these jobs in the pre-release workflow), so it exploded in my face.
  • The current set of scripts isn't configured to run under the unofficial "bash strict mode" (set -euo pipefail), which is the recommended way to avoid tricky bugs and long debugging sessions. When I ran them in strict mode, some issues appeared, like unbound variables, which aren't straightforward to fix because they need some investigation to understand how those variables are filled in CircleCI.
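For context, the strict mode mentioned above is just one line at the top of a script. A minimal sketch (the variable name below is made up) of how it changes behavior for unset variables:

```shell
#!/usr/bin/env bash
# Unofficial "bash strict mode": exit on any error, treat unset
# variables as errors, and fail a pipeline if any command in it fails.
set -euo pipefail

# Under `set -u`, referencing an unset variable aborts the script
# instead of silently expanding to an empty string, so optional
# variables need an explicit default:
SOME_OPTIONAL_VAR="${SOME_OPTIONAL_VAR:-default-value}"
echo "value: ${SOME_OPTIONAL_VAR}"
```

Unbound-variable errors like the ones hit in the migration come from `set -u`; giving each optional variable a `${VAR:-default}` expansion is the usual fix.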

beni0888 avatar Oct 25 '22 05:10 beni0888

Here is a bug in the current CircleCI setup I've just discovered. As you can see in the following screenshot, because some of the positional parameters in the call to the script /script/e2e-test.sh are passed without a value, inside the script the variables DEX_IP and ADDITIONAL_CLUSTER_IP end up taking the values of the parameters KAPP_CONTROLLER_VERSION and CHART_MUSEUM_VERSION. This is currently not causing any misbehavior in our CircleCI workflow because the multi-cluster scenario is not tested in GKE, but it is certainly a ticking bomb waiting to explode the moment we decide to also test those scenarios in GKE or any other Kubernetes flavor. Screenshot 2022-10-25 at 08 29 18
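To illustrate the failure mode with a hypothetical sketch (this is not the real signature of e2e-test.sh): when a caller omits earlier positional values instead of passing empty placeholders, later values silently shift into the wrong variables.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the positional-parameter pitfall described
# above; the parameter order is illustrative.
parse_args() {
  local dex_ip="$1"
  local additional_cluster_ip="$2"
  local kapp_controller_version="$3"
  echo "DEX_IP=${dex_ip} KAPP=${kapp_controller_version}"
}

# The caller passes only the kapp-controller version, without ""
# placeholders for the two IPs, so it lands in the DEX_IP slot:
parse_args "v0.42.0"
```

Because plain bash expands missing positional parameters to empty strings, no error is raised and the mix-up goes unnoticed until a scenario actually uses those variables.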

beni0888 avatar Oct 25 '22 06:10 beni0888

The GHA runner (ubuntu-latest in our case) comes with a set of preinstalled software applications

I would have assumed that a container with ubuntu-latest comes clean, but that is far from reality: see the list of bundled software here. Google Cloud SDK is indeed there.

GKE max allowed length for the name of the clusters is 40 characters

We may have run into this issue occasionally while developing, I believe. Do you think it needs to be solved now, or can it wait?

The current set of scripts isn't configured to run under the unofficial "bash strict mode"

This is not strictly related to GHA, so maybe we should create a separate issue (tech debt) to be tackled at some point in the future?

because some of the positional parameters in the call to the script are passed without a value

Passing positional parameters has caused a lot of headaches in the past. I would really switch to using named environment variables; we can always check inside the script whether a variable exists. What do you think?
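A sketch of that approach (the variable and helper names are illustrative, not from our scripts): a small check makes a script fail fast with a clear message when a mandatory variable is missing, instead of silently shifting values around.

```shell
#!/usr/bin/env bash
# Illustrative sketch: validate named env vars up front instead of
# relying on positional parameters.
require_var() {
  local name="$1"
  # Indirect expansion looks up the variable whose name is in $name.
  if [ -z "${!name:-}" ]; then
    echo "error: required env var ${name} is not set" >&2
    return 1
  fi
}

DEX_IP="10.0.0.1"   # illustrative value, normally exported by the caller
require_var DEX_IP && echo "Using DEX_IP=${DEX_IP}"
```

The same effect can be had inline with bash's `${VAR:?message}` expansion, which aborts with the message when `VAR` is unset or empty.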

castelblanque avatar Oct 25 '22 06:10 castelblanque

Hey @castelblanque, excuse my late response, yesterday I was too focused on the cluster-creation issues... Answering your comments:

  • Yeah, GHA runners come with a set of preinstalled software; they notify you about that and provide a link to the details page for each runner in every job's execution logs.
  • Regarding the GKE max allowed name length, I've taken the opportunity to fix it in the PR I'm currently working on.
  • With regards to "bash strict mode", I'm also applying it in the current PR, and I'm facing the issues that appear when you turn it on (e.g. unbound variables).
  • And the same for the positional-arguments issue: I'm switching to global variables in the scripts I'm touching.

beni0888 avatar Oct 27 '22 08:10 beni0888

Right now I'm working on migrating the GKE jobs to GHA. During this process I've hit several stumbling blocks along the way...

  • GKE cluster creation failing in CircleCI: suddenly and surprisingly, the creation of GKE clusters started to fail in the CircleCI pipeline, even though it had been working smoothly so far. The reason is that Google has introduced a bug in the latest release of google-cloud-sdk that makes it fail when providing labels (--labels=KEY=VALUE,...) in the cluster-creation call. After asking the team and the SRE people, it seems we can drop the label we currently set when creating clusters (team=kubeapps), at least until a new google-cloud-sdk release fixes the problem.
  • Problems providing input when calling reusable workflows from GHA:
    • You cannot use environment variables inside the with object used for passing input parameters to reusable workflows. That forced me to create an intermediate job whose steps expose the required env vars as output variables, and to consume those outputs in the with block.
    • If you define an output directly as the value of an env var (e.g. SOME_OUTPUT_VAR: $SOME_ENV_VAR) instead of setting it from inside a job's step with something like echo "SOME_OUTPUT_VAR=$SOME_ENV_VAR" >> $GITHUB_OUTPUT, it works smoothly when you consume that output from a different job in the same workflow, but you receive empty values when you consume it as the input of a reusable workflow.
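A minimal sketch of that workaround (the job names and the image_tag input are made up for illustration): an intermediate job writes the env var to $GITHUB_OUTPUT in a step, and only then can the value be passed through `with:` to the reusable workflow.

```yaml
jobs:
  setup:
    runs-on: ubuntu-latest
    outputs:
      # Must be set from a step via $GITHUB_OUTPUT; mapping it directly
      # to an env var here yields an empty value in the called workflow.
      image_tag: ${{ steps.vars.outputs.image_tag }}
    steps:
      - id: vars
        run: echo "image_tag=$IMAGE_TAG" >> "$GITHUB_OUTPUT"
        env:
          IMAGE_TAG: 1.2.3-ci-gha

  call-reusable:
    needs: setup
    uses: ./.github/workflows/kubeapps-general.yaml
    with:
      image_tag: ${{ needs.setup.outputs.image_tag }}
```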

beni0888 avatar Oct 27 '22 09:10 beni0888

Adding new findings, for the record:

  • Currently we have four different GKE-related jobs in CircleCI: GKE_REGULAR_VERSION_MAIN, GKE_REGULAR_VERSION_LATEST_RELEASE, GKE_STABLE_VERSION_MAIN and GKE_STABLE_VERSION_LATEST_RELEASE. But it turns out that the *_LATEST_RELEASE ones are never executed, because their check_conditions step checks the result of a changedVersion function that doesn't exist. So we can safely remove those jobs.
  • Regarding the two remaining GKE jobs, each of them tests what we call a GKE BRANCH, which corresponds to a GKE release channel (rapid, regular, stable), but we only provide the numeric GKE/k8s version (e.g. 1.22). The problem is that we rely on that value to generate each job's cluster name, so the names generated by the two jobs are identical, which makes the cluster-creation step fail in whichever job arrives later. To fix this, I've introduced a new variable called GKE_RELEASE_CHANNEL in the GHA workflow that holds the release channel (stable or regular), so we avoid that problem when the versions coincide. In CircleCI, the current practice is to rerun the failing job after the other one has finished; we could follow a similar approach for a permanent fix, or just wait until we finally decommission that pipeline 🤞🏻
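The collision can be sketched like this (the naming scheme below is hypothetical, not our actual script):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the cluster-name collision: both channel jobs
# can resolve to the same numeric version, so a name derived only from
# the version is identical for both and the second creation fails.
GKE_VERSION="1.22"
name_without_channel="kubeapps-ci-${GKE_VERSION//./-}"

# Including the release channel makes each job's cluster name unique:
GKE_RELEASE_CHANNEL="regular"
name_with_channel="kubeapps-ci-${GKE_RELEASE_CHANNEL}-${GKE_VERSION//./-}"
echo "${name_without_channel} vs ${name_with_channel}"
```

Note that whatever scheme is used still has to respect GKE's 40-character limit on cluster names.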

beni0888 avatar Oct 27 '22 16:10 beni0888

GKE jobs have been migrated to GHA 🎉 We are almost there!

beni0888 avatar Oct 28 '22 14:10 beni0888

The prerelease flow has been migrated to GHA. We did it by adding a new manually-triggered workflow called Full Integration Pipeline that calls the general pipeline defined in /.github/workflows/kubeapps-general.yaml, passing an input parameter run_gke_tests indicating that the GKE jobs must be run. The kubeapps-general.yaml file contains the definition of the whole pipeline as a reusable workflow, so we can call it with parameters to adapt its execution to different scenarios: prerelease, release, new commit, etc.

The release job has also been migrated to GHA as part of the kubeapps-general.yaml workflow, and a new workflow called Release Pipeline has been created. It runs automatically when new version tags (vX.Y.Z) are created, and it also calls the kubeapps-general.yaml workflow with run_gke_tests set to true, so the GKE tests are executed for this scenario too.
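The shape of that wiring is roughly the following (a sketch: the tag pattern and field values are illustrative, the real definition lives under /.github/workflows/):

```yaml
# Release Pipeline (sketch): runs on vX.Y.Z tags and delegates to the
# reusable general pipeline, enabling the GKE jobs.
name: Release Pipeline
on:
  push:
    tags:
      - "v[0-9]+.[0-9]+.[0-9]+"
jobs:
  release:
    uses: ./.github/workflows/kubeapps-general.yaml
    with:
      run_gke_tests: true
    # Pass the caller's secrets down to the reusable workflow:
    secrets: inherit
```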

beni0888 avatar Nov 02 '22 16:11 beni0888

After digging into the problem of workflows triggered by dependabot consistently failing, I noticed that GHA secrets are not available for the workflows triggered by dependabot, so we need to replicate those secrets for it in its own secrets configuration section.

beni0888 avatar Nov 02 '22 16:11 beni0888

Thanks @beni0888! Now that the prerelease/release flow is migrated to GHA, shouldn't we update the release process documentation, which still points to CircleCI?

castelblanque avatar Nov 03 '22 07:11 castelblanque

Thanks @beni0888! Now that the prerelease/release flow is migrated to GHA, shouldn't we update the release process documentation, which still points to CircleCI?

Yeah, that document needs to be updated, but take into account that the GHA CI is still running in DEV_MODE, so the CI actually doing the release is still CircleCI. But yes, as soon as we migrate the report_srp job to GHA and make sure the whole CI works properly (which it seems to), we can turn off the CircleCI pipeline and GHA's DEV_MODE, and that document will need to be updated to reflect the new scenario. Thanks for the heads up!

beni0888 avatar Nov 03 '22 12:11 beni0888

After digging into the problem of workflows triggered by dependabot consistently failing, I noticed that GHA secrets are not available for the workflows triggered by dependabot, so we need to replicate those secrets for it in its own secrets configuration section.

It seems that the problem has been solved after adding the secrets DOCKER_USERNAME and DOCKER_PASSWORD in the dependabot's secrets section.

beni0888 avatar Nov 03 '22 12:11 beni0888

The work of unifying the several existing workflows as much as possible has been completed in this PR. We grouped all the lint-related workflows into a single reusable workflow, which is included in the kubeapps-general workflow, so they are integrated as part of the general pipeline.

beni0888 avatar Nov 07 '22 12:11 beni0888

The fix for the flaky test in the local_e2e_tests' operator group has been tackled in #5613.

beni0888 avatar Nov 08 '22 08:11 beni0888

To be able to report the SRP source provenance from GHA, we need to generate and ask for the registration of a new SRP UID. I have filed this ticket in Jira Service Desk for that.

beni0888 avatar Nov 11 '22 15:11 beni0888

According to the activity in the Jira Ticket, our request has been fulfilled and the new SRP UID has been registered, so we should be able to report the source provenance from GHA.

beni0888 avatar Nov 15 '22 10:11 beni0888

I have confirmed with the SRP team that we are already properly reporting the source provenance from the GHA pipeline 🎉

beni0888 avatar Nov 17 '22 15:11 beni0888

Nice - looks like you've completed everything other than the easy parts of just switching CircleCI off and updating the docs! Well done @beni0888

absoludity avatar Nov 17 '22 19:11 absoludity

I've been working on fixing some minor issues that were preventing the release workflow from working properly. Fortunately, it seems that it finally does :tada: However, it's still running in DEV_MODE and not actually creating the release, so we can't be 100% sure until we trigger a real release from the GHA pipeline.

beni0888 avatar Nov 25 '22 16:11 beni0888