Migrate CI to GitHub Actions
Description:
Move CI to GitHub Actions
- Benefit: be aligned with most of the Tanzu projects, which already use it
Related to #4096
Progress
- [x] test_go
- [x] test_dashboard
- [x] test_pinniped_proxy
- [x] test_chart_render
- [x] build_go_images
- [x] build_dashboard
- [x] build_pinniped_proxy
- [x] build_e2e_runner
- [x] local_e2e_tests
- [x] sync_chart_from_bitnami
- [x] GKE_REGULAR_VERSION_MAIN
- [x] GKE_REGULAR_VERSION_LATEST_RELEASE
- [x] GKE_STABLE_VERSION_MAIN
- [x] GKE_STABLE_VERSION_LATEST_RELEASE
- [x] push_images
- [ ] report_srp
- [x] sync_chart_to_bitnami
- [ ] release
- [ ] Switch off the CircleCI pipeline.
- [ ] Update CI documentation
- [ ] Fix failing workflows triggered by dependabot's PRs.
- [ ] Unify workflows as much as possible
In the meantime, we could start splitting some of the logic into container images, as Rafa suggested (see https://github.com/vmware-tanzu/kubeapps/pull/5177#pullrequestreview-1083477458)
The first iteration of the migration has been completed with the merge of the first version of the new workflow, which includes the jobs that run for every single push of a branch: test, build images, push images, run e2e tests, and sync the chart from/to Bitnami (all but those that belong to the pre-release and release flows, and `report_srp`).
Great work @beni0888 !!!
Awesome @beni0888! Are we storing artifacts from E2E in case they fail? (e.g. screenshots, videos, logs, etc.)
I think so, under the "Artifacts" section in the GitHub Actions run. However, they are stored in a zip, so we have to download and uncompress it to view the files. An extra step, unfortunately, but not a big deal.
Yeah, as Antonio said, in case E2E tests fail, the reports are stored under the `Artifacts` section in the workflow run. Unlike CircleCI, there is no way to download the artifacts individually, only as a zip file.
Citing the GitHub Actions documentation: "There is currently no way to download artifacts after a workflow run finishes in a format other than a zip or to download artifact contents individually."
This will make debugging errors in CI slower, e.g. a huge file will have to be downloaded just to see a single screenshot of a failing E2E test.
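For reference, a minimal sketch of how the e2e reports could be uploaded as artifacts (the action version and the paths below are assumptions, not necessarily what the Kubeapps workflow uses):

```yaml
# Hypothetical step: upload e2e reports (screenshots, videos, logs) when the job fails.
- name: Upload e2e artifacts
  if: failure()
  uses: actions/upload-artifact@v3
  with:
    name: e2e-reports
    # Path is an assumption; the real location depends on the e2e runner configuration.
    path: integration/reports/
```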
Lately, I've been fighting against the issues that were making the jobs `sync_from_bitnami` and `sync_to_bitnami` fail. A thorough explanation of the issues faced and the solutions applied can be found in #5524.
I've filed a new PR to avoid the production versions of the Docker images generated from GHA overlapping and overwriting the ones generated from CircleCI. So far, that separation is only in place for the development images (those whose name is appended the `-ci` suffix, or `-ci-gha` for the GHA versions), but not for the production ones.
I noticed that dependabot's PRs are failing at `push_dev_images`. I guess that's because GH considers dependabot an outsider, so secrets aren't available in the triggered workflow. I need to dig deeper into that issue and find out the best approach to fix it. Adding a new task to the list.
A fact worth mentioning, just for knowledge-sharing purposes: CircleCI allows defining environment variables both at the project level through its web app and in the workflow config file. The env vars defined at the project level are the rough equivalent of GHA's secrets.
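Just to illustrate the difference (the secret names here are made up, not the real Kubeapps ones): a value that CircleCI exposes as a plain project-level env var has to be declared as a repository secret in GHA and referenced through the `secrets` context:

```yaml
# Hypothetical job; the secret names are examples only.
jobs:
  push_images:
    runs-on: ubuntu-latest
    steps:
      - name: Log in to the registry
        # In CircleCI these would simply be project-level env vars; in GHA they must be
        # declared under the repository secrets and referenced via the secrets context.
        env:
          REGISTRY_USER: ${{ secrets.REGISTRY_USER }}
          REGISTRY_PASSWORD: ${{ secrets.REGISTRY_PASSWORD }}
        run: echo "$REGISTRY_PASSWORD" | docker login -u "$REGISTRY_USER" --password-stdin
```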
Latest issues faced:
- The GHA runner (`ubuntu-latest` in our case) comes with a set of preinstalled software applications, among which is the `GCloud SDK`. This raises some problems with the current scripts used by the CircleCI pipeline:
  - As the script performs its own installation, we end up with multiple installations and cannot be sure which of them is used by default (well, the truth is that it is the preinstalled one unless we explicitly choose the other one).
  - The preinstalled version of the `GCloud SDK` doesn't allow installing plugins through `gcloud components install`, so I had to find an alternative solution and do it via `apt-get install`, which requires previously adding the `DEB` source for Google Cloud and its corresponding keyring (see the sketch after this list).
- GKE's maximum allowed length for cluster names is 40 characters, but our current scripts don't take that into account (I guess because we only run these jobs in the pre-release workflow), so it exploded in my face.
- The current set of scripts isn't configured to run under the unofficial "bash strict mode" (`set -euo pipefail`), which is the recommended way to avoid tricky bugs and long debugging sessions. When I ran them in strict mode, some issues appeared, like unbound variables, that aren't straightforward to fix because they need some investigation to understand how those variables are being filled in CircleCI.
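For reference, a minimal sketch of the `apt-get`-based installation mentioned above (the exact component needed by the scripts isn't stated here, so `google-cloud-sdk-gke-gcloud-auth-plugin` is used purely as an example; adjust it to whichever component is actually required):

```yaml
# Hypothetical step: install a gcloud component from the Google Cloud DEB repository,
# since the preinstalled SDK on ubuntu-latest rejects `gcloud components install`.
- name: Install gcloud component via apt
  run: |
    # Add the Google Cloud keyring and DEB source.
    curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg \
      | sudo gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg
    echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" \
      | sudo tee /etc/apt/sources.list.d/google-cloud-sdk.list
    sudo apt-get update
    # The package name below is an example; pick the component the scripts actually need.
    sudo apt-get install -y google-cloud-sdk-gke-gcloud-auth-plugin
```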
Here is a bug in the current CircleCI setup I've just discovered. As you can see in the following screenshot, because some of the positional parameters passed in the call to the script `/script/e2e-test.sh` lack a value, inside the script the variables `DEX_IP` and `ADDITIONAL_CLUSTER_IP` take their values from the parameters `KAPP_CONTROLLER_VERSION` and `CHART_MUSEUM_VERSION`. This is currently not causing any misbehavior in our CircleCI workflow, because the multi-cluster scenario is not being tested in GKE, but it is certainly a ticking bomb waiting to explode at any moment in the future, in case we decide to also test those scenarios in GKE or any other Kubernetes flavor.
> The GHA runner (`ubuntu-latest` in our case) comes with a set of preinstalled software applications

I would assume that a container with `ubuntu-latest` comes clean, but that is far from reality -> See the list of bundled software here. There is the Google Cloud SDK, indeed.
> GKE max allowed length for the name of the clusters is 40 characters

Until now we may have run into this issue sometimes while developing, I believe. Do you think it needs to be solved now, or can it wait?
> The current set of scripts isn't configured to run under the unofficial "bash strict mode"

This is not strictly related to GHA, so maybe create a separate issue (tech debt) to be tackled at some point in the future?
> result of the fact of the lack of value for some of the positional parameters passed in the call to the script

Passing positional parameters has given us a lot of headaches in the past. I would really switch to using named environment variables. We can always check inside the script whether the variable exists. What do you think?
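A minimal sketch of that idea, with hypothetical variable sources (the real script takes different parameters):

```yaml
# Hypothetical step: pass values as named env vars instead of positional arguments.
- name: Run e2e tests
  env:
    DEX_IP: ${{ needs.setup.outputs.dex_ip }}                              # example source
    ADDITIONAL_CLUSTER_IP: ${{ needs.setup.outputs.additional_cluster_ip }} # example source
  run: |
    # Inside the script, fail fast if a required variable is missing, instead of silently
    # picking up whatever positional argument happens to sit in that position.
    : "${DEX_IP:?DEX_IP must be set}"
    : "${ADDITIONAL_CLUSTER_IP:?ADDITIONAL_CLUSTER_IP must be set}"
    ./script/e2e-test.sh
```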
Hey @castelblanque, excuse me for my late response, yesterday I was too focused on the cluster-creation issues... Answering your comments:
- Yeah, GHA's runners come with a set of preinstalled software; they notify you about that and provide a link to the details page for each runner in each job's execution logs.
- Regarding the GKE max allowed length, I've taken the opportunity to fix it in the PR I'm currently working on.
- With regards to "bash strict mode", I'm also taking the opportunity to apply it in the current PR, and I'm facing the issues that appear when you turn it on (e.g. unbound variables).
- And the same for the positional-arguments issue: I'm switching to global variables for the scripts I'm touching.
Right now I'm working on the migration of the GKE jobs to GHA. During this process, I've stumbled upon several stones along the way...
- GKE cluster creation failing in CircleCI: suddenly and surprisingly, the creation of GKE clusters started failing in the CircleCI pipeline, even though it had been working smoothly so far. The reason is that Google has introduced a bug in the latest release of google-cloud-sdk that makes it fail when providing labels (`--labels=KEY=VALUE,...`) in the call that creates the cluster. After asking the team and the SRE people, it seems that we can get rid of the label we're currently setting when creating clusters (`team=kubeapps`), at least until a new google-cloud-sdk release fixes the problem.
- Problems providing input when calling reusable workflows from GHA (see the sketch after this list):
  - You cannot use environment variables inside the `with` object used for passing input parameters to reusable workflows. That led me to create an intermediate step that uses the required env vars to provide some output variables, and to take the output variables from that job in the `with` block.
  - When you just define the output as the value of an env var (e.g. `SOME_OUTPUT_VAR: $SOME_ENV_VAR`), without setting its value from inside a job's step with something like `echo "SOME_OUTPUT_VAR=$SOME_ENV_VAR" >> $GITHUB_OUTPUT`, it turns out that it works smoothly when you consume that output from a different job in the same workflow, but you receive empty values when you consume it as the input of a reusable workflow.
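A minimal sketch of the pattern described above, with made-up input and variable names (not the actual Kubeapps configuration):

```yaml
# Hypothetical example: pass an env-var-derived value into a reusable workflow.
jobs:
  setup:
    runs-on: ubuntu-latest
    outputs:
      # Must be backed by a step output; mapping it straight to $IMG_MODIFIER
      # yields an empty value when consumed as the input of a reusable workflow.
      img_modifier: ${{ steps.vars.outputs.img_modifier }}
    steps:
      - id: vars
        env:
          IMG_MODIFIER: "-ci-gha"   # example value
        run: echo "img_modifier=$IMG_MODIFIER" >> $GITHUB_OUTPUT

  call_general_pipeline:
    needs: setup
    # The `with` block cannot read env vars directly, only expressions such as job outputs.
    uses: ./.github/workflows/kubeapps-general.yaml
    with:
      img_modifier: ${{ needs.setup.outputs.img_modifier }}   # hypothetical input name
```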
Adding new findings, for the record:
- Currently we have four different GKE-related jobs in CircleCI: `GKE_REGULAR_VERSION_MAIN`, `GKE_REGULAR_VERSION_LATEST_RELEASE`, `GKE_STABLE_VERSION_MAIN` and `GKE_STABLE_VERSION_LATEST_RELEASE`. But it turns out that the `*_LATEST_RELEASE` ones are never executed, because in the `check_conditions` step of those jobs we check the result of a `changedVersion` function that doesn't exist. So we can safely remove those jobs.
- Regarding the two remaining GKE jobs, each of them tests what we call the `GKE BRANCH`, which corresponds to a GKE release channel (rapid, regular, stable), but we're just providing the GKE/k8s numeric version (e.g. `1.22`). The problem is that we rely on that value to generate the cluster name for each job, so the names generated by both jobs are identical, which makes the cluster-creation step fail in the job that arrives later. To fix this, I've introduced a new variable called `GKE_RELEASE_CHANNEL` in the GHA workflow that will contain the release channel (`stable` or `regular`), so we don't face that problem again when the versions are the same (see the sketch after this list). In CircleCI, what's being done right now is rerunning the failing job after the other one has finished; we can follow a similar approach for a permanent fix, or just wait until we finally decommission that pipeline 🤞🏻
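A minimal sketch of how the release channel could be folded into the cluster name to keep it unique (the variable names and the exact name format are assumptions):

```yaml
# Hypothetical step: derive a unique, <=40-character cluster name per job.
- name: Create GKE cluster
  env:
    GKE_RELEASE_CHANNEL: stable      # or "regular", set per job
    GKE_VERSION: "1.22"              # example version
  run: |
    # Including the release channel avoids collisions when both jobs use the same version.
    CLUSTER_NAME="kubeapps-${GKE_RELEASE_CHANNEL}-${GKE_VERSION//./-}-${GITHUB_RUN_ID}"
    # Keep within GKE's 40-character limit for cluster names.
    CLUSTER_NAME="${CLUSTER_NAME:0:40}"
    gcloud container clusters create "$CLUSTER_NAME" \
      --release-channel="$GKE_RELEASE_CHANNEL" \
      --cluster-version="$GKE_VERSION" \
      --zone=us-east1-c
```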
GKE jobs have been migrated to GHA 🎉 We are almost there!
The `prerelease` flow has been migrated to GHA. The way we've done it is by adding a new manually-triggered workflow called `Full Integration Pipeline` that calls the general pipeline defined in `/.github/workflows/kubeapps-general.yaml`, passing an input parameter `run_gke_tests` indicating that the GKE jobs have to be run. The file `kubeapps-general.yaml` contains the definition of the whole pipeline as a reusable workflow, so we can call it with some parameters to adapt its execution to different scenarios: prerelease, release, new commit, etc.
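For illustration, a minimal sketch of what that manually-triggered caller could look like (everything except the `run_gke_tests` input and the workflow path is an assumption):

```yaml
# Hypothetical sketch of the manually-triggered "Full Integration Pipeline".
name: Full Integration Pipeline
on:
  workflow_dispatch: {}   # triggered manually from the Actions tab
jobs:
  full_pipeline:
    uses: ./.github/workflows/kubeapps-general.yaml
    secrets: inherit
    with:
      run_gke_tests: true   # run the GKE jobs as well
```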
The `release` job has also been migrated to GHA as part of the `kubeapps-general.yml` workflow, and a new workflow called `Release Pipeline` has been created. It is triggered automatically when new version tags (vX.Y.Z) are created. It also calls the `kubeapps-general.yml` workflow with the parameter `run_gke_tests` set to true, so the GKE tests are also executed in this scenario.
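And the tag-triggered counterpart, a sketch under the same assumptions:

```yaml
# Hypothetical sketch of the tag-triggered "Release Pipeline".
name: Release Pipeline
on:
  push:
    tags:
      - "v*.*.*"   # fires on new version tags (vX.Y.Z)
jobs:
  release_pipeline:
    uses: ./.github/workflows/kubeapps-general.yml
    secrets: inherit
    with:
      run_gke_tests: true
```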
After digging into the problem of workflows triggered by dependabot consistently failing, I noticed that GHA secrets are not available for the workflows triggered by dependabot, so we need to replicate those secrets for it in its own secrets configuration section.
Thanks @beni0888!
Now that the `prerelease`/`release` flow is migrated to GHA, should we update the release process documentation that currently points to CircleCI?
> Thanks @beni0888! Now that the `prerelease`/`release` flow is migrated to GHA, should we update the release process documentation that currently points to CircleCI?
Yeah, that document needs to be updated, but take into account that the GHA CI is still running in DEV_MODE, so the CI actually doing the release is still CircleCI. But yes, as soon as we migrate the `report_srp` job to GHA and make sure that the whole CI is working properly (which it seems it is), we can turn off the CircleCI pipeline and the DEV_MODE for GHA, and that document will need to be updated to reflect the new scenario. Thanks for the heads-up!
> After digging into the problem of workflows triggered by dependabot consistently failing, I noticed that GHA secrets are not available for the workflows triggered by dependabot, so we need to replicate those secrets for it in its own secrets configuration section.
It seems that the problem has been solved after adding the secrets `DOCKER_USERNAME` and `DOCKER_PASSWORD` to dependabot's secrets section.
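For context, workflows triggered by Dependabot read secrets from the repository's "Dependabot secrets" store rather than the regular Actions secrets, so any secret a step references must exist in both places. A sketch of the kind of step affected (the login action shown is an assumption about how the dev images are pushed, not necessarily the one used):

```yaml
# Hypothetical step: on Dependabot-triggered runs, these values come from the
# "Dependabot secrets" store; if the secret only exists under Actions secrets,
# the step receives empty strings and the login fails.
- name: Log in to Docker Hub
  uses: docker/login-action@v2
  with:
    username: ${{ secrets.DOCKER_USERNAME }}
    password: ${{ secrets.DOCKER_PASSWORD }}
```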
The work of unifying the several existing workflows as much as possible has been completed in this PR. We grouped all the workflows performing linter tasks into a single reusable workflow, which is included in the `kubeapps-general` workflow, so they are integrated as part of the general pipeline.
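For illustration, a minimal sketch of how such a grouped linters workflow could be wired into the general pipeline (the file and job names are assumptions):

```yaml
# Hypothetical job inside kubeapps-general.yaml that pulls in the grouped linters.
jobs:
  linters:
    # linters.yaml would itself declare `on: workflow_call` and contain the linter jobs.
    uses: ./.github/workflows/linters.yaml
    secrets: inherit
```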
The fix for the flaky test in the `local_e2e_tests` operator group has been tackled in #5613
To be able to report the SRP source provenance from GHA, we need to generate a new SRP UID and request its registration. I have filed this ticket in Jira Service Desk for that.
According to the activity in the Jira Ticket, our request has been fulfilled and the new SRP UID has been registered, so we should be able to report the source provenance from GHA.
I have confirmed with the SRP team that we are already properly reporting the source provenance from the GHA pipeline 🎉
Nice - looks like you've completed everything other than the easy parts: switching CircleCI off and updating the docs! Well done @beni0888
I've been working on fixing some minor issues that were preventing the `release` workflow from working properly. Fortunately, it seems that it finally does :tada: It's still running in `DEV_MODE` and not actually creating the release, though, so we cannot be 100% sure until we trigger a real release from the GHA pipeline.