Migrate the robustness tests to prow
What would you like to be added?
After the last robustness team meeting, it was clear how superior Prow + TestGrid is to GitHub Actions.
https://testgrid.k8s.io/sig-etcd-robustness#Summary vs https://github.com/etcd-io/etcd/actions/workflows/robustness-nightly.yaml
Advantages:
- More stable: a 7% vs 56% failure rate running the same code.
- TestGrid is a much better UI for monitoring failures, with more advanced features on top, like bug creation and alerting.
- Prow is a much better tool for viewing test logs: it parses logs, groups them by test, doesn't crash the browser, doesn't fail when downloading compressed logs, and doesn't create archives.
TODO:
- [x] https://github.com/etcd-io/etcd/pull/18167
- [x] https://github.com/kubernetes/test-infra/pull/32729
- [x] Remove non-arm robustness tests from github actions.
- [ ] Mark robustness presubmits as blocking, when we have confidence in their stability.
- [x] Add arm workers to run robustness periodic
cc @jmhbnz @ivanvc
Why is this needed?
Migration to Prow opens a new chapter for the stability and debuggability of the robustness tests, with the goal of making the process more approachable for new contributors.
@ArkaSaha30
Do we have access to arm nodes in the Prow infra? The last I remember is that we were waiting for them. I don't see any updates regarding this on https://github.com/kubernetes/k8s.io/issues/6102. So, it may be a blocker for the second point.
Not great, but I will not block the migration regardless. Robustness tests only bring value if there is someone willing to review them. With Prow being much better, no one will be willing to review arm robustness failures left on GitHub Actions.
I can see two options: pause running robustness for the ARM architecture (not ideal) or keep ARM tests running on GitHub actions.
I don't see much activity in https://github.com/kubernetes/k8s.io/issues/6102. Who or where would be a good place to ask for a status update/ETA for ARM nodegroups?
Hi @upodroid - We spoke at KubeCon EU Paris about a dedicated arm64 cluster for prow. Can you please provide an update on the timeline for it being available?
I can see two options: pause running robustness for the ARM architecture (not ideal) or keep ARM tests running on GitHub actions.
I was thinking about the second option; however, due to the sub-par user experience, I expect it would end up being equivalent to the first one.
Discussed on Slack with Arka, we'll be working on the following at the moment:
- Migrate robustness documentation to Prow https://github.com/etcd-io/etcd/tree/main/tests/robustness
- Add robustness tests presubmits
/assign @ArkaSaha30 @ivanvc
@ivanvc: GitHub didn't allow me to assign the following users: ArkaSaha30.
Note that only etcd-io members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide
/assign
Currently, the robustness tests on GitHub Actions run only on main or on PRs to main. Do we need to run them on release-3.5 and release-3.4?
The existing robustness periodic and presubmit can be configured to handle all three branches.
There are no robustness tests on branches other than main. We develop and run the robustness tests from the main branch and validate binaries built from older branches.
We have finished the first and the third tasks. When do you think would be a good time to remove the GitHub action, @serathius?
We can't move forward with the second, as we don't have a timeline for when ARM runners will be available.
We have finished the first and the third tasks. When do you think would be a good time to remove the GitHub action, @serathius?
We can keep arm64 on GitHub Actions so we don't block on it.
@ArkaSaha30, can you help with "Remove non-arm robustness tests from GitHub Actions"? Thanks.
Update - arm64 runners were enabled in Prow (refer to the k8s-infra Slack discussions: 1, 2).
- https://github.com/kubernetes/test-infra/pull/33233 has been raised to create the Prow arm64 robustness jobs.
- https://github.com/etcd-io/etcd/pull/18417 has been raised to turn off the old arm64 robustness GitHub Actions workflows.
ci-etcd-robustness-arm64 looks broken.
ci-etcd-robustness-arm64 looks broken.
Looking at most recent full run it says:
Test started today at 5:36 PM failed after 1h19m14s.
Job logs show:
{"Time":"2024-08-08T06:47:33.907178941Z","Action":"output","Package":"go.etcd.io/etcd/tests/v3/robustness","Test":"TestRobustnessExploratory/EtcdHighTraffic/ClusterOfSize1","Output":"/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd (/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd_--version) (79484): Git SH{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:173","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Entrypoint received interrupt: terminated","severity":"error","time":"2024-08-08T06:47:36Z"}
++ early_exit_handler
++ '[' -n 17 ']'
++ kill -TERM 17
++ cleanup_dind
++ [[ false == \t\r\u\e ]]
+ EXIT_VALUE=143
Looks like the job was interrupted? Or is that expected / unrelated output?
Job history shows as aborted: https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-etcd-robustness-arm64
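A side note on reading the output above: EXIT_VALUE=143 is the standard shell encoding for a process killed by SIGTERM (128 + 15), which matches the "Entrypoint received interrupt: terminated" message, i.e. the test process was terminated rather than failing on its own. A minimal sketch to reproduce that exit code locally (nothing etcd- or Prow-specific):

```bash
# Start a long-running process, send it SIGTERM, and inspect the exit code.
sleep 100 &
pid=$!
kill -TERM "$pid"
wait "$pid"
echo $?   # prints 143 = 128 + 15 (SIGTERM)
```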
Edit: Interestingly, ci-etcd-robustness-main-arm64 was fine: https://testgrid.k8s.io/sig-etcd-robustness#ci-etcd-robustness-main-arm64. I am not too sure what the difference between those two jobs is.
@jmhbnz, @serathius, are we ready to remove optional: true from the robustness presubmit jobs and mark this issue as complete?
@jmhbnz, @serathius, are we ready to remove optional: true from the robustness presubmit jobs and mark this issue as complete?
We can remove optional: true from the presubmits, I believe; the job seems to be behaving about the same as, if not better than, the equivalent amd64 presubmit.
I don't think we can close this yet though; we still have a problem with ci-etcd-robustness-arm64. Perhaps the team could take a look at that at the next robustness meeting, as I am out of my area of expertise trying to debug it.
Edit: Defer to @serathius as tech lead for robustness for final decision on optional: true.
Edit: Defer to @serathius as tech lead for robustness for final decision on optional: true.
I think we are OK to make the presubmit job blocking.
I don't think we can close this yet though; we still have a problem with ci-etcd-robustness-arm64. Perhaps the team could take a look at that at the next robustness meeting, as I am out of my area of expertise trying to debug it.
My high-level question: why do we have separate ci-etcd-robustness-amd64 and ci-etcd-robustness-main-amd64 (mirrored for arm)?
I don't think we can close this yet though; we still have a problem with ci-etcd-robustness-arm64. Perhaps the team could take a look at that at the next robustness meeting, as I am out of my area of expertise trying to debug it.
My bad, I thought it was addressed in etcd-io/etcd#17593. I see it's a different issue.
It looks like they are consistently aborted at around 80 minutes. Following early_exit_handler, it seems like the process is being interrupted by its parent, which sounds consistent with the output from the logs:
{"Time":"2024-08-16T22:50:16.205037989Z","Action":"output","Package":"go.etcd.io/etcd/tests/v3/robustness","Test":"TestRobustnessExploratory","Output":"/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd (/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd_--version) (80167): Go OS{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:173","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Entrypoint received interrupt: terminated","severity":"error","time":"2024-08-16T22:50:17Z"}
I wonder if the ARM node or pods inside the node get rotated after 80m.
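If anyone wants to double-check the ~80 minute pattern, here is a rough sketch that pulls run durations from the job's GCS logs (assuming gsutil and jq are installed and the standard Prow started.json/finished.json layout; paths not verified beyond the bucket linked above):

```bash
# Compute how long the last few runs of the periodic lasted, using the
# started.json / finished.json markers Prow uploads next to the job logs.
bucket=gs://kubernetes-jenkins/logs/ci-etcd-robustness-arm64
for run in $(gsutil ls "$bucket/" | grep '/$' | tail -n 5); do
  start=$(gsutil cat "${run}started.json" | jq .timestamp)
  end=$(gsutil cat "${run}finished.json" | jq .timestamp)
  echo "$run lasted $(( (end - start) / 60 )) minutes"
done
```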
My high-level question: why do we have separate ci-etcd-robustness-amd64 and ci-etcd-robustness-main-amd64 (mirrored for arm)?
I'm unsure about this one. Should we only have ci-etcd-robustness-amd64?
Just giving an update that I have a thread in #sig-k8s-infra. It looks like the bug is in the infra, not the job itself.
Link to kubernetes/k8s.io#7241
The ARM issues are now solved. There are multiple green runs in prow (https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-etcd-robustness-arm64).
@serathius, should we delete ci-etcd-robustness-main-arm64 and only keep ci-etcd-robustness-arm64?
I don't know the exact differences in the job definitions, but of those four jobs:
- ci-etcd-robustness-amd64
- ci-etcd-robustness-arm64
- ci-etcd-robustness-main-amd64
- ci-etcd-robustness-main-arm64
we only need two: one for amd64 and one for arm64. As for the name, I think it would be better to follow the same convention as ci-etcd-robustness-release35-amd64 and use the branch name in the job name. So preferably we keep
- ci-etcd-robustness-main-amd64
- ci-etcd-robustness-main-arm64
The difference between the jobs is that ci-etcd-robustness-{amd64,arm64} enables gofail (make gofail-enable) and builds the project (make build), while ci-etcd-robustness-main-{amd64,arm64} doesn't.
- ci-etcd-robustness-arm64: https://github.com/kubernetes/test-infra/blob/cb419f072809b7554602219dadee3b0433b5682d/config/jobs/etcd/etcd-periodics.yaml#L171-L183

      result=0
      apt-get -o APT::Update::Error-Mode=any update && apt-get --yes install cmake libfuse3-dev libfuse3-3 fuse3
      sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf
      make install-lazyfs
      set -euo pipefail
      GO_TEST_FLAGS="-v --count 120 --timeout '200m' --run TestRobustnessExploratory"
      make gofail-enable
      make build
      VERBOSE=1 GOOS=linux GOARCH=arm64 CPU=8 EXPECT_DEBUG=true GO_TEST_FLAGS=${GO_TEST_FLAGS} RESULTS_DIR=/data/results make test-robustness || result=$?
      if [ -d /data/results ]; then
        zip -r ${ARTIFACTS}/results.zip /data/results
      fi
      exit $result

- ci-etcd-robustness-main-arm64: https://github.com/kubernetes/test-infra/blob/cb419f072809b7554602219dadee3b0433b5682d/config/jobs/etcd/etcd-periodics.yaml#L263-L273

      result=0
      apt update && apt-get --yes install cmake libfuse3-dev libfuse3-3 fuse3
      sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf
      make install-lazyfs
      set -euo pipefail
      GO_TEST_FLAGS="-v --count 120 --timeout '200m' --run TestRobustnessExploratory"
      VERBOSE=1 GOOS=linux GOARCH=arm64 CPU=8 EXPECT_DEBUG=true GO_TEST_FLAGS=${GO_TEST_FLAGS} RESULTS_DIR=/data/results make test-robustness-main || result=$?
      if [ -d /data/results ]; then
        zip -r ${ARTIFACTS}/results.zip /data/results
      fi
      exit $result
Which one would we need to keep, the one with gofail enabled or the other?
The GitHub workflows we used to have didn't enable gofail, nor were we building the project. We should keep ci-etcd-robustness-main-{arm64,amd64}, which are already consistent with the job naming you suggested.
The GitHub workflows we used to have didn't enable gofail, nor were we building the project. We should keep ci-etcd-robustness-main-{arm64,amd64}, which are already consistent with the job naming you suggested.
Good spotting @ivanvc. That seems reasonable to me, defer to @serathius for final decision.
The lack of building and enabling gofail is expected; that's the difference between the targets: make test-robustness just runs the tests (against a locally available binary), while make test-robustness-main tests etcd from the main branch (it downloads the source, enables gofail, and builds).
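In other words, a quick sketch using only the targets already visible in the job definitions above (environment variables and flags trimmed):

```bash
# ci-etcd-robustness-{amd64,arm64}: build a gofail-instrumented binary locally,
# then run the robustness suite against it.
make gofail-enable
make build
make test-robustness

# ci-etcd-robustness-main-{amd64,arm64}: the target itself fetches the main
# branch, enables gofail, and builds before running the suite.
make test-robustness-main
```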
With the differences cleared up, I think we can keep ci-etcd-robustness-main-{arm64,amd64}.
I believe the only outstanding task from this issue is marking the presubmit jobs as blocking. @serathius, do you think we should do this soon, or should we leave them running for a little longer?