Migrate the robustness tests to prow

Open serathius opened this issue 1 year ago • 28 comments

What would you like to be added?

After the last robustness team meeting it was clear how much better Prow + TestGrid is than GitHub Actions.

https://testgrid.k8s.io/sig-etcd-robustness#Summary vs https://github.com/etcd-io/etcd/actions/workflows/robustness-nightly.yaml

Advantages:

  • More stable: 7% vs 56% failure rate running the same code.
  • TestGrid is a much better UI for monitoring failures, and adds more advanced features such as creating bugs and alerting.
  • Prow is a much better tool for viewing test logs: it parses logs, groups them by test, doesn't crash the browser, doesn't fail when downloading compressed logs, and doesn't create archives.

TODO:

  • [x] https://github.com/etcd-io/etcd/pull/18167
  • [x] https://github.com/kubernetes/test-infra/pull/32729
  • [x] Remove non-arm robustness tests from github actions.
  • [ ] Mark robustness presubmits as blocking, when we have confidence in their stability.
  • [x] Add arm workers to run robustness periodic

cc @jmhbnz @ivanvc

Why is this needed?

Migration to Prow opens a new chapter for the stability and debuggability of the robustness tests, with the goal of making the process more approachable for new contributors.

serathius avatar Jun 06 '24 08:06 serathius

@ArkaSaha30

henrybear327 avatar Jun 06 '24 11:06 henrybear327

Do we have access to arm nodes in the Prow infra? The last I remember is that we were waiting for them. I don't see any updates regarding this on https://github.com/kubernetes/k8s.io/issues/6102. So, it may be a blocker for the second point.

ivanvc avatar Jun 06 '24 13:06 ivanvc

Not great, but I will not block the migration regardless. Robustness tests only bring value if there is someone willing to review them. With Prow being much better, no-one will be willing to review arm robustness failures.

serathius avatar Jun 06 '24 15:06 serathius

I can see two options: pause running robustness for the ARM architecture (not ideal) or keep ARM tests running on GitHub actions.

I don't see much activity in https://github.com/kubernetes/k8s.io/issues/6102. Who or where would be a good place to ask for a status update/ETA for ARM nodegroups?

ivanvc avatar Jun 06 '24 20:06 ivanvc

Hi @upodroid - We spoke at KubeCon EU Paris about a dedicated arm64 cluster for prow. Can you please provide an update on the timeline for it being available?

jmhbnz avatar Jun 06 '24 22:06 jmhbnz

I can see two options: pause running robustness for the ARM architecture (not ideal) or keep ARM tests running on GitHub actions.

I was thinking about the second option; however, due to the sub-par user experience I expect it would be equivalent to the first one.

serathius avatar Jun 07 '24 08:06 serathius

Discussed on Slack with Arka, we'll be working on the following at the moment:

  • Migrate robustness documentation to Prow https://github.com/etcd-io/etcd/tree/main/tests/robustness
  • Add robustness tests presubmits

/assign @ArkaSaha30 @ivanvc

ivanvc avatar Jun 07 '24 18:06 ivanvc

@ivanvc: GitHub didn't allow me to assign the following users: ArkaSaha30.

Note that only etcd-io members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

Discussed on Slack with Arka, we'll be working on the following at the moment:

  • Migrate robustness documentation to Prow https://github.com/etcd-io/etcd/tree/main/tests/robustness
  • Add robustness tests presubmits

/assign @ArkaSaha30 @ivanvc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jun 07 '24 18:06 k8s-ci-robot

/assign

ArkaSaha30 avatar Jun 08 '24 03:06 ArkaSaha30

Currently, the robustness tests on GitHub Actions run only on main or PRs to main. Do we need to run them on release-3.5 and release-3.4?
The existing robustness periodic and presubmit can be configured to handle all three branches.

ArkaSaha30 avatar Jun 10 '24 14:06 ArkaSaha30

There are no robustness tests on branches other than main. We develop and run the robustness tests from the main branch and validate binaries built from older branches.

serathius avatar Jun 10 '24 14:06 serathius

We have finished the first and the third tasks. When would you think is a good time to remove the GitHub action @serathius?

We can't move forward with the second, as we don't have a timeline on when ARM runners are going to be available.

ivanvc avatar Jun 13 '24 23:06 ivanvc

We have finished the first and the third tasks. When would you think is a good time to remove the GitHub action @serathius?

We can keep arm64 on Github actions to not block on it.

serathius avatar Jun 14 '24 06:06 serathius

@ArkaSaha30, can you help with

Remove non-arm robustness tests from github actions.?

Thanks.

ivanvc avatar Jun 14 '24 17:06 ivanvc

Update: arm64 runners have been enabled in Prow (see the k8s-infra Slack discussions: 1, 2).

  • https://github.com/kubernetes/test-infra/pull/33233 has been raised to create Prow arm64 robustness jobs.
  • https://github.com/etcd-io/etcd/pull/18417 has been raised to turn off the old arm64 robustness GitHub Actions workflows.

jmhbnz avatar Aug 08 '24 02:08 jmhbnz

ci-etcd-robustness-arm64 looks broken.

serathius avatar Aug 08 '24 07:08 serathius

ci-etcd-robustness-arm64 looks broken.

Looking at most recent full run it says:

Test started today at 5:36 PM failed after 1h19m14s.

Job logs show:

 {"Time":"2024-08-08T06:47:33.907178941Z","Action":"output","Package":"go.etcd.io/etcd/tests/v3/robustness","Test":"TestRobustnessExploratory/EtcdHighTraffic/ClusterOfSize1","Output":"/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd (/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd_--version) (79484): Git SH{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:173","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Entrypoint received interrupt: terminated","severity":"error","time":"2024-08-08T06:47:36Z"}
++ early_exit_handler
++ '[' -n 17 ']'
++ kill -TERM 17
++ cleanup_dind
++ [[ false == \t\r\u\e ]]
+ EXIT_VALUE=143 

Looks like job was interrupted? Or is that expected / unrelated output?
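
A note on reading EXIT_VALUE=143 above: by shell convention, a status above 128 means the process was terminated by signal number (status - 128), so 143 corresponds to signal 15, SIGTERM. That matches the `kill -TERM 17` line in the trace. A minimal sketch of the decoding follows; `describe_exit` is a hypothetical helper written for illustration, not part of Prow or etcd:

```shell
# Decode a shell exit status. Values above 128 conventionally mean
# "terminated by signal (status - 128)".
describe_exit() {
  local status="$1"
  if [ "$status" -gt 128 ]; then
    # kill -l <number> prints the signal name for that number.
    echo "signal $(kill -l "$((status - 128))")"
  else
    echo "exit $status"
  fi
}

describe_exit 143
```

This only identifies which signal ended the wrapper script. It does not say who sent it.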

Job config is here.

Job history shows as aborted: https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-etcd-robustness-arm64

Edit: Interestingly, ci-etcd-robustness-main-arm64 was fine: https://testgrid.k8s.io/sig-etcd-robustness#ci-etcd-robustness-main-arm64. I am not sure about the difference between those two jobs.

jmhbnz avatar Aug 08 '24 07:08 jmhbnz

@jmhbnz, @serathius, are we ready to remove optional: true from the robustness presubmit jobs and mark this issue as complete?

ivanvc avatar Aug 16 '24 06:08 ivanvc

@jmhbnz, @serathius, are we ready to remove optional: true from the robustness presubmit jobs and mark this issue as complete?

We can remove optional: true from the presubmits I believe, the job seems to be behaving about the same if not better than the amd64 equivalent presubmit.

I don't think we can close this yet though, as we still have a problem with ci-etcd-robustness-arm64. Perhaps the team could take a look at it at the next robustness meeting, as I am out of my area of expertise trying to debug it.

Edit: Defer to @serathius as tech lead for robustness for final decision on optional: true.

jmhbnz avatar Aug 16 '24 08:08 jmhbnz

Edit: Defer to @serathius as tech lead for robustness for final decision on optional: true.

I think we are OK to make the presubmit job blocking.

I don't think we can close this yet though, as we still have a problem with ci-etcd-robustness-arm64. Perhaps the team could take a look at it at the next robustness meeting, as I am out of my area of expertise trying to debug it.

My high-level question: why do we have separate ci-etcd-robustness-amd64 and ci-etcd-robustness-main-amd64 jobs (mirrored for arm)?

serathius avatar Aug 16 '24 08:08 serathius

I don't think we can close this yet though, as we still have a problem with ci-etcd-robustness-arm64. Perhaps the team could take a look at it at the next robustness meeting, as I am out of my area of expertise trying to debug it.

My bad, I thought it was addressed in etcd-io/etcd#17593. I see it's a different issue.

It looks like the runs are consistently aborted at around 80 minutes. Following early_exit_handler, it seems the process is being interrupted by its parent, which is consistent with the output from the logs:

{"Time":"2024-08-16T22:50:16.205037989Z","Action":"output","Package":"go.etcd.io/etcd/tests/v3/robustness","Test":"TestRobustnessExploratory","Output":"/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd (/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd_--version) (80167): Go OS{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:173","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Entrypoint received interrupt: terminated","severity":"error","time":"2024-08-16T22:50:17Z"}

I wonder if the ARM node or pods inside the node get rotated after 80m.
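
For scale, the 1h19m14s reported for the earlier failed run can be compared with the `--timeout '200m'` that the job passes in GO_TEST_FLAGS. The abort lands well before the test timeout, which fits an external interruption rather than a `go test` timeout. A minimal sketch, assuming durations always use the h/m/s form seen in this thread with no leading zeros; `duration_to_seconds` is a hypothetical helper, not part of Prow or etcd:

```shell
# Convert a duration such as "1h19m14s" or "200m" to whole seconds.
# Assumption: components appear in h, m, s order and have no leading zeros.
duration_to_seconds() {
  local rest="$1" total=0
  case "$rest" in *h*) total=$(( total + ${rest%%h*} * 3600 )); rest="${rest#*h}" ;; esac
  case "$rest" in *m*) total=$(( total + ${rest%%m*} * 60 )); rest="${rest#*m}" ;; esac
  case "$rest" in *s*) total=$(( total + ${rest%%s*} )) ;; esac
  echo "$total"
}

aborted=$(duration_to_seconds 1h19m14s)
limit=$(duration_to_seconds 200m)
echo "aborted after $((aborted / 60)) min, go test timeout is $((limit / 60)) min"
```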

My high-level question: why do we have separate ci-etcd-robustness-amd64 and ci-etcd-robustness-main-amd64 jobs (mirrored for arm)?

I'm unsure about this one. Should we only have ci-etcd-robustness-amd64?

ivanvc avatar Aug 16 '24 23:08 ivanvc

Just giving an update that I have a thread in #sig-k8s-infra. It looks like the bug is in the infra, not the job itself.

ivanvc avatar Aug 30 '24 04:08 ivanvc

Link to kubernetes/k8s.io#7241

ivanvc avatar Aug 30 '24 18:08 ivanvc

The ARM issues are now solved. There are multiple green runs in prow (https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-etcd-robustness-arm64).

@serathius, should we delete ci-etcd-robustness-main-arm64 and only keep ci-etcd-robustness-arm64?

ivanvc avatar Sep 03 '24 23:09 ivanvc

Don't know the exact differences in the job definition but from those 4 jobs

  • ci-etcd-robustness-amd64
  • ci-etcd-robustness-arm64
  • ci-etcd-robustness-main-amd64
  • ci-etcd-robustness-main-arm64

We only need two: one for amd64 and one for arm64. As for the name, I think it would be better to follow the same convention as ci-etcd-robustness-release35-amd64 and use the branch name in the job name. So preferably we keep

  • ci-etcd-robustness-main-amd64
  • ci-etcd-robustness-main-arm64

serathius avatar Sep 04 '24 07:09 serathius

The difference between the jobs is that ci-etcd-robustness-{amd64,arm64} enables gofail (make gofail-enable) and builds the project (make build), while ci-etcd-robustness-main-{amd64,arm64} doesn't.

  • ci-etcd-robustness-arm64: https://github.com/kubernetes/test-infra/blob/cb419f072809b7554602219dadee3b0433b5682d/config/jobs/etcd/etcd-periodics.yaml#L171-L183
    result=0
    apt-get -o APT::Update::Error-Mode=any update && apt-get --yes install cmake libfuse3-dev libfuse3-3 fuse3
    sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf
    make install-lazyfs
    set -euo pipefail
    GO_TEST_FLAGS="-v --count 120 --timeout '200m' --run TestRobustnessExploratory"
    make gofail-enable
    make build
    VERBOSE=1 GOOS=linux GOARCH=arm64 CPU=8 EXPECT_DEBUG=true GO_TEST_FLAGS=${GO_TEST_FLAGS} RESULTS_DIR=/data/results make test-robustness || result=$?
    if [ -d /data/results ]; then
      zip -r ${ARTIFACTS}/results.zip /data/results
    fi
    exit $result
    
  • ci-etcd-robustness-main-arm64: https://github.com/kubernetes/test-infra/blob/cb419f072809b7554602219dadee3b0433b5682d/config/jobs/etcd/etcd-periodics.yaml#L263-L273
    result=0
    apt update && apt-get --yes install cmake libfuse3-dev libfuse3-3 fuse3
    sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf
    make install-lazyfs
    set -euo pipefail
    GO_TEST_FLAGS="-v --count 120 --timeout '200m' --run TestRobustnessExploratory"
    VERBOSE=1 GOOS=linux GOARCH=arm64 CPU=8 EXPECT_DEBUG=true GO_TEST_FLAGS=${GO_TEST_FLAGS} RESULTS_DIR=/data/results make test-robustness-main || result=$?
    if [ -d /data/results ]; then
      zip -r ${ARTIFACTS}/results.zip /data/results
    fi
    exit $result
    

Which one would we need to keep, the one with gofail enabled or the other?
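
As an aside, both scripts share the same wrapper pattern: `result=0`, then `make ... || result=$?`, then the zip step, then `exit $result`. The `||` keeps `set -e` from aborting the script when the tests fail, so the results are still archived and the job still reports the tests' status. A minimal sketch of that pattern in isolation, where `run_tests` and `run_and_archive` are hypothetical stand-ins and the archive step is replaced by an `echo`:

```shell
# Hypothetical stand-in for `make test-robustness`; returns the given status.
run_tests() {
  return "$1"
}

# Run the tests, "archive" the results directory if it exists, then
# return the tests' status rather than the archive step's status.
run_and_archive() {
  local status="$1" results_dir="$2" result=0
  run_tests "$status" || result=$?
  if [ -d "$results_dir" ]; then
    echo "archiving $results_dir"   # the real jobs run zip here
  fi
  return "$result"
}

rc=0
run_and_archive 2 / || rc=$?
echo "job status: $rc"
```

With a failing test run (status 2 here) the archive step still executes, and the final status is the one from the tests.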

ivanvc avatar Sep 04 '24 18:09 ivanvc

The GitHub workflows we used to have didn't enable gofail, nor were we building the project. We should keep ci-etcd-robustness-main-{arm64,amd64}, which are already consistent with the job naming you suggested.

ivanvc avatar Sep 06 '24 23:09 ivanvc

The GitHub workflows we used to have didn't enable gofail, nor were we building the project. We should keep ci-etcd-robustness-main-{arm64,amd64}, which are already consistent with the job naming you suggested.

Good spotting @ivanvc. That seems reasonable to me, defer to @serathius for final decision.

jmhbnz avatar Sep 07 '24 04:09 jmhbnz

The lack of building and enabling gofail in the job script is expected because of the difference between the targets: make test-robustness just runs the tests (on a locally available binary), while make test-robustness-main tests etcd from the main branch (it downloads, enables gofail and builds).

With the differences cleaned up I think we can leave ci-etcd-robustness-main-{arm64,amd64}.

serathius avatar Sep 09 '24 07:09 serathius

I believe the only outstanding task from this issue is marking the pre-submit jobs as blocking. @serathius, do you think we should do this soon, or should we leave them running for a little longer?

ivanvc avatar Sep 10 '24 17:09 ivanvc