Increase the `parallel` flag of e2e presubmit jobs

Open haiyanmeng opened this issue 3 years ago • 27 comments

Increase the `parallel` flag from 18 to 24 to reduce the execution time of the kpt-config-sync-presubmit-e2e-multi-repo job. The job runs in the large-job-pool, where each node has 30 vCPUs.

Each vCPU corresponds to a hardware thread rather than a core, so this PR is meant to find out whether setting the `parallel` flag to 24 actually reduces the execution time.

haiyanmeng avatar Sep 08 '22 00:09 haiyanmeng

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haiyanmeng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

google-oss-prow[bot] avatar Sep 08 '22 00:09 google-oss-prow[bot]

/cc @sdowell

haiyanmeng avatar Sep 08 '22 00:09 haiyanmeng

SGTM, let's profile test runs on this PR. Perhaps we can create another PR with no changes as a baseline comparison?

sdowell avatar Sep 08 '22 01:09 sdowell

/hold

sdowell avatar Sep 08 '22 01:09 sdowell

/retest

haiyanmeng avatar Sep 08 '22 13:09 haiyanmeng

> Perhaps we can create another PR with no changes as a baseline comparison?

We can compare the performance based on the presubmit job history: https://oss.gprow.dev/job-history/gs/oss-prow-build-kpt-config-sync/pr-logs/directory/kpt-config-sync-presubmit-e2e-multi-repo

haiyanmeng avatar Sep 08 '22 13:09 haiyanmeng

/retest

haiyanmeng avatar Sep 08 '22 13:09 haiyanmeng

Setting the parallel flag to 21 causes lots of tests to fail. Here are the errors:

new.go:416: ERROR: waiting for ConfigSync Deployments to become available: 2 error(s)
        
        
        [1] KNV9999: deployments.apps "root-reconciler" not found
        
        For more information, see https://g.co/cloud/acm-errors#knv9999
        
        
        [2] KNV9999: failed predicate for deployment/admission-webhook in namespace config-management-system: got status Failed, want Current

        {
          "metadata": {
            "name": "admission-webhook",
            "namespace": "config-management-system",
          ...
          "status": {
            "observedGeneration": 1,
            "replicas": 2,
            "updatedReplicas": 2,
            "unavailableReplicas": 2,
            "conditions": [
              {
                "type": "Available",
                "status": "False",
                "lastUpdateTime": "2022-09-08T14:04:40Z",
                "lastTransitionTime": "2022-09-08T14:04:40Z",
                "reason": "MinimumReplicasUnavailable",
                "message": "Deployment does not have minimum availability."
              },
              {
                "type": "Progressing",
                "status": "False",
                "lastUpdateTime": "2022-09-08T14:14:41Z",
                "lastTransitionTime": "2022-09-08T14:14:41Z",
                "reason": "ProgressDeadlineExceeded",
                "message": "ReplicaSet \"admission-webhook-5c79b59f86\" has timed out progressing."
              }
            ]
          }
        }
        
        For more information, see https://g.co/cloud/acm-errors#knv9999
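
The second error reports the deployment as Failed where Current was expected. As a rough illustration of how such a verdict can be read off the conditions in the JSON above, here is a simplified Python sketch. The function name `classify_deployment` and its three-way rule are assumptions made for illustration; this is not the logic the e2e harness actually runs.

```python
def classify_deployment(deployment: dict) -> str:
    """Return 'Failed', 'InProgress', or 'Current' for a Deployment object.

    Simplifying assumptions (not the harness's real rules):
    - Progressing with reason ProgressDeadlineExceeded means Failed.
    - Available with status "True" means Current.
    - Anything else is treated as InProgress.
    """
    conditions = deployment.get("status", {}).get("conditions", [])
    by_type = {c.get("type"): c for c in conditions}

    progressing = by_type.get("Progressing", {})
    if progressing.get("reason") == "ProgressDeadlineExceeded":
        return "Failed"

    available = by_type.get("Available", {})
    if available.get("status") == "True":
        return "Current"
    return "InProgress"


# Conditions copied from the admission-webhook status in the log above.
observed = {
    "status": {
        "conditions": [
            {"type": "Available", "status": "False",
             "reason": "MinimumReplicasUnavailable"},
            {"type": "Progressing", "status": "False",
             "reason": "ProgressDeadlineExceeded"},
        ]
    }
}
print(classify_deployment(observed))  # prints "Failed"
```

Under these assumptions, the ProgressDeadlineExceeded condition is what turns a slow rollout into a hard failure: the webhook pods did not become available before the progress deadline expired.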

@sdowell, we have two options here.

  1. I can set parallel to something like 20 to find the maximum parallel setting the pool currently supports.
  2. We can change the node type to see how it goes. We could try c2-standard-60 or a type from the C2D machine series: https://cloud.google.com/compute/docs/compute-optimized-machines#c2d-high-mem.

I personally prefer the second option. WDYT?
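
For the first option, the search for the largest workable value could be done by bisection between the existing setting (18) and the value that failed (21). The following Python sketch is only an illustration: `max_passing_parallel` and the `passes` callback are hypothetical names, and the approach assumes that once a value fails, every larger value fails too, which flaky tests can violate.

```python
from typing import Callable


def max_passing_parallel(known_good: int, known_bad: int,
                         passes: Callable[[int], bool]) -> int:
    """Largest value in [known_good, known_bad) for which passes() is True.

    Assumes monotonic behaviour: if a value fails, all larger values fail.
    `passes` is a hypothetical callback that would run the e2e job with the
    given parallel setting and report whether it succeeded.
    """
    if known_good >= known_bad:
        raise ValueError("known_good must be smaller than known_bad")
    lo, hi = known_good, known_bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(mid):
            lo = mid   # mid works, so the answer is mid or higher
        else:
            hi = mid   # mid fails, so the answer is below mid
    return lo
```

With 18 and 21 as the bounds, this needs at most two additional job runs (19 and 20) to settle the answer.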

haiyanmeng avatar Sep 08 '22 14:09 haiyanmeng

> > Perhaps we can create another PR with no changes as a baseline comparison?

> We can compare the performance based on the presubmit job history: https://oss.gprow.dev/job-history/gs/oss-prow-build-kpt-config-sync/pr-logs/directory/kpt-config-sync-presubmit-e2e-multi-repo

@haiyanmeng The only issue I see with looking at the job history is that any given job could include changes to the job itself (more tests, functional changes, etc.). I suggested creating a PR from the same base as this one so that we can be confident we have an accurate control when comparing the difference in performance.
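
Once a control PR and this PR have each accumulated a few runs, the comparison could be as simple as the sketch below. The function name `relative_speedup` and the choice of the median are assumptions for illustration; the durations would have to come from the actual job runs.

```python
from statistics import median


def relative_speedup(baseline_minutes, experiment_minutes):
    """Fractional reduction in median duration (positive = experiment faster).

    Both arguments are lists of job durations in minutes, collected from
    runs on the same base commit so that only the parallel flag differs.
    """
    if not baseline_minutes or not experiment_minutes:
        raise ValueError("both groups need at least one duration")
    base = median(baseline_minutes)
    if base <= 0:
        raise ValueError("baseline durations must be positive")
    return (base - median(experiment_minutes)) / base
```

The median is used here instead of the mean so that a single slow or flaky run does not dominate the result.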

sdowell avatar Sep 08 '22 16:09 sdowell

@haiyanmeng I suspect the second option is the only viable one if we really want to start pushing the scale of the test parallelism. My only concern is that we are already pretty aggressive on the hardware requirements for running the presubmit tests. I'm not sure how big a concern cost is to us, but this direction does increase test infra costs. That becomes a greater concern if we want to start scaling horizontally to support more concurrent presubmit jobs.

I think it may warrant taking a step back and asking whether we want to throw more hardware at the problem.

sdowell avatar Sep 08 '22 16:09 sdowell

/retest

haiyanmeng avatar Sep 08 '22 20:09 haiyanmeng

/retest

haiyanmeng avatar Sep 09 '22 00:09 haiyanmeng

/retest

haiyanmeng avatar Sep 09 '22 21:09 haiyanmeng

/retest

haiyanmeng avatar Sep 09 '22 21:09 haiyanmeng

/retest

haiyanmeng avatar Sep 09 '22 22:09 haiyanmeng

/retest

haiyanmeng avatar Sep 09 '22 23:09 haiyanmeng

/retest

haiyanmeng avatar Sep 11 '22 23:09 haiyanmeng

/retest

haiyanmeng avatar Sep 11 '22 23:09 haiyanmeng

/retest

haiyanmeng avatar Sep 12 '22 01:09 haiyanmeng

/retest

haiyanmeng avatar Sep 12 '22 15:09 haiyanmeng

/retest

haiyanmeng avatar Sep 12 '22 20:09 haiyanmeng

/retest

haiyanmeng avatar Sep 12 '22 22:09 haiyanmeng

/retest

haiyanmeng avatar Sep 12 '22 23:09 haiyanmeng

/retest

haiyanmeng avatar Sep 13 '22 00:09 haiyanmeng

/retest

haiyanmeng avatar Sep 13 '22 01:09 haiyanmeng

/retest

haiyanmeng avatar Sep 13 '22 13:09 haiyanmeng

@haiyanmeng: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| kpt-config-sync-presubmit | 65fb86b8697946b469cb37f0ea7fd2c3caacb3ab | link | true | /test kpt-config-sync-presubmit |
| kpt-config-sync-presubmit-e2e-mono-repo | 65fb86b8697946b469cb37f0ea7fd2c3caacb3ab | link | true | /test kpt-config-sync-presubmit-e2e-mono-repo |
| kpt-config-sync-presubmit-e2e-multi-repo | 65fb86b8697946b469cb37f0ea7fd2c3caacb3ab | link | true | /test kpt-config-sync-presubmit-e2e-multi-repo |

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

google-oss-prow[bot] avatar Sep 14 '22 20:09 google-oss-prow[bot]

should we just close this?

mikebz avatar Apr 19 '23 15:04 mikebz