kpt-config-sync Increase the `parallel` flag of e2e presubmit jobs

Increase the `parallel` flag from 18 to 24 to reduce the execution time of the kpt-config-sync-presubmit-e2e-multi-repo job. The job runs in the large-job-pool, where each node has 30 vCPUs.

Each vCPU corresponds to a hardware thread rather than a core, so this PR is an experiment to find out whether setting the parallel flag to 24 actually reduces the execution time.
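To make the headroom concrete, here is a rough sketch of the arithmetic behind the two settings. It assumes two hardware threads per physical core, which is not stated in this PR, and it treats each parallel test as one unit of load, which is a simplification since an e2e test may use more or less than one thread.

```go
package main

import "fmt"

const (
	// vCPUs per node in the large-job-pool, as stated in the PR description.
	vCPUs = 30
	// threadsPerCore is an ASSUMPTION (typical SMT setting), not taken from the PR.
	threadsPerCore = 2
)

// physicalCores estimates the number of physical cores behind a vCPU count,
// given the assumed threadsPerCore.
func physicalCores(v int) int {
	return v / threadsPerCore
}

// loadPerUnit returns how many parallel tests share one unit (vCPU or core).
func loadPerUnit(parallel, units int) float64 {
	return float64(parallel) / float64(units)
}

func main() {
	cores := physicalCores(vCPUs)
	for _, parallel := range []int{18, 24} {
		fmt.Printf("parallel=%d: %.2f tests per vCPU, %.2f tests per physical core\n",
			parallel, loadPerUnit(parallel, vCPUs), loadPerUnit(parallel, cores))
	}
}
```

Under that assumption both settings stay below one test per vCPU but exceed one test per physical core, which is why the effect on run time has to be measured rather than predicted.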

Sep 08 '22 00:09 haiyanmeng

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haiyanmeng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [haiyanmeng]

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

Sep 08 '22 00:09 google-oss-prow[bot]

/cc @sdowell

Sep 08 '22 00:09 haiyanmeng

SGTM, let's profile test runs on this PR. Perhaps we can create another PR with no changes as a baseline comparison?

Sep 08 '22 01:09 sdowell

/hold

Sep 08 '22 01:09 sdowell

/retest

Sep 08 '22 13:09 haiyanmeng

> Perhaps we can create another PR with no changes as a baseline comparison?

We can compare the performance based on the presubmit job history: https://oss.gprow.dev/job-history/gs/oss-prow-build-kpt-config-sync/pr-logs/directory/kpt-config-sync-presubmit-e2e-multi-repo

Sep 08 '22 13:09 haiyanmeng

/retest

Sep 08 '22 13:09 haiyanmeng

Setting the parallel flag to 21 causes lots of tests to fail. Here are the errors:

new.go:416: ERROR: waiting for ConfigSync Deployments to become available: 2 error(s)
        
        
        [1] KNV9999: deployments.apps "root-reconciler" not found
        
        For more information, see https://g.co/cloud/acm-errors#knv9999
        
        
        [2] KNV9999: failed predicate for deployment/admission-webhook in namespace config-management-system: got status Failed, want Current

        {
          "metadata": {
            "name": "admission-webhook",
            "namespace": "config-management-system",
          ...
          "status": {
            "observedGeneration": 1,
            "replicas": 2,
            "updatedReplicas": 2,
            "unavailableReplicas": 2,
            "conditions": [
              {
                "type": "Available",
                "status": "False",
                "lastUpdateTime": "2022-09-08T14:04:40Z",
                "lastTransitionTime": "2022-09-08T14:04:40Z",
                "reason": "MinimumReplicasUnavailable",
                "message": "Deployment does not have minimum availability."
              },
              {
                "type": "Progressing",
                "status": "False",
                "lastUpdateTime": "2022-09-08T14:14:41Z",
                "lastTransitionTime": "2022-09-08T14:14:41Z",
                "reason": "ProgressDeadlineExceeded",
                "message": "ReplicaSet \"admission-webhook-5c79b59f86\" has timed out progressing."
              }
            ]
          }
        }
        
        For more information, see https://g.co/cloud/acm-errors#knv9999

@sdowell , we have two options here.

1. I can set parallel to something like 20 to figure out the max parallel setting the pool currently supports.
2. We can change the node type to see how it goes. We can try c2-standard-60 or a type from the C2D machine series: https://cloud.google.com/compute/docs/compute-optimized-machines#c2d-high-mem.

I personally prefer the second option. WDYT?
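The first option amounts to a search between a setting believed to pass (18, the previous value) and one known to fail (21). A minimal sketch of that search follows. The `passes` callback is hypothetical and stands in for "run the presubmit job at this setting and check the result"; the sketch also assumes that failures are monotonic in the parallel value, which flaky tests can violate.

```go
package main

import "fmt"

// maxPassing returns the largest value in [lo, hi) for which passes is true.
// It ASSUMES passes(lo) is true, passes(hi) is false, and that once a value
// fails every larger value fails too.
func maxPassing(lo, hi int, passes func(parallel int) bool) int {
	for hi-lo > 1 {
		mid := lo + (hi-lo)/2
		if passes(mid) {
			lo = mid // mid works, so the limit is at or above mid
		} else {
			hi = mid // mid fails, so the limit is below mid
		}
	}
	return lo
}

func main() {
	// HYPOTHETICAL stand-in for a real job run: pretend the pool copes with
	// up to 20 parallel tests. The real limit is unknown.
	passes := func(parallel int) bool { return parallel <= 20 }
	fmt.Println("max passing parallel setting:", maxPassing(18, 21, passes))
}
```

With only two candidates between 18 and 21 the search takes at most two job runs, so the cost of this option is dominated by the run time of each presubmit job and by any retries needed to rule out flakes.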

Sep 08 '22 14:09 haiyanmeng

> Perhaps we can create another PR with no changes as a baseline comparison?
>
> We can compare the performance based on the presubmit job history: https://oss.gprow.dev/job-history/gs/oss-prow-build-kpt-config-sync/pr-logs/directory/kpt-config-sync-presubmit-e2e-multi-repo

@haiyanmeng The only issue I see with looking at the job history is that any given job could include changes to the job itself (more tests, functional changes, etc.). I suggested creating a PR from the same base as this one so that we can be confident we have an accurate control when comparing the difference in performance.

Sep 08 '22 16:09 sdowell

@haiyanmeng I suspect the second option is the viable one if we really want to start pushing the scale of the test parallelism. My only concern is that we are already pretty aggressive on the hardware requirements for running the presubmit tests. I'm not sure how big a concern cost is for us, but this direction does increase test infra costs. This becomes a greater concern if we want to start scaling horizontally to support more consecutive presubmit jobs.

I think it may warrant taking a step back and asking whether we want to throw more hardware at the problem

Sep 08 '22 16:09 sdowell

/retest

Sep 08 '22 20:09 haiyanmeng

/retest

Sep 09 '22 00:09 haiyanmeng

/retest

Sep 09 '22 21:09 haiyanmeng

/retest

Sep 09 '22 21:09 haiyanmeng

/retest

Sep 09 '22 22:09 haiyanmeng

/retest

Sep 09 '22 23:09 haiyanmeng

/retest

Sep 11 '22 23:09 haiyanmeng

/retest

Sep 11 '22 23:09 haiyanmeng

/retest

Sep 12 '22 01:09 haiyanmeng

/retest

Sep 12 '22 15:09 haiyanmeng

/retest

Sep 12 '22 20:09 haiyanmeng

/retest

Sep 12 '22 22:09 haiyanmeng

/retest

Sep 12 '22 23:09 haiyanmeng

/retest

Sep 13 '22 00:09 haiyanmeng

/retest

Sep 13 '22 01:09 haiyanmeng

/retest

Sep 13 '22 13:09 haiyanmeng

@haiyanmeng: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| kpt-config-sync-presubmit | 65fb86b8697946b469cb37f0ea7fd2c3caacb3ab | link | true | `/test kpt-config-sync-presubmit` |
| kpt-config-sync-presubmit-e2e-mono-repo | 65fb86b8697946b469cb37f0ea7fd2c3caacb3ab | link | true | `/test kpt-config-sync-presubmit-e2e-mono-repo` |
| kpt-config-sync-presubmit-e2e-multi-repo | 65fb86b8697946b469cb37f0ea7fd2c3caacb3ab | link | true | `/test kpt-config-sync-presubmit-e2e-multi-repo` |

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Sep 14 '22 20:09 google-oss-prow[bot]

should we just close this?

Apr 19 '23 15:04 mikebz