Increase the `parallel` flag of e2e presubmit jobs
Increase the `parallel` flag from 18 to 24 to reduce the execution time of the kpt-config-sync-presubmit-e2e-multi-repo job, which runs in the large-job-pool, where each node has 30 vCPUs.
Each vCPU corresponds to a hardware thread rather than a physical core, so this PR tests whether setting the `parallel` flag to 24 actually reduces the execution time.
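As a rough illustration of the sizing reasoning above, here is a minimal Go sketch. The `suggestParallel` helper is hypothetical (it is not part of the presubmit job's configuration); it just encodes one plausible heuristic of leaving ~20% headroom below the vCPU count, which happens to give 24 on a 30-vCPU node.

```go
package main

import "fmt"

// suggestParallel returns a hypothetical -parallel value for a node
// with the given number of vCPUs, leaving roughly 20% headroom.
// This is an illustrative heuristic, not the job's actual policy.
func suggestParallel(vcpus int) int {
	p := vcpus * 4 / 5
	if p < 1 {
		p = 1
	}
	return p
}

func main() {
	// Each node in the large-job-pool has 30 vCPUs (hardware threads).
	fmt.Println(suggestParallel(30)) // prints 24
}
```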
[APPROVALNOTIFIER] This PR is APPROVED
This pull request has been approved by: haiyanmeng
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [haiyanmeng]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/cc @sdowell
SGTM, let's profile test runs on this PR. Perhaps we can create another PR with no changes as a baseline comparison?
/hold
/retest
> Perhaps we can create another PR with no changes as a baseline comparison?
We can compare the performance based on the presubmit job history: https://oss.gprow.dev/job-history/gs/oss-prow-build-kpt-config-sync/pr-logs/directory/kpt-config-sync-presubmit-e2e-multi-repo
/retest
Setting the `parallel` flag to 21 causes many tests to fail. Here are the errors:
```
new.go:416: ERROR: waiting for ConfigSync Deployments to become available: 2 error(s)
[1] KNV9999: deployments.apps "root-reconciler" not found
For more information, see https://g.co/cloud/acm-errors#knv9999
[2] KNV9999: failed predicate for deployment/admission-webhook in namespace config-management-system: got status Failed, want Current
{
  "metadata": {
    "name": "admission-webhook",
    "namespace": "config-management-system",
    ...
  "status": {
    "observedGeneration": 1,
    "replicas": 2,
    "updatedReplicas": 2,
    "unavailableReplicas": 2,
    "conditions": [
      {
        "type": "Available",
        "status": "False",
        "lastUpdateTime": "2022-09-08T14:04:40Z",
        "lastTransitionTime": "2022-09-08T14:04:40Z",
        "reason": "MinimumReplicasUnavailable",
        "message": "Deployment does not have minimum availability."
      },
      {
        "type": "Progressing",
        "status": "False",
        "lastUpdateTime": "2022-09-08T14:14:41Z",
        "lastTransitionTime": "2022-09-08T14:14:41Z",
        "reason": "ProgressDeadlineExceeded",
        "message": "ReplicaSet \"admission-webhook-5c79b59f86\" has timed out progressing."
      }
    ]
  }
}
For more information, see https://g.co/cloud/acm-errors#knv9999
```
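The failure above reduces to a Deployment whose `Available` condition is `False`. A minimal Go sketch of that check (the `isAvailable` helper is hypothetical; the field names follow the Kubernetes Deployment status API, and the sample JSON mirrors the dump above):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition mirrors the Deployment status condition fields that
// matter for the failure above (names follow the Kubernetes API).
type condition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
	Reason string `json:"reason"`
}

// isAvailable reports whether the conditions list contains
// Available=True. The failing admission-webhook Deployment has
// Available=False with reason MinimumReplicasUnavailable.
func isAvailable(conds []condition) bool {
	for _, c := range conds {
		if c.Type == "Available" && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	// Sample conditions taken from the error output above.
	raw := `[{"type":"Available","status":"False","reason":"MinimumReplicasUnavailable"},
	         {"type":"Progressing","status":"False","reason":"ProgressDeadlineExceeded"}]`
	var conds []condition
	if err := json.Unmarshal([]byte(raw), &conds); err != nil {
		panic(err)
	}
	fmt.Println(isAvailable(conds)) // prints false
}
```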
@sdowell , we have two options here.
- I can set `parallel` to something like `20` to figure out the max parallel setting the pool currently supports;
- we can change the node type to see how it goes. We can try `c2-standard-60` or some type from the C2D machine series: https://cloud.google.com/compute/docs/compute-optimized-machines#c2d-high-mem.
I personally prefer the second option. WDYT?
> Perhaps we can create another PR with no changes as a baseline comparison?
We can compare the performance based on the presubmit job history: https://oss.gprow.dev/job-history/gs/oss-prow-build-kpt-config-sync/pr-logs/directory/kpt-config-sync-presubmit-e2e-multi-repo
@haiyanmeng The only issue I see with looking at the job history is that any given job could include changes to the job itself (more tests, functional changes, etc.). I suggested creating a PR from the same base as this one so that we can be confident we have an accurate control when comparing the difference in performance.
@haiyanmeng I suspect the second option is the viable one if we really want to start pushing the scale of the test parallelism. My only concern is that we are already pretty aggressive about the hardware requirements to run the presubmit tests. I'm not sure how big a concern that cost is for us, but this direction does increase test-infra costs. This becomes a greater concern if we want to start scaling horizontally to support more concurrent presubmit jobs.
I think it may warrant taking a step back and asking whether we want to throw more hardware at the problem.
/retest
@haiyanmeng: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| kpt-config-sync-presubmit | 65fb86b8697946b469cb37f0ea7fd2c3caacb3ab | link | true | /test kpt-config-sync-presubmit |
| kpt-config-sync-presubmit-e2e-mono-repo | 65fb86b8697946b469cb37f0ea7fd2c3caacb3ab | link | true | /test kpt-config-sync-presubmit-e2e-mono-repo |
| kpt-config-sync-presubmit-e2e-multi-repo | 65fb86b8697946b469cb37f0ea7fd2c3caacb3ab | link | true | /test kpt-config-sync-presubmit-e2e-multi-repo |
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Should we just close this?