Reduce the time e2e tests take
What would you like to be added?
E2e tests are taking over 40 minutes now; we should keep them under 30 minutes or even lower.
For the top test candidates to optimize use https://testgrid.k8s.io/sig-etcd-periodics#ci-etcd-e2e-amd64&graph-metrics=test-duration-minutes
Current state
- go.etcd.io/etcd/tests/v3/common.TestMemberAdd > 5 minutes
- go.etcd.io/etcd/tests/v3/common.TestMemberRemove > 2 minutes
- go.etcd.io/etcd/tests/v3/e2e.TestAuthority > 1.5 minutes
- go.etcd.io/etcd/tests/v3/e2e.TestRuntimeReconfigGrowClusterSize > 1 minute
- go.etcd.io/etcd/tests/v3/common.TestUserAdd_Simple > 1 minute
- go.etcd.io/etcd/tests/v3/common.TestLeaseGrantAndList ~ 50 seconds
- go.etcd.io/etcd/tests/v3/e2e.TestVerifyHashKVAfterCompact ~50 seconds
- go.etcd.io/etcd/tests/v3/e2e.TestMixVersionsSnapshotByMockingPartition ~ 40 seconds
- go.etcd.io/etcd/tests/v3/common.TestKVDelete ~ 40 seconds
- go.etcd.io/etcd/tests/v3/common.TestLeaseGrantTimeToLiveExpired ~ 35 seconds
- go.etcd.io/etcd/tests/v3/e2e.TestWatchDelayForManualProgressNotification ~ 35 seconds ...
Looking for someone to take a look at the top tests and propose improvements to reduce runtime
Why is this needed?
We should keep the test execution time low to ensure a good contributor experience.
/cc @ahrtr @jmhbnz @ivanvc /help
@serathius I can take a look at this
Thanks, @manthanguptaa!
/assign @manthanguptaa
Hi @manthanguptaa, I'm just checking in to see if you're still working on this.
Thanks.
Hey @ivanvc! I really appreciate you checking in, and I’m so sorry for not getting this to you on time. Things have been quite hectic on my end with work, and unfortunately, I won’t be able to take this up right now. I truly appreciate your patience and understanding. I hope to take up something to work on in the future.
Nudging @AwesomePatrol. I see that you worked on the draft PR #19035. Would you be interested in formally being the assignee of this issue?
When I started https://github.com/etcd-io/etcd/pull/19035 I hoped that parallelization would give a few easy wins. However, once it was implemented, the following became apparent:
- Many parts of the framework need to be refactored to make sure they don't conflict
- Runners need to be scaled up as currently they will OOM or cause flakes (many tests are execution-speed-dependent)
- New tests would need to conform to new practices (like allocating ports from the same pool; see the sketch below) - it is easy to miss (as it would most likely cause flakes instead of hard failures) and hard to enforce
Given all this, I think that 40m for e2e tests is not that bad. I am open to suggestions on how to tackle this problem.
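For illustration only, here is a minimal sketch of the kind of shared port pool the last bullet refers to. The names (PortPool, New, Next) and the range-based allocation are my assumptions, not part of the current etcd test framework:

```go
package portpool

import (
	"fmt"
	"sync/atomic"
)

// PortPool hands out ports from one contiguous range so that concurrently
// running test clusters never pick the same port.
type PortPool struct {
	next atomic.Int32
	max  int32
}

// New returns a pool that allocates ports in [start, end].
func New(start, end int) *PortPool {
	p := &PortPool{max: int32(end)}
	p.next.Store(int32(start))
	return p
}

// Next returns the next unused port, or an error once the range is exhausted.
func (p *PortPool) Next() (int, error) {
	port := p.next.Add(1) - 1
	if port > p.max {
		return 0, fmt.Errorf("port pool exhausted")
	}
	return int(port), nil
}
```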
I think we could reduce the number of test cases in the e2e scenarios.
https://github.com/etcd-io/etcd/blob/49f34c9751cc4e90456af612eae42d73f3709473/tests/common/e2e_test.go#L31-L75
The list of scenarios is broad because some e2e tests were meant to be exercised in all cluster setups. However, it didn't account for individual tests taking a long time. I think we can just reduce the scenarios for the membership tests, as they have a 5-second wait due to etcd requiring 5 seconds of cluster stability before allowing membership changes.
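As a purely illustrative sketch of that idea (the names below are mine, not the actual API in tests/common/e2e_test.go), the membership tests could be limited to one representative cluster setup instead of the full matrix:

```go
package membershipsketch

// allScenarios stands in for the full cluster-setup list used by the common
// framework; the real list lives in tests/common/e2e_test.go.
var allScenarios = []string{
	"NoTLS", "PeerTLS", "PeerAutoTLS", "ClientTLS", "ClientAutoTLS",
}

// membershipScenarios keeps a single representative setup for MemberAdd and
// MemberRemove, since every setup they run under pays the fixed ~5-second
// cluster-stability wait before membership changes are allowed.
func membershipScenarios() []string {
	for _, s := range allScenarios {
		if s == "PeerTLS" {
			return []string{s}
		}
	}
	return nil
}
```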
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Another solution would be to split the test scenarios. We don't need to run all possible scenarios to test MemberAdd on presubmits. Maybe we could run just the NoTLS case and delegate all the others to periodics.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Since the creation of this issue, the test duration has grown to over 50 minutes. I like the idea of @yagikota (https://github.com/etcd-io/etcd/issues/13637#issuecomment-3262005992) to use an environment variable to reduce the number of e2e test scenarios we run. This way we could reduce the duration of presubmits, which we should keep under 30 minutes.
In presubmits we could just run the NoTLS scenario for tests migrated to the common framework. NoTLS is the fastest scenario, taking almost a third of what PeerAutoTLS takes. That would allow us to reduce the runtime by a factor of 10.
Periodics stay the same; we might need to configure some alerting if the periodics start consistently failing.
@yagikota would you be interested in implementing this?
cc @ivanvc @jmhbnz @fuweid @ahrtr for opinion.
would you be interested in implementing this?
Yes. However, I'm concerned that if we run tests only for the NoTLS scenario in presubmits, we cannot catch issues in other scenarios during test implementation.
A likely situation would be:
- Contributor creates a PR with new tests
- Maintainer adds the /ok-to-test label to run the presubmit tests (only the NoTLS scenario) - all tests pass
- Maintainer merges the PR
- The postsubmit/periodic tests (all scenarios) are triggered - some tests fail
- We need a follow-up PR to fix the issue
That's why we will still run the other TLS scenarios in the integration tests.
That makes sense.
If we're able to run NoTLS and PeerAutoTLS isolated, we could run the e2e tests in parallel by using two different Prow jobs. I think splitting makes sense, and we can experiment and find alternatives once we have the split.
If we're able to run NoTLS and PeerAutoTLS isolated, we could run the e2e tests in parallel by using two different Prow jobs.
That should be a property of any test. However, I would really prefer not to implement test sharding in Go. We would add a lot of complexity just to address a synthetic problem. Trust me, we have test sharding at Google and it's horrible. It adds mental overhead to every debugging operation.
One of the reasons that e2e test duration has been growing is the fact that common tests run many more cluster setup scenarios (NoTLS, PeerTLS, PeerAutoTLS, ClientTLS, ClientAutoTLS, MinorityLastVersion, QuorumLastVersion) than normal e2e tests. While some e2e tests had only scenarios like NoTLS and ClientTLS, migrating a test to the common framework can double or triple the number of scenarios. Disconnecting cluster setups from tests is one of the major goals of the test unification effort, and I hope we can add more cluster setups in the future. However, this means that we now have an NxM test matrix, and assuming we can just keep growing the number of tests is unrealistic.
Testing all possible combinations of tests and cluster setups is just too costly, and not useful in 99% of PRs. That's why I propose to disconnect presubmit from periodic testing. Please remember the integration tests; they are the enabler here. While we might be skipping the e2e tests for ClientTLS on presubmit, we will still run exactly the same tests with ClientTLS as integration tests. Integration tests can do exactly the same thing; they just run embedded etcd instead of the etcd binary. The only difference between integration and e2e tests is how configuration is passed (embed.Config vs command-line flags).
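To make that last point concrete, here is a rough sketch (not the actual test framework wiring; the data directories and the assumption that an etcd binary is on PATH are placeholders) of the same server started once via embed.Config and once via command-line flags:

```go
// Rough sketch only: both helpers aim to start "the same" etcd, once embedded
// (integration style) and once as an external binary (e2e style).
package configsketch

import (
	"os/exec"

	"go.etcd.io/etcd/server/v3/embed"
)

// startEmbedded configures etcd through a Go struct (embed.Config),
// which is what the integration tests do.
func startEmbedded() (*embed.Etcd, error) {
	cfg := embed.NewConfig()
	cfg.Dir = "/tmp/etcd-integration-sketch" // placeholder data dir
	return embed.StartEtcd(cfg)
}

// startBinary passes the equivalent configuration as command-line flags
// to the real etcd binary, which is what the e2e tests do.
func startBinary() (*exec.Cmd, error) {
	cmd := exec.Command("etcd", "--data-dir", "/tmp/etcd-e2e-sketch") // placeholder
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}
```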
https://github.com/etcd-io/etcd/issues/18983#issuecomment-2674324679
I think we can just reduce the scenarios for the membership tests, as they have a 5-second wait due to etcd requiring 5 seconds of cluster stability before allowing membership changes.
I think this is low-hanging fruit. For example, we can just keep PeerTLS and ClientTLS in the list, and let's see how much time we can save.
cc. @serathius @ivanvc @ahrtr
Seems like no one is opposed to reducing e2e test scenarios using an environment variable.
If we use an environment variable to selectively narrow the test scenarios, we have some options:
1. E2E_TEST_MINIMAL: false (default) | true
   - false: run for all scenarios
   - true: run only for the specific scenarios
   - Pros: simple
   - Cons: only supports two modes (“minimal” vs. “full”) -> (2) is better
2. E2E_TEST_PROFILE: full (default) | minimal
   - full: run for all scenarios
   - minimal: run only for the specific scenarios
   - Pros: can easily add new profiles in the future
   - Cons: potential drift if profiles aren’t kept up to date / need to maintain definitions of each profile
3. E2E_TEST_SCENARIOS: NoTLS,PeerAutoTLS,...
   - Comma-separated list of scenario names to run
   - All scenarios by default
   - Pros: doesn’t require predefined profiles / most flexible
   - Cons: hard to maintain if scenarios are added or deleted / each job must spell out the same scenario list (easy to drift)
In my opinion, option (2) is the best choice; it improves on option (1), and option (3) feels like overkill for us. A rough sketch of how it could look follows below.
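A minimal sketch of option (2), assuming the profile is read from E2E_TEST_PROFILE; the testCase type and scenario list below are stand-ins for what tests/common/e2e_test.go defines, not the exact API:

```go
package e2eprofile

import "os"

// testCase is a stand-in for the scenario type in tests/common/e2e_test.go;
// the real struct also carries the cluster configuration.
type testCase struct {
	name string
}

// allClusterTestCases stands in for the full scenario matrix.
func allClusterTestCases() []testCase {
	return []testCase{
		{name: "NoTLS"}, {name: "PeerTLS"}, {name: "PeerAutoTLS"},
		{name: "ClientTLS"}, {name: "ClientAutoTLS"},
	}
}

// enabledTestCases selects scenarios based on the E2E_TEST_PROFILE value.
func enabledTestCases() []testCase {
	switch profile := os.Getenv("E2E_TEST_PROFILE"); profile {
	case "", "full":
		// Default: run the full scenario matrix (what periodics would keep).
		return allClusterTestCases()
	case "minimal":
		// Presubmits: keep only one fast, representative scenario.
		var minimal []testCase
		for _, tc := range allClusterTestCases() {
			if tc.name == "NoTLS" {
				minimal = append(minimal, tc)
			}
		}
		return minimal
	default:
		panic("unknown E2E_TEST_PROFILE: " + profile)
	}
}
```

Presubmit jobs would then set E2E_TEST_PROFILE=minimal, while periodic jobs leave it unset (or set it to full).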
Next, I want to discuss which scenarios should go into "minimal".
I think this is low-hanging fruit. For example, we can just keep PeerTLS and ClientTLS in the list, and let's see how much time we can save.
We can probably start implementation and then discuss it based on the result, as @ahrtr mentioned.
What do you think?
E2E_TEST_MINIMAL: false (default) | true
- false: run for all scenarios
- true: run only for the specific scenarios
- Pros: simple
- Cons: only supports two modes (“minimal” vs. “full”) -> (2) is better
Works for me.
- In the normal presubmit workflow, just run the minimal test scenarios;
- In the periodic workflow, run the full test scenarios.
@ahrtr @serathius Can I start working on this issue based on the discussion so far?
@ahrtr @serathius Can I start working on this issue based on the discussion so far?
Pls feel free to deliver a PR, and reviewers can raise comments under your PR if any. thx
/assign
Now these two PRs are merged:
- https://github.com/etcd-io/etcd/pull/20733
- https://github.com/kubernetes/test-infra/pull/35628
I hope this improvement reduces the e2e test CI duration in presubmits by around 15 minutes (from 57m to 42m).
Let's wait several days and check the result: https://testgrid.k8s.io/sig-etcd-presubmits#pull-etcd-e2e-amd64&graph-metrics=test-duration-minutes&show-stale-tests=
Looks like it worked, great work!
One follow-up: since some tests (mixed cluster) are no longer running in presubmits, only as periodics, we should make sure they are monitored. Options:
- Expand the release process documentation to check the periodics dashboard. Possibly create a dedicated dashboard for release-blocking periodic tests like K8s has. https://testgrid.k8s.io/sig-release-master-blocking#Summary
- Configure sending an email if there are consistent failures. https://github.com/kubernetes/test-infra/blob/51591233c897a872b935908858ac68c5577d10d5/config/jobs/kubernetes/sig-release/release-branch-jobs/1.31.yaml#L184-L186
@serathius
I don't understand the merit of creating a new dedicated dashboard (e.g., sig-etcd-blocking).
We can use the existing dashboard (sig-etcd-periodics) and configure notifications for all jobs in sig-etcd-periodics. 🤔
Also, if we configure email sending, do we need to create a mail group (e.g., [email protected]) instead of sending to an existing group, to avoid noise?
@ivanvc @ahrtr
Do you have any idea? ☝️
Do you have any idea? ☝️
I think we need to update the release process to check the periodic jobs; I defer to @ivanvc and @jmhbnz.
I think we need to update release process to check the periodic jobs
Are there any steps or policies to update in the release process? I'm not familiar with the etcd release process, so I’ll follow the owners’ judgment.