Reduce the time e2e tests take
What would you like to be added?
E2e tests are taking over 40 minutes now; we should keep them under 30 minutes or even lower.
For the top test candidates to optimize use https://testgrid.k8s.io/sig-etcd-periodics#ci-etcd-e2e-amd64&graph-metrics=test-duration-minutes
Current state
- go.etcd.io/etcd/tests/v3/common.TestMemberAdd > 5 minutes
- go.etcd.io/etcd/tests/v3/common.TestMemberRemove > 2 minutes
- go.etcd.io/etcd/tests/v3/e2e.TestAuthority > 1.5 minutes
- go.etcd.io/etcd/tests/v3/e2e.TestRuntimeReconfigGrowClusterSize > 1 minute
- go.etcd.io/etcd/tests/v3/common.TestUserAdd_Simple > 1 minute
- go.etcd.io/etcd/tests/v3/common.TestLeaseGrantAndList ~ 50 seconds
- go.etcd.io/etcd/tests/v3/e2e.TestVerifyHashKVAfterCompact ~50 seconds
- go.etcd.io/etcd/tests/v3/e2e.TestMixVersionsSnapshotByMockingPartition ~ 40 seconds
- go.etcd.io/etcd/tests/v3/common.TestKVDelete ~ 40 seconds
- go.etcd.io/etcd/tests/v3/common.TestLeaseGrantTimeToLiveExpired ~ 35 seconds
- go.etcd.io/etcd/tests/v3/e2e.TestWatchDelayForManualProgressNotification ~ 35 seconds ...
Looking for someone to take a look at the top tests and propose improvements to reduce runtime
Why is this needed?
We should keep the test execution time low to ensure a good contributor experience.
/cc @ahrtr @jmhbnz @ivanvc /help
@serathius I can take a look at this
Thanks, @manthanguptaa!
/assign @manthanguptaa
Hi @manthanguptaa, I'm just checking in to see if you're still working on this.
Thanks.
Hey @ivanvc! I really appreciate you checking in, and I’m so sorry for not getting this to you on time. Things have been quite hectic on my end with work, and unfortunately, I won’t be able to take this up right now. I truly appreciate your patience and understanding. I hope to take up something to work on in the future.
Nudging @AwesomePatrol. I see that you worked on the draft PR #19035. Would you be interested in formally being the assignee of this issue?
When I started https://github.com/etcd-io/etcd/pull/19035 I hoped that parallelization would give a few easy wins. However, once it was implemented, the following became apparent:
- Many parts of the framework need to be refactored to make sure they don't conflict
- Runners need to be scaled up as currently they will OOM or cause flakes (many tests are execution-speed-dependent)
- New tests would need to conform to new practices (like allocating ports from the same pool; see the sketch below) - it is easy to miss (as it would most likely cause flakes instead of hard failures) and hard to enforce
Given all this, I think that 40m for e2e tests is not that bad. I am open to suggestions on how to tackle this problem.
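For illustration only, here is a minimal sketch of the kind of shared port pool the last bullet refers to. The names (PortPool, New, Next) and the range-based allocation are my assumptions, not part of the current etcd test framework:

```go
package portpool

import (
	"fmt"
	"sync/atomic"
)

// PortPool hands out ports from one contiguous range so that concurrently
// running test clusters never pick the same port.
type PortPool struct {
	next atomic.Int32
	max  int32
}

// New returns a pool that allocates ports in [start, end].
func New(start, end int) *PortPool {
	p := &PortPool{max: int32(end)}
	p.next.Store(int32(start))
	return p
}

// Next returns the next unused port, or an error once the range is exhausted.
func (p *PortPool) Next() (int, error) {
	port := p.next.Add(1) - 1
	if port > p.max {
		return 0, fmt.Errorf("port pool exhausted")
	}
	return int(port), nil
}
```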
I think we could reduce the number of test cases in the e2e scenarios.
https://github.com/etcd-io/etcd/blob/49f34c9751cc4e90456af612eae42d73f3709473/tests/common/e2e_test.go#L31-L75
The list of scenarios is broad because some e2e tests were meant to be exercised in all cluster setups. However, it didn't account for individual tests taking a long time. I think we can just reduce the scenarios for the membership tests, as they have a 5-second wait due to etcd requiring 5 seconds of cluster stability before allowing membership changes.
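As a purely illustrative sketch of that idea (the names below are mine, not the actual API in tests/common/e2e_test.go), the membership tests could be limited to one representative cluster setup instead of the full matrix:

```go
package membershipsketch

// allScenarios stands in for the full cluster-setup list used by the common
// framework; the real list lives in tests/common/e2e_test.go.
var allScenarios = []string{
	"NoTLS", "PeerTLS", "PeerAutoTLS", "ClientTLS", "ClientAutoTLS",
}

// membershipScenarios keeps a single representative setup for MemberAdd and
// MemberRemove, since every setup they run under pays the fixed ~5-second
// cluster-stability wait before membership changes are allowed.
func membershipScenarios() []string {
	for _, s := range allScenarios {
		if s == "PeerTLS" {
			return []string{s}
		}
	}
	return nil
}
```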
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Another solution would be to split the test scenarios. We don't need to run all possible scenarios to test MemberAdd on presubmits. Maybe we could run just the NoTLS case and delegate all the others to periodics.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Since the creation of this issue, the test duration has grown to over 50 minutes. I like the idea of @yagikota (https://github.com/etcd-io/etcd/issues/13637#issuecomment-3262005992) to use an environment variable to reduce the number of e2e test scenarios we run. This way we could reduce the duration of presubmits, which we should keep under 30 minutes.
In presubmits we could just run the NoTLS scenario for tests migrated to the common framework. NoTLS is the fastest scenario, taking almost a third of what PeerAutoTLS takes. That would allow us to reduce the runtime by a factor of 10.
Periodics stay the same; we might need to configure some alerting if the periodics start consistently failing.
@yagikota would you be interested in implementing this?
cc @ivanvc @jmhbnz @fuweid @ahrtr for opinion.
would you be interested in implementing this?
Yes. However, I'm concerned that if we run tests only for the NoTLS scenario in presubmits, we cannot catch issues in other scenarios during test implementation.
A likely situation would be:
- Contributor creates a PR with new tests
- Maintainer adds the /ok-to-test label to run the presubmit tests (only the NoTLS scenario) - all tests pass
- Maintainer merges the PR
- The postsubmit/periodic tests (all scenarios) are triggered - some tests fail
- We need a follow-up PR to fix the issue
That's why we will still run the other TLS scenarios in the integration tests.
That makes sense.
If we're able to run NoTLS and PeerAutoTLS isolated, we could run the e2e tests in parallel by using two different Prow jobs. I think splitting makes sense, and we can experiment and find alternatives once we have the split.
If we're able to run NoTLS and PeerAutoTLS isolated, we could run the e2e tests in parallel by using two different Prow jobs.
That should be a property of any test. However, I would really prefer not to implement test sharding in Go. We would add a lot of complexity just to address a synthetic problem. Trust me, we have test sharding at Google and it's horrible. It adds mental overhead to every debugging operation.
One of the reasons that e2e test duration has been growing is the fact that common tests run many more cluster setup scenarios (NoTLS, PeerTLS, PeerAutoTLS, ClientTLS, ClientAutoTLS, MinorityLastVersion, QuorumLastVersion) than normal e2e tests. While some e2e tests had only scenarios like NoTLS and ClientTLS, migrating a test to the common framework can double or triple the number of scenarios. Disconnecting cluster setups from tests is one of the major goals of the test unification effort, and I hope we can add more cluster setups in the future. However, this means that we now have an NxM test matrix, and assuming we can just keep growing the number of tests is unrealistic.
Testing all possible combinations of tests and cluster setups is just too costly, and not useful in 99% of PRs. That's why I propose to disconnect presubmit from periodic testing. Please remember the integration tests; they are the enabler here. While we might be skipping the e2e tests for ClientTLS on presubmit, we will still run exactly the same tests with ClientTLS as integration tests. Integration tests can do exactly the same thing; they just run embedded etcd instead of the etcd binary. The only difference between integration and e2e tests is how configuration is passed (embed.Config vs command-line flags).
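To make that last point concrete, here is a rough sketch (not the actual test framework wiring; the data directories and the assumption that an etcd binary is on PATH are placeholders) of the same server started once via embed.Config and once via command-line flags:

```go
// Rough sketch only: both helpers aim to start "the same" etcd, once embedded
// (integration style) and once as an external binary (e2e style).
package configsketch

import (
	"os/exec"

	"go.etcd.io/etcd/server/v3/embed"
)

// startEmbedded configures etcd through a Go struct (embed.Config),
// which is what the integration tests do.
func startEmbedded() (*embed.Etcd, error) {
	cfg := embed.NewConfig()
	cfg.Dir = "/tmp/etcd-integration-sketch" // placeholder data dir
	return embed.StartEtcd(cfg)
}

// startBinary passes the equivalent configuration as command-line flags
// to the real etcd binary, which is what the e2e tests do.
func startBinary() (*exec.Cmd, error) {
	cmd := exec.Command("etcd", "--data-dir", "/tmp/etcd-e2e-sketch") // placeholder
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}
```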
https://github.com/etcd-io/etcd/issues/18983#issuecomment-2674324679
I think we can just reduce the scenarios for the membership tests, as they have a 5-second wait due to etcd requiring 5 seconds of cluster stability before allowing membership changes.
I think this is low-hanging fruit. For example, we can just keep PeerTLS and ClientTLS in the list, and let's see how much time we can save.
cc. @serathius @ivanvc @ahrtr
Seems like no one is opposed to reducing e2e test scenarios using an environment variable.
If we use an environment variable to selectively narrow the test scenarios, we have some options:
1. E2E_TEST_MINIMAL: false (default) | true
   - false: run for all scenarios
   - true: run only for the specific scenarios
   - Pros: simple
   - Cons: only supports two modes (“minimal” vs. “full”) -> (2) is better
2. E2E_TEST_PROFILE: full (default) | minimal
   - full: run for all scenarios
   - minimal: run only for the specific scenarios
   - Pros: can easily add new profiles in the future
   - Cons: potential drift if profiles aren’t kept up to date / need to maintain definitions of each profile
3. E2E_TEST_SCENARIOS: NoTLS,PeerAutoTLS,...
   - Comma-separated list of scenario names to run
   - All scenarios by default
   - Pros: doesn’t require predefined profiles / most flexible
   - Cons: hard to maintain if scenarios are added or deleted / each job must spell out the same scenario list (easy to drift)
In my opinion, option (2) is the best choice; it improves on option (1), and option (3) feels like overkill for us. A rough sketch of how it could look follows below.
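A minimal sketch of option (2), assuming the profile is read from E2E_TEST_PROFILE; the testCase type and scenario list below are stand-ins for what tests/common/e2e_test.go defines, not the exact API:

```go
package e2eprofile

import "os"

// testCase is a stand-in for the scenario type in tests/common/e2e_test.go;
// the real struct also carries the cluster configuration.
type testCase struct {
	name string
}

// allClusterTestCases stands in for the full scenario matrix.
func allClusterTestCases() []testCase {
	return []testCase{
		{name: "NoTLS"}, {name: "PeerTLS"}, {name: "PeerAutoTLS"},
		{name: "ClientTLS"}, {name: "ClientAutoTLS"},
	}
}

// enabledTestCases selects scenarios based on the E2E_TEST_PROFILE value.
func enabledTestCases() []testCase {
	switch profile := os.Getenv("E2E_TEST_PROFILE"); profile {
	case "", "full":
		// Default: run the full scenario matrix (what periodics would keep).
		return allClusterTestCases()
	case "minimal":
		// Presubmits: keep only one fast, representative scenario.
		var minimal []testCase
		for _, tc := range allClusterTestCases() {
			if tc.name == "NoTLS" {
				minimal = append(minimal, tc)
			}
		}
		return minimal
	default:
		panic("unknown E2E_TEST_PROFILE: " + profile)
	}
}
```

Presubmit jobs would then set E2E_TEST_PROFILE=minimal, while periodic jobs leave it unset (or set it to full).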
Next, I want to discuss which scenarios should go into "minimal".
I think this is low-hanging fruit. For example, we can just keep PeerTLS and ClientTLS in the list, and let's see how much time we can save.
We can probably start implementation and then discuss it based on the result, as @ahrtr mentioned.
What do you think?
E2E_TEST_MINIMAL: false (default) | true
- false: run for all scenarios
- true: run only for the specific scenarios
- Pros: simple
- Cons: only supports two modes (“minimal” vs. “full”) -> (2) is better
Works for me.
- In the normal presubmit workflow, just run the minimal test scenarios;
- In the periodic workflow, run the full test scenarios.
@ahrtr @serathius Can I start working on this issue based on the discussion so far?
@ahrtr @serathius Can I start working on this issue based on the discussion so far?
Pls feel free to deliver a PR, and reviewers can raise comments under your PR if any. thx
/assign
Now these two PRs are merged:
- https://github.com/etcd-io/etcd/pull/20733
- https://github.com/kubernetes/test-infra/pull/35628
I hope this improvement reduces the e2e test CI duration in presubmits by around 15 minutes (from 57m to 42m).
Let's wait several days and check the result: https://testgrid.k8s.io/sig-etcd-presubmits#pull-etcd-e2e-amd64&graph-metrics=test-duration-minutes&show-stale-tests=
Looks like it worked, great work!
One follow-up: since some tests (mixed cluster) are no longer running in presubmits, only as periodics, we should make sure they are monitored. Options:
- Expand the release process documentation to check the periodics dashboard. Possibly create a dedicated dashboard for release-blocking periodic tests like K8s has. https://testgrid.k8s.io/sig-release-master-blocking#Summary
- Configure sending an email if there are consistent failures. https://github.com/kubernetes/test-infra/blob/51591233c897a872b935908858ac68c5577d10d5/config/jobs/kubernetes/sig-release/release-branch-jobs/1.31.yaml#L184-L186
@serathius
I don't understand the merit of creating a new dedicated dashboard (e.g., sig-etcd-blocking).
We can use the existing dashboard (sig-etcd-periodics) and configure notifications for all jobs in sig-etcd-periodics. 🤔
Also, if we configure email sending, do we need to create a mail group (e.g., [email protected]) instead of sending to an existing group, to avoid noise?
@ivanvc @ahrtr
Do you have any idea? ☝️
Do you have any idea? ☝️
I think we need to update the release process to check the periodic jobs; I defer to @ivanvc and @jmhbnz.
I think we need to update release process to check the periodic jobs
Are there any steps or policies to update in the release process? I'm not familiar with the etcd release process, so I’ll follow the owners’ judgment.