
Reduce the time e2e tests take

Open serathius opened this issue 1 year ago • 33 comments

What would you like to be added?

E2e tests are taking over 40 minutes now; we should keep them under 30 minutes or even lower.

For the top test candidates to optimize, see https://testgrid.k8s.io/sig-etcd-periodics#ci-etcd-e2e-amd64&graph-metrics=test-duration-minutes

Current state

  • go.etcd.io/etcd/tests/v3/common.TestMemberAdd > 5 minutes
  • go.etcd.io/etcd/tests/v3/common.TestMemberRemove > 2 minutes
  • go.etcd.io/etcd/tests/v3/e2e.TestAuthority > 1.5 minutes
  • go.etcd.io/etcd/tests/v3/e2e.TestRuntimeReconfigGrowClusterSize > 1 minute
  • go.etcd.io/etcd/tests/v3/common.TestUserAdd_Simple > 1 minute
  • go.etcd.io/etcd/tests/v3/common.TestLeaseGrantAndList ~ 50 seconds
  • go.etcd.io/etcd/tests/v3/e2e.TestVerifyHashKVAfterCompact ~ 50 seconds
  • go.etcd.io/etcd/tests/v3/e2e.TestMixVersionsSnapshotByMockingPartition ~ 40 seconds
  • go.etcd.io/etcd/tests/v3/common.TestKVDelete ~ 40 seconds
  • go.etcd.io/etcd/tests/v3/common.TestLeaseGrantTimeToLiveExpired ~ 35 seconds
  • go.etcd.io/etcd/tests/v3/e2e.TestWatchDelayForManualProgressNotification ~ 35 seconds ...

Looking for someone to take a look at the top tests and propose improvements to reduce runtime


Why is this needed?

We should keep the test execution time low to ensure a good contributor experience.

serathius avatar Dec 02 '24 18:12 serathius

/cc @ahrtr @jmhbnz @ivanvc /help

serathius avatar Dec 02 '24 18:12 serathius

@serathius I can take a look at this

manthanguptaa avatar Dec 04 '24 19:12 manthanguptaa

Thanks, @manthanguptaa!

/assign @manthanguptaa

ivanvc avatar Dec 05 '24 22:12 ivanvc

Hi @manthanguptaa, I'm just checking in to see if you're still working on this.

Thanks.

ivanvc avatar Feb 18 '25 19:02 ivanvc

Hey @ivanvc! I really appreciate you checking in, and I’m so sorry for not getting this to you on time. Things have been quite hectic on my end with work, and unfortunately, I won’t be able to take this up right now. I truly appreciate your patience and understanding. I hope to take up something to work on in the future.

manthanguptaa avatar Feb 20 '25 12:02 manthanguptaa

Nudging @AwesomePatrol. I see that you worked on the draft PR #19035. Would you be interested in formally being the assignee of this issue?

ivanvc avatar Feb 21 '25 06:02 ivanvc

When I started https://github.com/etcd-io/etcd/pull/19035, I hoped that parallelization would give a few easy wins. However, once implemented, the following became apparent:

  1. Many parts of the framework need to be refactored to make sure they don't conflict
  2. Runners need to be scaled up as currently they will OOM or cause flakes (many tests are execution-speed-dependent)
  3. New tests would need to conform to new practices (like allocating ports from the same pool) - it is easy to miss (as it most likely would cause flakes instead of hard failures) and hard to enforce

Given all this, I think that 40m for e2e tests is not that bad. I am open to suggestions on how to tackle this problem.
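
To make the port-allocation concern in point 3 concrete, here is a minimal sketch of one common workaround: letting the OS hand out a free port so parallel test processes don't collide on hard-coded ones. The FreePort helper is hypothetical and is not part of the etcd test framework or the draft PR.

```go
// Hypothetical helper, not part of the etcd test framework or PR #19035.
package testutil

import (
	"fmt"
	"net"
)

// FreePort binds to port 0 so the kernel assigns an unused port, then
// releases it. There is a small race window before the caller re-binds the
// port, which is why a framework-level reserved pool (as point 3 suggests)
// is more robust than this ad-hoc approach.
func FreePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, fmt.Errorf("allocate port: %w", err)
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}
```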

AwesomePatrol avatar Feb 21 '25 11:02 AwesomePatrol

I think we could reduce the number of test cases in the e2e scenarios.

https://github.com/etcd-io/etcd/blob/49f34c9751cc4e90456af612eae42d73f3709473/tests/common/e2e_test.go#L31-L75

The list of scenarios is broad, as some e2e tests were meant to be tested in all cases. However, it didn't account for individual tests taking a long time. I think we can just reduce the scenarios for Membership tests, as they have a 5-second wait due to etcd requiring 5 seconds of cluster stability before allowing membership changes.
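
As an illustration, a minimal sketch of what a reduced scenario table for membership tests could look like next to the full matrix. The scenario type and variable names are hypothetical stand-ins, not the actual types in tests/common/e2e_test.go.

```go
// Hypothetical shapes, not the real tests/common definitions.
package common

type scenario struct {
	name      string
	peerTLS   bool
	clientTLS bool
}

// Full matrix used by most common tests.
var allScenarios = []scenario{
	{name: "NoTLS"},
	{name: "PeerTLS", peerTLS: true},
	{name: "ClientTLS", clientTLS: true},
	// ... PeerAutoTLS, ClientAutoTLS, MinorityLastVersion, QuorumLastVersion
}

// Reduced list for membership tests: every member change waits ~5s for
// cluster stability, so each extra scenario costs real wall time.
var memberScenarios = []scenario{
	{name: "PeerTLS", peerTLS: true},
	{name: "ClientTLS", clientTLS: true},
}
```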

serathius avatar Feb 21 '25 11:02 serathius

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar May 19 '25 00:05 github-actions[bot]

Another solution would be to split test scenarios. We don't need to run all possible scenarios to test MemberAdd on presubmits. Maybe we could run just one case, NoTLS, and delegate all others to periodics.

serathius avatar May 19 '25 07:05 serathius

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jul 19 '25 00:07 github-actions[bot]

Since the creation of this issue, the test duration has grown to over 50 minutes. I like the idea of @yagikota (https://github.com/etcd-io/etcd/issues/13637#issuecomment-3262005992) to use an environment variable to reduce the number of e2e test scenarios we run. This way we could reduce the duration of presubmits, which we should keep under 30 minutes.

In presubmits we could just run the NoTLS scenario for tests migrated to the common framework. NoTLS is the fastest scenario, taking almost 1/3 of the time of PeerAutoTLS. That would allow us to reduce runtime by 10 times.


Periodics stay the same; we might need to configure some alerting if the periodics start consistently failing.

@yagikota would you be interested in implementing this?

serathius avatar Sep 08 '25 08:09 serathius

cc @ivanvc @jmhbnz @fuweid @ahrtr for opinion.

serathius avatar Sep 08 '25 09:09 serathius

would you be interested in implementing this?

Yes. However, I'm concerned that if we run tests only for the NoTLS scenario in presubmits, we cannot catch issues in other scenarios during test implementation.

A likely situation would be:

  1. Contributor creates a PR with new tests
  2. Maintainer adds the /ok-to-test label to run presubmit tests (only the NoTLS scenario)
  3. All tests pass
  4. Maintainer merges the PR
  5. The postsubmit/periodic tests (all scenarios) are triggered
  6. Some tests fail
  7. We need a follow-up PR to fix the issue

yagikota avatar Sep 08 '25 14:09 yagikota

That's why we will still run the other TLS scenarios in integration tests.

serathius avatar Sep 08 '25 15:09 serathius

That makes sense.

yagikota avatar Sep 08 '25 15:09 yagikota

If we're able to run NoTLS and PeerAutoTLS isolated, we could run the e2e tests in parallel by using two different Prow jobs. I think splitting makes sense, and we can experiment and find alternatives once we have the split.

ivanvc avatar Sep 09 '25 04:09 ivanvc

If we're able to run NoTLS and PeerAutoTLS isolated, we could run the e2e tests in parallel by using two different Prow jobs.

That should be a property of any test. However, I would really prefer not to implement test sharding in Go. We would add a lot of complexity just to address a synthetic problem. Trust me, we have test sharding at Google and it's horrible; it adds mental overhead to every debugging operation.

One of the reasons that e2e test duration has been growing is the fact that common tests run many more cluster setup scenarios (NoTLS, PeerTLS, PeerAutoTLS, ClientTLS, ClientAutoTLS, MinorityLastVersion, QuorumLastVersion) than normal e2e tests. While some e2e tests had only a couple of scenarios (e.g. NoTLS and ClientTLS), migrating a test to the common framework can double or triple the number of scenarios. Disconnecting cluster setups from tests is one of the major goals of the test unification effort, and I hope we can add more cluster setups in the future. However, this means that we now have an NxM test matrix, and assuming we can just keep growing the number of tests is unrealistic.

Testing all possible combinations of tests and cluster setups is just too costly, and not useful in 99% of PRs. That's why I propose to decouple presubmit from periodic testing. Please remember the integration tests; they are the enabler here. While we might be skipping the e2e tests for ClientTLS on presubmit, we will still run the exact same tests with ClientTLS as integration tests. Integration tests can run exactly the same test, just with embedded etcd instead of the etcd binary. The only difference between integration and e2e tests is how configuration is passed (embed.Config vs command-line flags).
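
To illustrate the embed.Config vs command-line flags distinction, a rough sketch of the two ways the same setting (the data directory) gets expressed. The paths and trivial error handling are illustrative; this is not how the etcd test frameworks actually wire clusters up.

```go
package main

import (
	"errors"
	"log"
	"os/exec"
	"time"

	"go.etcd.io/etcd/server/v3/embed"
)

// startEmbedded shows the integration-test style: etcd runs in-process,
// configured through embed.Config fields.
func startEmbedded() (*embed.Etcd, error) {
	cfg := embed.NewConfig()
	cfg.Dir = "integration.etcd"
	e, err := embed.StartEtcd(cfg)
	if err != nil {
		return nil, err
	}
	select {
	case <-e.Server.ReadyNotify():
		return e, nil
	case <-time.After(60 * time.Second):
		e.Server.Stop()
		return nil, errors.New("embedded etcd took too long to start")
	}
}

// startBinary shows the e2e-test style: the etcd binary is spawned and the
// same setting is passed as a command-line flag.
func startBinary() (*exec.Cmd, error) {
	cmd := exec.Command("etcd", "--data-dir", "e2e.etcd")
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}

func main() {
	// Run one or the other; both would contend for the default client port.
	e, err := startEmbedded()
	if err != nil {
		log.Fatal(err)
	}
	defer e.Close()
	_ = startBinary // the e2e-style alternative
}
```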

serathius avatar Sep 09 '25 07:09 serathius

https://github.com/etcd-io/etcd/issues/18983#issuecomment-2674324679

I think we can just reduce the scenarios for Membership tests, as they have a 5-second wait due to etcd requiring 5 seconds of cluster stability before allowing membership changes.

I think this is low-hanging fruit. For example, we can just keep PeerTLS and ClientTLS in the list and see how much time we can save.

ahrtr avatar Sep 09 '25 08:09 ahrtr

cc. @serathius @ivanvc @ahrtr

Seems like no one is opposed to reducing e2e test scenarios using an environment variable.

If we use an environment variable to selectively narrow the test scenarios, we have some options:

  1. E2E_TEST_MINIMAL: false (default) | true
    • false: Run for all scenarios
    • true: Run only for the specific scenarios
    • Pros: Simple
    • Cons: Only supports two modes (“minimal” vs. “full”) -> (2) is better
  2. E2E_TEST_PROFILE: full (default) | minimal
    • full: Run for all scenarios
    • minimal: Run only for the specific scenarios
    • Pros: Can easily add new profiles in the future
    • Cons: Potential drift if profiles aren’t kept up-to-date / Need to maintain definitions of each profile
  3. E2E_TEST_SCENARIOS: NoTLS,PeerAutoTLS,...
    • Comma-separated list of scenario names to run
    • All scenarios by default
    • Pros: Doesn’t require predefined profiles / Most flexible
    • Cons: Hard to maintain if scenarios are added or deleted / Each job must spell out the same scenario list (easy to drift)

In my opinion, option (2) is best. Option (3) feels like overkill for us.
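
A minimal sketch of how option (2) might be wired up, just to make the proposal concrete. The profile names and the scenariosFor helper are hypothetical, not an actual etcd API or the eventual implementation.

```go
// Hypothetical sketch of option (2): select scenario subsets by profile.
package common

import "os"

var profiles = map[string][]string{
	"full":    {"NoTLS", "PeerTLS", "PeerAutoTLS", "ClientTLS", "ClientAutoTLS", "MinorityLastVersion", "QuorumLastVersion"},
	"minimal": {"NoTLS"},
}

// scenariosFor returns the scenario names to run for the configured profile,
// defaulting to the full matrix when E2E_TEST_PROFILE is unset or unknown.
func scenariosFor() []string {
	if names, ok := profiles[os.Getenv("E2E_TEST_PROFILE")]; ok {
		return names
	}
	return profiles["full"]
}
```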

Next, I want to discuss which scenarios should go into "minimal".

I think this is a low-hang fruit. For example, we can just keep PeerTLS and ClientTLS in the list, and let's see how much time we can save.

We can probably start implementation and then discuss it based on the result, as @ahrtr mentioned.

What do you think?

yagikota avatar Sep 13 '25 16:09 yagikota

E2E_TEST_MINIMAL: false (default) | true

  • false: Run for all scenarios
  • true: Run only for the specific scenarios
  • Pros: Simple
  • Cons: Only supports two modes (“minimal” vs. “full”) -> (2) is better

Works for me.

  • In normal presubmit workflow, just run minimal test scenarios;
  • In periodic workflow, run full test scenarios.

ahrtr avatar Sep 13 '25 17:09 ahrtr

@ahrtr @serathius Can I start working on this issue based on the discussion so far?

yagikota avatar Sep 19 '25 13:09 yagikota

@ahrtr @serathius Can I start working on this issue based on the discussion so far?

Please feel free to deliver a PR; reviewers can raise comments under your PR if any. Thanks.

ahrtr avatar Sep 19 '25 13:09 ahrtr

/assign

yagikota avatar Sep 19 '25 13:09 yagikota

Now these two PRs are merged:

  • https://github.com/etcd-io/etcd/pull/20733
  • https://github.com/kubernetes/test-infra/pull/35628

I hope this improvement reduces e2e test CI duration by around 15 minutes (from 57m to 42m) in presubmits.

Let's wait for several days and check the result: https://testgrid.k8s.io/sig-etcd-presubmits#pull-etcd-e2e-amd64&graph-metrics=test-duration-minutes&show-stale-tests=

yagikota avatar Oct 02 '25 15:10 yagikota

Looks like it worked, great work!


One follow-up: as some tests (mixed cluster) are no longer running in presubmits, only as periodics, we should make sure that they are monitored. Options:

  • Expand the release process documentation to include checking the periodics dashboard. Possibly create a dedicated dashboard for release-blocking periodic tests like K8s has. https://testgrid.k8s.io/sig-release-master-blocking#Summary
  • Configure sending an email if there are consistent failures. https://github.com/kubernetes/test-infra/blob/51591233c897a872b935908858ac68c5577d10d5/config/jobs/kubernetes/sig-release/release-branch-jobs/1.31.yaml#L184-L186

serathius avatar Oct 03 '25 07:10 serathius

@serathius I don't understand the merit of creating a new dedicated dashboard (e.g., sig-etcd-blocking). We can use the existing dashboard (sig-etcd-periodics) and configure notifications for all jobs in sig-etcd-periodics. 🤔

Also, if we configure email sending, do we need to create a new mail group (e.g., [email protected]) instead of sending to the existing group, to avoid noise?

yagikota avatar Oct 04 '25 03:10 yagikota

@ivanvc @ahrtr

Do you have any ideas? ☝️

yagikota avatar Oct 10 '25 13:10 yagikota

@ivanvc @ahrtr

Do you have any ideas? ☝️

I think we need to update the release process to check the periodic jobs; deferring to @ivanvc and @jmhbnz.

ahrtr avatar Oct 10 '25 18:10 ahrtr

I think we need to update the release process to check the periodic jobs

Are there any steps or policies for updating the release process? I'm not familiar with the etcd release process, so I’ll follow the owners’ judgment.

yagikota avatar Oct 11 '25 06:10 yagikota