
[env] Do not share config between tests

Open Fricounet opened this issue 1 year ago • 16 comments

What type of PR is this?

/kind feature

What this PR does / why we need it:

Following discussions in https://github.com/kubernetes-sigs/e2e-framework/pull/367

This PR fixes a bug where the config object was shared between tests. When running tests in parallel, it was impossible to rely on fields like the namespace because another test could have overwritten them. It also caused tests run with -race to fail, because parallel tests shared the same config.klient object and were updating the same fields when initializing the client.

This PR creates a new deepCopyConfig method that produces a deep copy of the config object, so each test can have its own config without impacting the others. The config now has the following lifecycle:

  • It is created when a testEnv is created
  • Each test uses a child testEnv that inherits the main testEnv's config
  • Each feature uses a child testEnv that inherits the test's testEnv's config

This way, a feature inherits all the changes made in BeforeEach functions while not impacting the other features.
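
Conceptually, the copy works like the following minimal sketch (illustrative only: the testConfig type and its fields are made up for this example and are not the PR's actual code):

package main

import "fmt"

// testConfig is a stand-in for envconf.Config; the fields are hypothetical.
type testConfig struct {
	namespace string
	labels    map[string]string
}

// deepCopy returns a fully independent copy: value fields are copied and
// reference types (here, the map) are re-allocated, so a test mutating its
// copy can never affect another test's view of the config.
func (c *testConfig) deepCopy() *testConfig {
	out := &testConfig{namespace: c.namespace}
	out.labels = make(map[string]string, len(c.labels))
	for k, v := range c.labels {
		out.labels[k] = v
	}
	return out
}

func main() {
	orig := &testConfig{namespace: "ns-a", labels: map[string]string{"team": "x"}}
	cp := orig.deepCopy()
	cp.namespace = "ns-b"   // does not touch orig
	cp.labels["team"] = "y" // does not touch orig's map either
	fmt.Println(orig.namespace, orig.labels["team"]) // prints: ns-a x
}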

Which issue(s) this PR fixes:

Fixes #352 Fixes #384

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Improve how Config is shared across tests so that it can be used without any issues in parallel tests

Additional documentation e.g., Usage docs, etc.:


Fricounet avatar Apr 08 '24 16:04 Fricounet

Hi @Fricounet. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 08 '24 16:04 k8s-ci-robot

/assign @vladimirvivien

Fricounet avatar Apr 08 '24 16:04 Fricounet

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Fricounet

Once this PR has been reviewed and has the lgtm label, please ask for approval from vladimirvivien. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Apr 12 '24 13:04 k8s-ci-robot

I've tested this with a simple test (from #352) by checking out the branch locally and adding a replace directive:

replace sigs.k8s.io/e2e-framework => ../e2e-framework

package main

import (
	"context"
	"os"
	"testing"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/e2e-framework/klient/conf"
	"sigs.k8s.io/e2e-framework/pkg/env"
	"sigs.k8s.io/e2e-framework/pkg/envconf"
	"sigs.k8s.io/e2e-framework/pkg/features"
)

var tenv env.Environment

func TestMain(m *testing.M) {
	cfg, err := envconf.NewFromFlags()
	if err != nil {
		panic(err)
	}
	cfg.WithKubeconfigFile(conf.ResolveKubeConfigFile())
	tenv = env.NewWithConfig(cfg)

	os.Exit(tenv.Run(m))
}

func Setup(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
	res := c.Client().Resources()
	podList := corev1.PodList{}
	err := res.List(ctx, &podList)
	if err != nil {
		t.Fatalf("failed listing pods: %v", err)
	}
	return ctx
}

func Assess(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
	return ctx
}

func TestX1(t *testing.T) {
	t.Parallel()
	tenv.Test(t, generateFeat())
}

func TestX2(t *testing.T) {
	t.Parallel()
	tenv.Test(t, generateFeat())
}

func generateFeat() features.Feature {
	return features.New("feature").
		WithSetup("setup", Setup).
		Assess("A", Assess).
		Feature()
}

(I needed to add additional replace directives to work around #399.)

And I still get a data race:

go test -v -race -count 1 -trimpath .
=== RUN   TestX1
=== PAUSE TestX1
=== RUN   TestX2
=== PAUSE TestX2
=== CONT  TestX2
=== CONT  TestX1
==================
WARNING: DATA RACE
Write at 0x00c00030ea80 by goroutine 26:
  sigs.k8s.io/e2e-framework/pkg/envconf.(*Config).Client()
      sigs.k8s.io/[email protected]/pkg/envconf/config.go:141 +0x9c
  sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).deepCopyConfig()
      sigs.k8s.io/[email protected]/pkg/env/env.go:601 +0x100
  sigs.k8s.io/e2e-framework/pkg/env.newChildTestEnv()
      sigs.k8s.io/[email protected]/pkg/env/env.go:130 +0xe4
  sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).processTests()
      sigs.k8s.io/[email protected]/pkg/env/env.go:260 +0x48
  sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).Test()
      sigs.k8s.io/[email protected]/pkg/env/env.go:346 +0x60
  e2e-framework-bug-352.TestX1()
      e2e-framework-bug-352/main_test.go:44 +0xb8
  testing.tRunner()
      testing/testing.go:1689 +0x180
  testing.(*T).Run.gowrap1()
      testing/testing.go:1742 +0x40

Previous write at 0x00c00030ea80 by goroutine 27:
  sigs.k8s.io/e2e-framework/pkg/envconf.(*Config).Client()
      sigs.k8s.io/[email protected]/pkg/envconf/config.go:141 +0x9c
  sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).deepCopyConfig()
      sigs.k8s.io/[email protected]/pkg/env/env.go:601 +0x100
  sigs.k8s.io/e2e-framework/pkg/env.newChildTestEnv()
      sigs.k8s.io/[email protected]/pkg/env/env.go:130 +0xe4
  sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).processTests()
      sigs.k8s.io/[email protected]/pkg/env/env.go:260 +0x48
  sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).Test()
      sigs.k8s.io/[email protected]/pkg/env/env.go:346 +0x60
  e2e-framework-bug-352.TestX2()
      e2e-framework-bug-352/main_test.go:49 +0xb8
  testing.tRunner()
      testing/testing.go:1689 +0x180
  testing.(*T).Run.gowrap1()
      testing/testing.go:1742 +0x40

Goroutine 26 (running) created at:
  testing.(*T).Run()
      testing/testing.go:1742 +0x5e4
  testing.runTests.func1()
      testing/testing.go:2161 +0x80
  testing.tRunner()
      testing/testing.go:1689 +0x180
  testing.runTests()
      testing/testing.go:2159 +0x6e0
  testing.(*M).Run()
      testing/testing.go:2027 +0xb74
  sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).Run()
      sigs.k8s.io/[email protected]/pkg/env/env.go:408 +0x4c0
  e2e-framework-bug-352.TestMain()
      e2e-framework-bug-352/main_test.go:25 +0x250
  main.main()
      _testmain.go:51 +0x294

Goroutine 27 (running) created at:
  testing.(*T).Run()
      testing/testing.go:1742 +0x5e4
  testing.runTests.func1()
      testing/testing.go:2161 +0x80
  testing.tRunner()
      testing/testing.go:1689 +0x180
  testing.runTests()
      testing/testing.go:2159 +0x6e0
  testing.(*M).Run()
      testing/testing.go:2027 +0xb74
  sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).Run()
      sigs.k8s.io/[email protected]/pkg/env/env.go:408 +0x4c0
  e2e-framework-bug-352.TestMain()
      e2e-framework-bug-352/main_test.go:25 +0x250
  main.main()
      _testmain.go:51 +0x294
==================
=== RUN   TestX2/feature
=== RUN   TestX1/feature
=== NAME  TestX2
    testing.go:1398: race detected during execution of test
=== NAME  TestX1
    testing.go:1398: race detected during execution of test
=== RUN   TestX2/feature/A
=== RUN   TestX1/feature/A
--- FAIL: TestX2 (0.04s)
    --- PASS: TestX2/feature (0.03s)
        --- PASS: TestX2/feature/A (0.00s)
--- FAIL: TestX1 (0.04s)
    --- PASS: TestX1/feature (0.03s)
        --- PASS: TestX1/feature/A (0.00s)
FAIL
FAIL    e2e-framework-bug-352   0.532s
FAIL

pmalek avatar Apr 16 '24 10:04 pmalek

@pmalek indeed 🤔 So it's the client again. This stems from the fact that the Client() function is not a proper getter like the other Config accessors but a mix of a setter and a getter. As a result, when both tests access it for the first time, it is nil and they both try to create it at the same time. I can think of a few ways of solving this:

  1. Create a new client getter function like GetClient() which would simply act as a getter for the client, and use it in deepCopyConfig instead of Client()
  2. Change Client() to be a proper getter and find a new name for the current Client(), maybe MustNewClient() since it panics in case of error?
  3. Use a lock on the client, which is what you wanted to do in https://github.com/kubernetes-sigs/e2e-framework/pull/367. This would need some additional changes to make the lock available in the env package, but I think that could work

Each has its own quirks:

  1. is the easiest to set up and doesn't break the API. The drawback is that it is not in line with the current setter/getter convention on the Config object
  2. changes the API to have saner and clearer method names, but breaks it. However, since we are still in versions v0.X.Y, I think breaking the API can be considered fine; arguably, breaking it now to get something more future-proof is OK in my opinion
  3. I don't like using locks where we can avoid them, especially since this lock would have to be exported outside of the config pkg to be accessible in env. Anybody would then be able to play around with the lock, even though it should stay an implementation detail

For now, I went with 1) because it is the easiest to do, but I'd rather do 2) if we agree it is fine to break the API.
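
To make the race concrete, here is a minimal sketch of the difference between a racy get-or-create accessor and a sync.Once-guarded one (illustrative only: lazyClient and *http.Client stand in for envconf.Config and klient.Client; this is not the framework's actual code):

package main

import (
	"net/http"
	"sync"
)

// lazyClient stands in for envconf.Config; *http.Client stands in for klient.Client.
type lazyClient struct {
	once   sync.Once
	client *http.Client
}

// racyClient mirrors the current mixed getter/setter: two parallel tests can
// both observe a nil client and both write the field, which -race flags.
func (l *lazyClient) racyClient() *http.Client {
	if l.client == nil {
		l.client = &http.Client{}
	}
	return l.client
}

// safeClient is the option-3 style fix: once.Do guarantees a single
// initialization even under concurrent callers, at the cost of carrying
// the lock around as part of the config.
func (l *lazyClient) safeClient() *http.Client {
	l.once.Do(func() { l.client = &http.Client{} })
	return l.client
}

func main() {
	l := &lazyClient{}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = l.safeClient() // swap in racyClient() and run with -race to see the report
		}()
	}
	wg.Wait()
}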

Fricounet avatar Apr 16 '24 14:04 Fricounet

I think client caching also has another issue, so while at it, might be worth addressing that too: if one gets the client, modifies the kubeconfig or any other client related parameter in the config, and then gets the client again, they would get the original client and not a new one with the new kubeconfig. We could probably remove caching altogether and solve both issues.
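
For illustration, the stale-cache behavior looks like this (a sketch assuming sigs.k8s.io/e2e-framework/pkg/envconf is imported; the kubeconfig file names are hypothetical):

func staleCacheDemo() {
	cfg := envconf.New().WithKubeconfigFile("cluster-a.kubeconfig")
	c1 := cfg.Client() // first access builds and caches a client for cluster-a
	cfg.WithKubeconfigFile("cluster-b.kubeconfig")
	c2 := cfg.Client() // returns the cached cluster-a client, not a cluster-b one
	_, _ = c1, c2
}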

phisco avatar Apr 16 '24 14:04 phisco

I think client caching also has another issue, so while at it, might be worth addressing that too: if one gets the client, modifies the kubeconfig or any other client related parameter in the config, and then gets the client again, they would get the original client and not a new one with the new kubeconfig. We could probably remove caching altogether and solve both issues.

If it's possible to go this path I'd vote for that option. 👍

pmalek avatar Apr 16 '24 16:04 pmalek

@phisco @pmalek I got rid of the caching by removing it altogether. The drawback is that the client will be recreated every time it is accessed with config.Client(). I tried to rename my GetClient function to Client in order to have a simple getter that won't recreate the client each time. However, it creates issues where the config.client object is never initialized, because callers of Client do not expect to have to initialize it themselves, and we cannot initialize it ourselves within the New... constructor functions because that happens too early, when clusters might not be available yet (see #57). My attempt is available here if you want to take a look.

In any case, removing the caching altogether seems like an OK solution if we are fine with the performance cost of recreating the client every time.
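
Conceptually, the cache-free accessor reduces to something like this (a sketch under assumptions: the config type and its kubeconfig field are made up, and error handling is left to the caller):

import "sigs.k8s.io/e2e-framework/klient"

// config is a stand-in for envconf.Config; only the kubeconfig path matters here.
type config struct {
	kubeconfig string
}

// Client rebuilds the client from the current kubeconfig on every call:
// there is no cached field to race on, and kubeconfig changes always take
// effect, at the cost of redoing the construction work on each access.
func (c *config) Client() (klient.Client, error) {
	return klient.NewWithKubeConfigFile(c.kubeconfig)
}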

Fricounet avatar Apr 17 '24 16:04 Fricounet

Drawback being that the client will be recreated every time it is accessed with config.Client()

I am personally not afraid of the performance costs of creating the client each time (it should be negligible, right?). Haven't measured this though.

pmalek avatar Apr 20 '24 12:04 pmalek

Drawback being that the client will be recreated every time it is accessed with config.Client()

I am personally not afraid of the performance costs of creating the client each time (it should be negligible, right?). Haven't measured this though.

+1, this being a testing framework, I feel performance shouldn't be the focus, but the call is the maintainers' to make, for sure.

phisco avatar Apr 20 '24 13:04 phisco

I am personally not afraid of the performance costs of creating the client each time (it should be negligible, right?). Haven't measured this though.

I personally don't think the performance overhead will be that significant if we create the client each time. But we should probably keep in mind the resource usage in terms of memory, to account for cases where one runs these tests in a sandboxed container env. We don't want them getting OOM'ed suddenly out of nowhere, right?

harshanarayana avatar Apr 21 '24 15:04 harshanarayana

@harshanarayana this PR does indeed create some memory overhead. However, I did some profiling and observed the following:

  • it only impacts parallel tests since serial tests will only create 1 Config object at a time
  • each config seems to weigh around 600kB in memory:
    • Running 1 test in parallel: (heap profile screenshot)
    • Running 10 tests in parallel: (heap profile screenshot)
    • Running 100 tests in parallel: (heap profile screenshot)
  • this has nothing to do with whether the client is cached: the overhead comes from having to create a new Config object for each test/feature (which is the core reason for this PR in the first place)

Considering this, I think the overhead is not a big issue because it takes heavy parallelization before the impact becomes noticeable. I personally don't see a situation where a system able to run 100 parallel tests would be restricted by 70MB of RAM. And considering that, without the changes brought by this PR, running parallel tests with the current framework is simply not practical because of all the race conditions on the shared config, I don't think we are losing anything here:

  • serial tests don't have a noticeable overhead
  • parallel tests can now be run without race conditions on the Config.

What do you think?

If you want to reproduce the profiling I did, I used the following code:

package main

import (
	"context"
	"flag"
	"fmt"
	"os"
	"runtime/pprof"
	"testing"
	"time"

	"sigs.k8s.io/e2e-framework/klient/conf"
	"sigs.k8s.io/e2e-framework/pkg/env"
	"sigs.k8s.io/e2e-framework/pkg/envconf"
	"sigs.k8s.io/e2e-framework/pkg/features"
)

var tenv env.Environment

var nParallelTests = flag.Int("n", 100, "Number of parallel tests to run")

func TestMain(m *testing.M) {
	cfg, err := envconf.NewFromFlags()
	if err != nil {
		panic(err)
	}
	cfg.WithKubeconfigFile(conf.ResolveKubeConfigFile())
	cfg.WithParallelTestEnabled()
	tenv = env.NewWithConfig(cfg)

	os.Exit(tenv.Run(m))
}

func Setup(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
	c.Client().Resources()
	time.Sleep(5 * time.Second)
	return ctx
}

func Assess(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
	return ctx
}

func TestX1(t *testing.T) {
	t.Parallel()
	feats := []features.Feature{}
	for i := 0; i < *nParallelTests; i++ {
		feats = append(feats, generateFeat())

	}
	tenv.TestInParallel(t, feats...)
}

func TestX2(t *testing.T) {
	t.Parallel()
	// Making sure all parallel features in TestX1 have started and created their client
	time.Sleep(2 * time.Second)
	tenv.Test(t, features.New("feature").
		Assess("A", func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
			file, err := os.Create(fmt.Sprintf("heap-%d.prof", *nParallelTests))
			if err != nil {
				t.Fatal(err)
			}
			defer file.Close()
			err = pprof.WriteHeapProfile(file)
			if err != nil {
				t.Fatal(err)
			}
			return ctx
		}).
		Feature())
}

func generateFeat() features.Feature {
	return features.New("feature").
		WithSetup("setup", Setup).
		Assess("A", Assess).
		Feature()
}

then run N=10 go test ./ -test.v -race -kubeconfig $KUBECONFIG -args -n $N && go tool pprof -web -focus env heap-$N.prof

Fricounet avatar Apr 22 '24 09:04 Fricounet

Considering this, I think the overhead is not a big issue because it needs heavy parallelization to show a big impact. I personally don't see a situation where a system able to run 100 parallel tests would be restricted by 70MB of RAM. And considering that without the changes brought by this PR, running parallel tests with the current framework is simply not practical because of all the race conditions due to the shared config, I don't think we are losing anything here:

💯

This can always be left as an exercise for those who would like to optimize it later on. This PR gives users the ability to run tests in parallel without data races; let's focus on that. If there's ever a need (I doubt it) to make this more efficient, we can work on it then.

Thanks for all the tests that you've done @Fricounet 👍

pmalek avatar Apr 22 '24 10:04 pmalek

I appreciate the thoroughness of the analysis in this PR. Thank you all.

vladimirvivien avatar Apr 24 '24 02:04 vladimirvivien

/retest

harshanarayana avatar Apr 26 '24 01:04 harshanarayana

@Fricounet: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-e2e-framework-verify | 95976f707086be0a10437c1bf1c80a97e76f7370 | link | true | /test pull-e2e-framework-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot avatar Apr 26 '24 01:04 k8s-ci-robot