[env] Do not share config between tests
What type of PR is this?
/kind feature
What this PR does / why we need it:
Following discussions in https://github.com/kubernetes-sigs/e2e-framework/pull/367
This PR fixes a bug where the config object was shared between tests. When running tests in parallel, it was impossible to rely on fields like the namespace because they could have been overwritten by another test. It also caused tests run with -race to fail, because parallel tests were updating the same fields of the shared config when initializing the config.klient object.
This PR creates a new deepCopyConfig method that creates a deep copy of the config object (sketched just after the list below), so that each test can have its own copy without impacting the others. The config now has the following lifecycle:
- It is created when a testEnv is created
- Each test uses a child testEnv that inherits the main testEnv's config
- Each feature uses a child testEnv that inherits the test's testEnv's config
This way, a feature inherits all the changes made in BeforeEach functions while not impacting the other features.
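To make the idea concrete, here is a minimal sketch of such a deep copy. This is not the PR's actual implementation: the Config type and its fields below are simplified stand-ins for the real envconf.Config.

```go
package envsketch

// Simplified stand-in for envconf.Config; field names are illustrative only.
type Config struct {
	namespace      string
	kubeconfigFile string
	client         any // lazily created client; per-test state
}

// deepCopy returns an independent Config that inherits the parent's
// settings but drops the cached client, so each child testEnv (one per
// test, one per feature) initializes its own client, and mutations made
// in a test's BeforeEach no longer leak into sibling tests.
func (c *Config) deepCopy() *Config {
	cp := *c        // value copy duplicates the scalar settings
	cp.client = nil // the client is recreated on first use in the child
	return &cp
}
```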
Which issue(s) this PR fixes:
Fixes #352 Fixes #384
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Improve how Config is shared across tests so that it can be used without any issues in parallel tests
Additional documentation e.g., Usage docs, etc.:
Hi @Fricounet. Thanks for your PR.
I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @vladimirvivien
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Fricounet
Once this PR has been reviewed and has the lgtm label, please ask for approval from vladimirvivien. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
I've tested this with a simple test (from #352) by checking out the branch locally and adding a replace directive:
```
replace sigs.k8s.io/e2e-framework => ../e2e-framework
```
```go
package main

import (
	"context"
	"os"
	"testing"

	corev1 "k8s.io/api/core/v1"

	"sigs.k8s.io/e2e-framework/klient/conf"
	"sigs.k8s.io/e2e-framework/pkg/env"
	"sigs.k8s.io/e2e-framework/pkg/envconf"
	"sigs.k8s.io/e2e-framework/pkg/features"
)

var tenv env.Environment

func TestMain(m *testing.M) {
	cfg, err := envconf.NewFromFlags()
	if err != nil {
		panic(err)
	}
	cfg.WithKubeconfigFile(conf.ResolveKubeConfigFile())
	tenv = env.NewWithConfig(cfg)
	os.Exit(tenv.Run(m))
}

// Setup accesses the client, which is what triggers the data race when
// two parallel tests initialize it at the same time.
func Setup(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
	res := c.Client().Resources()
	podList := corev1.PodList{}
	err := res.List(ctx, &podList)
	if err != nil {
		t.Fatalf("failed listing pods: %v", err)
	}
	return ctx
}

func Assess(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
	return ctx
}

func TestX1(t *testing.T) {
	t.Parallel()
	tenv.Test(t, generateFeat())
}

func TestX2(t *testing.T) {
	t.Parallel()
	tenv.Test(t, generateFeat())
}

func generateFeat() features.Feature {
	return features.New("feature").
		WithSetup("setup", Setup).
		Assess("A", Assess).
		Feature()
}
```
(I needed to add additional replace directives to work around #399.) And I still get a data race:
```
go test -v -race -count 1 -trimpath .
=== RUN TestX1
=== PAUSE TestX1
=== RUN TestX2
=== PAUSE TestX2
=== CONT TestX2
=== CONT TestX1
==================
WARNING: DATA RACE
Write at 0x00c00030ea80 by goroutine 26:
sigs.k8s.io/e2e-framework/pkg/envconf.(*Config).Client()
sigs.k8s.io/e2e-framework@…/pkg/envconf/config.go:141 +0x9c
sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).deepCopyConfig()
sigs.k8s.io/e2e-framework@…/pkg/env/env.go:601 +0x100
sigs.k8s.io/e2e-framework/pkg/env.newChildTestEnv()
sigs.k8s.io/e2e-framework@…/pkg/env/env.go:130 +0xe4
sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).processTests()
sigs.k8s.io/e2e-framework@…/pkg/env/env.go:260 +0x48
sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).Test()
sigs.k8s.io/e2e-framework@…/pkg/env/env.go:346 +0x60
e2e-framework-bug-352.TestX1()
e2e-framework-bug-352/main_test.go:44 +0xb8
testing.tRunner()
testing/testing.go:1689 +0x180
testing.(*T).Run.gowrap1()
testing/testing.go:1742 +0x40
Previous write at 0x00c00030ea80 by goroutine 27:
sigs.k8s.io/e2e-framework/pkg/envconf.(*Config).Client()
sigs.k8s.io/e2e-framework@…/pkg/envconf/config.go:141 +0x9c
sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).deepCopyConfig()
sigs.k8s.io/e2e-framework@…/pkg/env/env.go:601 +0x100
sigs.k8s.io/e2e-framework/pkg/env.newChildTestEnv()
sigs.k8s.io/e2e-framework@…/pkg/env/env.go:130 +0xe4
sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).processTests()
sigs.k8s.io/e2e-framework@…/pkg/env/env.go:260 +0x48
sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).Test()
sigs.k8s.io/e2e-framework@…/pkg/env/env.go:346 +0x60
e2e-framework-bug-352.TestX2()
e2e-framework-bug-352/main_test.go:49 +0xb8
testing.tRunner()
testing/testing.go:1689 +0x180
testing.(*T).Run.gowrap1()
testing/testing.go:1742 +0x40
Goroutine 26 (running) created at:
testing.(*T).Run()
testing/testing.go:1742 +0x5e4
testing.runTests.func1()
testing/testing.go:2161 +0x80
testing.tRunner()
testing/testing.go:1689 +0x180
testing.runTests()
testing/testing.go:2159 +0x6e0
testing.(*M).Run()
testing/testing.go:2027 +0xb74
sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).Run()
sigs.k8s.io/e2e-framework@…/pkg/env/env.go:408 +0x4c0
e2e-framework-bug-352.TestMain()
e2e-framework-bug-352/main_test.go:25 +0x250
main.main()
_testmain.go:51 +0x294
Goroutine 27 (running) created at:
testing.(*T).Run()
testing/testing.go:1742 +0x5e4
testing.runTests.func1()
testing/testing.go:2161 +0x80
testing.tRunner()
testing/testing.go:1689 +0x180
testing.runTests()
testing/testing.go:2159 +0x6e0
testing.(*M).Run()
testing/testing.go:2027 +0xb74
sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).Run()
sigs.k8s.io/e2e-framework@…/pkg/env/env.go:408 +0x4c0
e2e-framework-bug-352.TestMain()
e2e-framework-bug-352/main_test.go:25 +0x250
main.main()
_testmain.go:51 +0x294
==================
=== RUN TestX2/feature
=== RUN TestX1/feature
=== NAME TestX2
testing.go:1398: race detected during execution of test
=== NAME TestX1
testing.go:1398: race detected during execution of test
=== RUN TestX2/feature/A
=== RUN TestX1/feature/A
--- FAIL: TestX2 (0.04s)
--- PASS: TestX2/feature (0.03s)
--- PASS: TestX2/feature/A (0.00s)
--- FAIL: TestX1 (0.04s)
--- PASS: TestX1/feature (0.03s)
--- PASS: TestX1/feature/A (0.00s)
FAIL
FAIL e2e-framework-bug-352 0.532s
FAIL
```
@pmalek indeed :thinking:
So it's the client again. This stems from the fact that the `Client()` function is not a proper getter like the other `Config` accessors but a mix of a setter and a getter. As a result, when both tests access it for the first time, the client is nil and they both try to create it at the same time. I can think of a few ways of solving this:
1. Create a new client getter function like `GetClient()` which would simply act as a getter for the client, and use it in `deepCopyConfig` instead of `Client()`.
2. Change the `Client()` function to be a proper getter and find a new name for the current `Client()`, maybe `MustNewClient()` since it panics in case of error?
3. Use a lock on the client, which is what you wanted to do in https://github.com/kubernetes-sigs/e2e-framework/pull/367. This would need some additional changes to make the lock available in the `env` package, but I think that could work.
Each has its own quirks:
1. Is the easiest to set up and doesn't break the API. The drawback is that it is not in line with the current setter/getter convention on the `Config` object.
2. Changes the API to have saner and clearer method names, but breaks it. However, since we are still in versions v0.X.Y, I think breaking the API can be considered fine; arguably, breaking it now to get something more future-proof is OK in my opinion.
3. I don't like using locks where we can avoid them, especially since this lock would be exported outside of the `config` pkg to be accessible in `env`, so anybody would be able to play around with it while it should stay an implementation detail.
For now, I went with 1) because it is the easiest to do, but I'd rather do 2) if we agree it is fine to break the API.
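For reference, a rough sketch of what option 1 could look like. This is simplified, not the actual envconf code: the `Config` fields shown and the use of `klient.NewWithKubeConfigFile` as the constructor are assumptions for illustration.

```go
package envconf

import (
	"fmt"

	"sigs.k8s.io/e2e-framework/klient"
)

// Simplified Config; the real type carries many more settings.
type Config struct {
	kubeconfig string
	client     klient.Client
}

// Client keeps its current get-or-create behavior. The write to c.client
// below is the unsynchronized mutation that the race detector flags when
// two parallel tests call Client() on the same Config for the first time.
func (c *Config) Client() klient.Client {
	if c.client != nil {
		return c.client
	}
	client, err := klient.NewWithKubeConfigFile(c.kubeconfig)
	if err != nil {
		panic(fmt.Errorf("envconf: failed to create client: %w", err))
	}
	c.client = client
	return c.client
}

// GetClient is a pure getter: it never mutates the Config, so
// deepCopyConfig can read the field without racing with other
// goroutines. Callers must handle a nil client themselves.
func (c *Config) GetClient() klient.Client {
	return c.client
}
```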
I think client caching also has another issue, so while at it, it might be worth addressing that too: if one gets the client, modifies the kubeconfig or any other client-related parameter in the config, and then gets the client again, they would get the original client and not a new one with the new kubeconfig. We could probably remove the caching altogether and solve both issues.
> I think client caching also has another issue, so while at it, it might be worth addressing that too: if one gets the client, modifies the kubeconfig or any other client-related parameter in the config, and then gets the client again, they would get the original client and not a new one with the new kubeconfig. We could probably remove the caching altogether and solve both issues.

If it's possible to go this path I'd vote for that option. 👍
@phisco @pmalek I got rid of the client caching altogether.
Drawback being that the client will be recreated every time it is accessed with `config.Client()`. I tried to rename my `GetClient` function to `Client` in order to have a simple getter that won't recreate the client each time. However, it creates issues where the `config.client` object is never initialized, because callers of `Client` do not expect to have to initialize it themselves, and we cannot initialize it ourselves within the `New...` constructor functions because that happens too early and clusters might not be available yet (see #57). My try is available here if you want to take a look.
In any case, removing the caching altogether seems like an OK solution if we are fine with the performance cost induced by having to recreate the client every time.
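For context, this is roughly what the cache-free variant looks like, using the same simplified `Config` as the sketch above (an illustration of the approach, not the exact diff):

```go
package envconf

import (
	"fmt"

	"sigs.k8s.io/e2e-framework/klient"
)

// Simplified Config, as in the earlier sketch.
type Config struct {
	kubeconfig string
}

// Client builds a fresh client on every call: there is no write to a
// shared field, so parallel tests cannot race here, and a kubeconfig
// modified after a previous call is actually picked up on the next one.
func (c *Config) Client() klient.Client {
	client, err := klient.NewWithKubeConfigFile(c.kubeconfig)
	if err != nil {
		panic(fmt.Errorf("envconf: failed to create client from %q: %w", c.kubeconfig, err))
	}
	return client
}
```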
> Drawback being that the client will be recreated every time it is accessed with `config.Client()`

I am personally not afraid of the performance costs of creating the client each time (it should be negligible, right?). Haven't measured this though.
> > Drawback being that the client will be recreated every time it is accessed with `config.Client()`
>
> I am personally not afraid of the performance costs of creating the client each time (it should be negligible, right?). Haven't measured this though.

+1, this being a testing framework I feel that shouldn't be the focus, but the call is for the maintainers to make, for sure.
> I am personally not afraid of the performance costs of creating the client each time (it should be negligible, right?). Haven't measured this though.

I personally don't think this performance overhead will be that significant if we create the client each time. But we should probably keep in mind the resource usage in terms of memory, to account for cases where one runs these in a sandboxed container env. We don't want them getting OOM'ed suddenly out of nowhere, right?
@harshanarayana this PR does indeed create some memory overhead. However, I did some profiling and observed the following:
- it only impacts parallel tests, since serial tests will only create 1 `Config` object at a time
- each config seems to weigh around `600kB` in memory:
  - Running 1 test in parallel: (heap profile screenshot)
  - Running 10 tests in parallel: (heap profile screenshot)
  - Running 100 tests in parallel: (heap profile screenshot)
- this has nothing to do with caching the client or not. It is the fact that a new `Config` object has to be created for each test/feature that creates the overhead (which is the core reason for this PR in the first place).
Considering this, I think the overhead is not a big issue because it takes heavy parallelization to show a big impact. I personally don't see a situation where a system able to run 100 parallel tests would be restricted by 70MB of RAM (roughly 100 configs at ~600kB each). And considering that, without the changes brought by this PR, running parallel tests with the current framework is simply not practical because of all the race conditions due to the shared config, I don't think we are losing anything here:
- serial tests don't have a noticeable overhead
- parallel tests can now be run without race conditions on the `Config`
What do you think?
If you want to reproduce the profiling I did, I used the following code:
```go
package main

import (
	"context"
	"flag"
	"fmt"
	"os"
	"runtime/pprof"
	"testing"
	"time"

	"sigs.k8s.io/e2e-framework/klient/conf"
	"sigs.k8s.io/e2e-framework/pkg/env"
	"sigs.k8s.io/e2e-framework/pkg/envconf"
	"sigs.k8s.io/e2e-framework/pkg/features"
)

var tenv env.Environment
var nParallelTests = flag.Int("n", 100, "Number of parallel tests to run")

func TestMain(m *testing.M) {
	cfg, err := envconf.NewFromFlags()
	if err != nil {
		panic(err)
	}
	cfg.WithKubeconfigFile(conf.ResolveKubeConfigFile())
	cfg.WithParallelTestEnabled()
	tenv = env.NewWithConfig(cfg)
	os.Exit(tenv.Run(m))
}

// Setup creates the client and then sleeps so that all parallel
// features hold their Config (and client) in memory at the same time.
func Setup(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
	c.Client().Resources()
	time.Sleep(5 * time.Second)
	return ctx
}

func Assess(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
	return ctx
}

func TestX1(t *testing.T) {
	t.Parallel()
	feats := []features.Feature{}
	for i := 0; i < *nParallelTests; i++ {
		feats = append(feats, generateFeat())
	}
	tenv.TestInParallel(t, feats...)
}

func TestX2(t *testing.T) {
	t.Parallel()
	// Making sure all parallel features in TestX1 have started and created their client
	time.Sleep(2 * time.Second)
	tenv.Test(t, features.New("feature").
		Assess("A", func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
			file, err := os.Create(fmt.Sprintf("heap-%d.prof", *nParallelTests))
			if err != nil {
				t.Fatal(err)
			}
			defer file.Close()
			err = pprof.WriteHeapProfile(file)
			if err != nil {
				t.Fatal(err)
			}
			return ctx
		}).
		Feature())
}

func generateFeat() features.Feature {
	return features.New("feature").
		WithSetup("setup", Setup).
		Assess("A", Assess).
		Feature()
}
```
then run:

```
N=10
go test ./ -test.v -race -kubeconfig $KUBECONFIG -args -n $N && go tool pprof -web -focus env heap-$N.prof
```
> Considering this, I think the overhead is not a big issue because it takes heavy parallelization to show a big impact. I personally don't see a situation where a system able to run 100 parallel tests would be restricted by 70MB of RAM (roughly 100 configs at ~600kB each). And considering that, without the changes brought by this PR, running parallel tests with the current framework is simply not practical because of all the race conditions due to the shared config, I don't think we are losing anything here:

💯
This can always be left as an exercise for those who would like to optimize it later on. This PR gives users the possibility to run tests in parallel without data races; let's focus on that. If there's ever a need (I doubt it) to make this more efficient, we can work on it then.
Thanks for all the tests that you've done @Fricounet 👍
I appreciate the thoroughness of the analysis in this PR. Thank you all.
/retest
@Fricounet: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-e2e-framework-verify | 95976f707086be0a10437c1bf1c80a97e76f7370 | link | true | /test pull-e2e-framework-verify |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.