RFC: Ginkgo Timeout and Cleanup Proposal
I'd love some feedback on the following proposal for per-spec and per-node timeout support, as well as improved cleanup behavior, for Ginkgo 2.0.
Please take a look at the google doc and comment there:
https://docs.google.com/document/d/1pxeKk43-upOPJ6uEnNS8CTA0dRCJcxmNefwjdOzAvWI/edit#
Broader discussion can happen on this issue as well.
cc @pohly and @aojea - let's use this issue in lieu of #969. cc @johnSchnake as you'd opened a similar issue in #897,
and cc @thediveo if you've got time and interest - part of this will include a (fully backwards-compatible) evolution of Eventually and I'd love your thoughts/input.
This is now implemented on master, pretty much as documented in the google doc. The docs have been updated as well. Please take a look if you're interested. I'll plan on releasing 2.3.0 in the next few days.
The gomega work is not done yet - that will happen next.
One quick thing: I think these timed-out specs might not appear correctly in the JUnit report. I'll take a look in the morning and fix it if need be.
The JUnit reports were, in fact, not correctly capturing the timeout failure state. This is now fixed.
Oh and, one last thing, the majority of the documentation for this feature set is here: https://onsi.github.io/ginkgo/#spec-timeouts-and-interruptible-nodes
Looks like bunches of context.Background() need to be replaced in my unit tests with these passed-in contexts...
The perils of writing a testing framework 3 years before the standard library officially adopted context and then not having any time for years and years to incorporate it into the framework.
But I think I'm happy with where things are landing. Thanks to @pohly for pointing in this direction - I think things fit into canonical Go much more nicely now.
OK - quick update:
- the latest Gomega now has Eventually().WithContext() and can take a SpecContext. In addition, passing Eventually a SpecContext in this way allows it to register with Ginkgo so that Progress Reports (whether due to a timeout, or user-requested) will include the current state of Eventually's matcher.
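For illustration, a minimal sketch of that registration in use (fetchUsers is a hypothetical helper that honors its context):
It("eventually sees the user", func(ctx SpecContext) {
	// Passing the SpecContext registers this Eventually with Ginkgo, so a
	// Progress Report (timeout- or user-triggered) includes the matcher's state.
	Eventually(func(ctx context.Context) ([]string, error) {
		return fetchUsers(ctx) // hypothetical helper
	}).WithContext(ctx).Should(ContainElement("alice"))
}, SpecTimeout(time.Second))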
From the documentation:
When a SpecTimeout expires the current node is interrupted (i.e. its context is cancelled) and Ginkgo proceeds to run the clean up nodes subject to their own NodeTimeouts.
Are AfterEach nodes considered clean up nodes? Is that term defined somewhere? Even if it is, it might be clearer here to mention AfterEach and DeferCleanup nodes explicitly.
Yep AfterEach and DeferCleanup are both considered cleanup nodes. I'll update the docs.
From @pohly on k8s slack:
I "broke" one test by adding <-ctx.Done() at the end:
commit 3415e13b47bba954d26fff90a0e7cdb7c9ea1471
Author: Patrick Ohly <[email protected]>
Date: Sat Oct 8 14:20:46 2022 +0200
WIP: intentionally time out storage test
diff --git a/test/e2e/storage/testsuites/provisioning.go b/test/e2e/storage/testsuites/provisioning.go
index 14b147433ef..1cd896f0fc9 100644
--- a/test/e2e/storage/testsuites/provisioning.go
+++ b/test/e2e/storage/testsuites/provisioning.go
@@ -176,7 +176,7 @@ func (p *provisioningTestSuite) DefineTests(driver storageframework.TestDriver,
}
}
- ginkgo.It("should provision storage with mount options", func(ctx context.Context) {
+ ginkgo.It("should provision storage with mount options", ginkgo.NodeTimeout(5*time.Minute), func(ctx context.Context) {
if dInfo.SupportedMountOption == nil {
e2eskipper.Skipf("Driver %q does not define supported mount option - skipping", dInfo.Name)
}
@@ -193,6 +193,14 @@ func (p *provisioningTestSuite) DefineTests(driver storageframework.TestDriver,
SetupStorageClass(ctx, l.testCase.Client, l.testCase.Class)
l.testCase.TestDynamicProvisioning(ctx)
+
+ ginkgo.DeferCleanup(ginkgo.NodeTimeout(5*time.Second), func(ctx context.Context) {
+ ginkgo.By("Waiting for timeout during cleanup")
+ <-ctx.Done()
+ })
+
+ ginkgo.By("Waiting for timeout in test")
+ <-ctx.Done()
})
ginkgo.It("should provision storage with snapshot data source [Feature:VolumeSnapshotDataSource]", func(ctx context.Context) {
As expected, that test now times out - see "STEP: Waiting for timeout in test" in https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/112923/pull-kubernetes-e2e-gce-csi-serial/1578722704666660864/build-log.txt.
I also get the expected "STEP: Waiting for timeout during cleanup" with a 5 second delay (the default grace period?!).
However, there's no indication that the DeferCleanup node itself also timed out. I think that should be treated like the earlier timeout and trigger a progress report.
To which @onsi replied:
Currently Ginkgo only reports on the first failure for a given spec. With this more careful cleanup and timeout behavior it makes sense to track additional failures, so I’ll work on that. (This is why you have no indication that DeferCleanup timed out.) I might have Ginkgo only emit these additional failures to the console when running in verbose or very-verbose mode.
I ran a simple reproduction and saw that all the GinkgoWriter and By statements do appear in the JUnit report.
The default grace period is 30s and in the case of this spec it should not be in play, as things exit right after <-ctx.Done(), so I would expect you to see the DeferCleanup without much delay (unless there is an AfterEach that runs first).
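For reference, the grace period can be tuned globally with --grace-period or per node with the GracePeriod decorator; a minimal sketch assuming those knobs as documented:
It("keeps going briefly after cancellation", func(ctx SpecContext) {
	<-ctx.Done() // cancelled by the 1s NodeTimeout
	// Ginkgo waits up to the grace period for this node to return before
	// declaring it leaked and moving on.
	time.Sleep(500 * time.Millisecond)
}, NodeTimeout(time.Second), GracePeriod(5*time.Second))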
This is a simple reproducer that shows the GinkgoWriter output appearing in system-err. You can add this to a suite and run it locally:
var _ = It("does things", func(ctx SpecContext) {
	GinkgoWriter.Println("Thing A")
	GinkgoWriter.Println("Thing B")
	DeferCleanup(func(ctx SpecContext) {
		By("Waiting for timeout during cleanup")
		GinkgoWriter.Println("Thing C")
		<-ctx.Done()
	}, NodeTimeout(time.Second))
	By("Waiting for timeout during test")
	<-ctx.Done()
}, NodeTimeout(time.Second))
The relevant bits of the junit output from running ginkgo --junit-report=out.junit:
<testcase name="[It] does things" classname="Foo Suite" status="timeout" time="2.005228167">
<failure message="This spec timed out and reported the following failure after the timeout:

My failure..." type="timeout">/Users/onsi/code/gomega/foo/foo_suite_test.go:28
This spec timed out and reported the following failure after the timeout:

My failure...
------------------------------
This is the Progress Report generated when the timeout occurred:
 does things (Spec Runtime: 1.001s)
 /Users/onsi/code/gomega/foo/foo_suite_test.go:16
 In [It] (Node Runtime: 1.001s)
 /Users/onsi/code/gomega/foo/foo_suite_test.go:16
 At [By Step] Waiting for timeout during test (Step Runtime: 1.001s)
 /Users/onsi/code/gomega/foo/foo_suite_test.go:26

 Spec Goroutine
 goroutine 23 [chan receive]
 > github.com/onsi/gomega/foo_test.glob..func1({0x100f917c8, 0x14000093840})
 /Users/onsi/code/gomega/foo/foo_suite_test.go:27
 | 
 | 	By("Waiting for timeout during test")
 > 	<-ctx.Done()
 | 	Fail("My failure...")
 | }, NodeTimeout(time.Second))
 github.com/onsi/ginkgo/v2/internal.(*Suite).runNode.func2()
 /Users/onsi/code/ginkgo/internal/suite.go:750
 github.com/onsi/ginkgo/v2/internal.(*Suite).runNode
 /Users/onsi/code/ginkgo/internal/suite.go:738
------------------------------
</failure>
<system-out>
Report Entries:
By Step
/Users/onsi/code/gomega/foo/foo_suite_test.go:26
2022-10-08T18:19:04.98692-06:00
&{Text:Waiting for timeout during test Duration:0s}
--
By Step
/Users/onsi/code/gomega/foo/foo_suite_test.go:21
2022-10-08T18:19:05.990033-06:00
&{Text:Waiting for timeout during cleanup Duration:0s}
</system-out>
<system-err>Thing A
Thing B
STEP: Waiting for timeout during test 10/08/22 18:19:04.986
STEP: Waiting for timeout during cleanup 10/08/22 18:19:05.99
Thing C
</system-err>
</testcase>
@pohly the latest Ginkgo on master will now include additional failures during cleanup in the console and machine-readable reports. PTAL - in your example you should now see that both the It timed out and the DeferCleanup timed out.
I updated ginkgo to a87ec827fd11 in https://github.com/kubernetes/kubernetes/pull/112923. I ran a similar "broken" test locally with -focus set so that only that one spec runs.
$ _output/bin/ginkgo -v -junit-report=junit.xml -progress -focus=csi-hostpath.*Dynamic.PV.*default.*volumes.should.store.data ./test/e2e
...
Timeout
In [It] at: /nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:160
This is the Progress Report generated when the timeout occurred:
[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] volumes should store data (Spec Runtime: 3m0.018s)
/nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:160
In [It] (Node Runtime: 3m0.001s)
/nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:160
At [By Step] waiting for test timeout (Step Runtime: 2m13.887s)
/nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:199
Spec Goroutine
goroutine 1642 [chan receive, 2 minutes]
> k8s.io/kubernetes/test/e2e/storage/testsuites.(*volumesTestSuite).DefineTests.func3({0x7f64d427ba40, 0xc001f44700})
/nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:200
|
| ginkgo.By("waiting for test timeout")
> <-ctx.Done()
...
There were additional failures detected after the initial failure:
[TIMEDOUT]
Timeout
In [DeferCleanup (Each)] at: /nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:194
This is the Progress Report generated when the timeout occurred:
[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] volumes should store data (Spec Runtime: 4m5.039s)
/nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:160
In [DeferCleanup (Each)] (Node Runtime: 1m0.001s)
/nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:194
At [By Step] waiting for cleanup timeout (Step Runtime: 1m0.001s)
/nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:195
Spec Goroutine
goroutine 1943 [chan receive, 2 minutes]
> k8s.io/kubernetes/test/e2e/storage/testsuites.(*volumesTestSuite).DefineTests.func3.2({0x7f64d427ba40, 0xc00252f180})
/nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:196
| ginkgo.DeferCleanup(ginkgo.NodeTimeout(time.Minute), func(ctx context.Context) {
| ginkgo.By("waiting for cleanup timeout")
> <-ctx.Done()
...
[ReportAfterSuite] PASSED [0.045 seconds]
[ReportAfterSuite] Autogenerated ReportAfterSuite for --junit-report
autogenerated by Ginkgo
Begin Captured GinkgoWriter Output >>
[ReportAfterSuite] TOP-LEVEL
autogenerated by Ginkgo
<< End Captured GinkgoWriter Output
------------------------------
Summarizing 1 Failure:
[TIMEDOUT] [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] volumes [It] should store data
/nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:160
Ran 1 of 6597 Specs in 257.618 seconds
FAIL! -- 0 Passed | 1 Failed | 0 Pending | 6596 Skipped
--- FAIL: TestE2E (257.80s)
FAIL
I can see the system-err with proper content in this junit.xml. Reporting the timeout of the DeferCleanup also worked. I mentioned earlier that the JUnit report didn't include the GinkgoWriter output when using the Kubernetes -report-dir flag (= a custom ReportAfterSuite). That turned out to be because we filter out the output for specs if they don't have State == types.SpecStateFailed. After adding types.SpecStateTimedout it worked there as well.
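Roughly, the filter change amounts to something like this (a sketch, not the exact Kubernetes code):
// Keep captured GinkgoWriter output only for specs that failed, timed out, or panicked.
for i := range report.SpecReports {
	spec := &report.SpecReports[i]
	switch spec.State {
	case types.SpecStateFailed, types.SpecStateTimedout, types.SpecStatePanicked:
		// keep spec.CapturedGinkgoWriterOutput for debugging
	default:
		spec.CapturedGinkgoWriterOutput = ""
	}
}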
One problem that I have with system-err is that it doesn't include information about the timeout:
STEP: waiting for test timeout 10/10/22 20:39:55.541
<<here is where the test times out - nothing gets printed>>
STEP: cleaning the environment after hostpath 10/10/22 20:42:25.354
STEP: Deleting pvc 10/10/22 20:42:25.354
When reading that output, it is not clear where something failed. We solved this for Gomega failures by writing the failure first to the GinkgoWriter, then invoking ginkgo.Fail: https://github.com/kubernetes/kubernetes/blob/bb2101bac12748aa77adfa15b07b5b8d8dcd0672/test/e2e/framework/log.go#L54-L63
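The linked helper boils down to roughly this (simplified sketch, not the exact Kubernetes code):
func Failf(format string, args ...interface{}) {
	msg := fmt.Sprintf(format, args...)
	// Write the failure to the GinkgoWriter first so it appears in the
	// temporal output, then hand the same message to ginkgo.Fail.
	ginkgo.GinkgoWriter.Println("FAIL: " + msg)
	ginkgo.Fail(msg, 1) // skip one caller frame so the failure points at the test
}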
Ginkgo works as designed, I just wonder whether it would be worthwhile to have a "log failures to GinkgoWriter" feature that then would work for ginkgo.Fail and also timeouts detected by Ginkgo itself.
See https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/112923/pull-kubernetes-e2e-kind/1579548220873248768 for test results as shown by the Kubernetes CI.
ok - so you want the system-err to basically act as a temporal log of the spec? let me see what I can do. I already support this for ProgressReports and I don't think it will be too hard to support it for failures.
In the meantime the latest Gomega has a few nice new features. Eventually and Consistently can automatically forward context arguments to functions, and there's now support for argument forwarding in general.
So... you no longer have to write:
It("fetches the correct count", func(ctx SpecContext) {
Eventually(func() int {
return client.FetchCount(ctx, "/users")
}, ctx).Should(BeNumerically(">=", 17))
}, SpecTimeout(time.Second))
you can do this instead:
It("fetches the correct count", func(ctx SpecContext) {
Eventually(client.FetchCount).WithContext(ctx).WithArguments("/users").Should(BeNumerically(">=", 17))
}, SpecTimeout(time.Second))
more is documented here:
https://onsi.github.io/gomega/#making-asynchronous-assertions
Are you hoping to see the entire failure in the GinkgoWriter log?
I could imagine a lot of users getting frustrated at the duplication of seeing the failure in the GinkgoWriter output and then in the report itself.
It seems the actual desire here is for a "timeline view" - which system-err in the JUnit report is close to. I'll explore what it looks like to make that notion more of a first-class citizen.
Perhaps we can add a ginkgo.RegisterFailCallback similar to gomega.RegisterFailHandler? Then ginkgo.Fail could invoke such a callback before proceeding with handling the failure.
I had considered ginkgo.RegisterFailHandler, but such a handler would still need to call the original ginkgo.Fail somehow. Just hooking into failure handling seems simpler.
I'm going to give the topic some thought today. There are a number of "events" and "streams" that happen over the lifecycle of a spec:
- data is written to GinkgoWriter
- nodes start/finish running
- By is called
- ReportEntries are added
- ProgressReports are generated
- timeouts occur
- failures occur
these are all tracked and stored on the SpecReport independently and presented in structured ways when the test completes. But it makes sense to want a unified timeline view that shows all these interleaved and in order.
This is why I think it's more than just adding a hook to Fail. I want to step back and explore how to solve the more general problem.
To that end - I'd like to ship the timeout/interrupt/context stuff that's been done so far and then work on this timeline piece as a separate unit of work, next.
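For reference, a minimal sketch of how those independently tracked pieces can already be read off the SpecReport today, e.g. in a ReportAfterEach:
var _ = ReportAfterEach(func(report SpecReport) {
	// Each stream listed above lands on the SpecReport separately.
	GinkgoWriter.Printf("state: %s\n", report.State)
	GinkgoWriter.Printf("captured output:\n%s\n", report.CapturedGinkgoWriterOutput)
	for _, entry := range report.ReportEntries {
		GinkgoWriter.Printf("report entry %q at %s\n", entry.Name, entry.Location)
	}
})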
Are you hoping to see the entire failure in the GinkgoWriter log?
Yes. The way I see it, analysis of failures in Kubernetes goes like this:
- test failures are listed in a job overview with their failure message: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/112923/pull-kubernetes-e2e-kind/1579548220873248768
- those failure messages get aggregated to list all jobs that encountered the same error: https://storage.googleapis.com/k8s-triage/index.html?date=2022-09-26&pr=1&text=timeout
- reading the test output for the failed test is the next step
- followed by reading the output of the entire job - but that can be very large and contains potentially unrelated failures
To that end - I'd like to ship the timeout/interrupt/context stuff that's been done so far and then work on this timeline piece as a separate unit of work, next.
Makes sense.
Hi all - from the changelog https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md#interruptible-nodes-and-timeouts, the grace period was introduced in v2.3.0, right? But I didn't find the corresponding release.
hey @kerthcet it's not out yet - sorry, I've been updating the CHANGELOG as I go but the release should come out in the next couple of days.
Alrighty - one last Gomega change that's related to all this Timeout and Interrupt work.
You can now return a StopTrying("message") error or simply call StopTrying("message").Now() to signal to Gomega that it should stop polling an Eventually/Consistently. @pohly I know this was something you were saying y'all could find useful. PTAL if you're interested, it's documented here
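A quick sketch of the idea (client and ErrGone are placeholders):
Eventually(func(ctx context.Context) (int, error) {
	count, err := client.FetchCount(ctx, "/users")
	if errors.Is(err, ErrGone) {
		// No point in polling further - stop Eventually right away.
		return 0, StopTrying("the endpoint is gone")
	}
	return count, err
}).WithContext(ctx).Should(BeNumerically(">=", 17))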
I don't see StopTrying in https://github.com/onsi/gomega/commits/master yet?
Lol - sorry i thought i'd pushed to github before i left the coffee shop ;)
i'll push it out when i'm back at my computer!
Alright - it's up now. Sorry about that!
Ginkgo 2.3.0 is out now with support for interruptible nodes, as is Gomega 1.22.0 with support for taking a context!
I wanted to call out one more thing to get thoughts/input/feedback before closing this issue and calling this feature done.
Currently the context provided by Ginkgo is not configured as a WithDeadline context, so calling ctx.Deadline() will always return (time.Time{}, false). This is the case even if the node in question has a timeout associated with it.
The reason for this is subtle but important. Ginkgo must take a Progress Report snapshot of the running spec just before the deadline expires in order to capture where the spec is potentially stuck. If we used a context.WithDeadline() context we would set up a nondeterministic race between Ginkgo's snapshot and the code responding to a cancelled context. So, instead, Ginkgo manages the cancellation of the context itself when the (independently tracked) deadline elapses.
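A quick sketch illustrating the behavior described above:
It("has no deadline on its context", func(ctx SpecContext) {
	// Even with a NodeTimeout, ctx.Deadline() reports no deadline; Ginkgo
	// tracks the timeout itself and cancels ctx when it elapses.
	deadline, ok := ctx.Deadline()
	Expect(ok).To(BeFalse())
	Expect(deadline).To(BeZero())
}, NodeTimeout(time.Second))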
If folks believe it is important that the context Ginkgo provides return a non-zero Deadline() I can look into simulating that behavior.
I don't think it is important.
Thank you @pohly and @onsi for the active development of this proposal. I was super excited to try this out today, and it works very well!
I'm wondering though if the following is expected or should be considered a bug:
import (
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"

	"go.uber.org/fx"
)

var _ = Describe("this fails", func() {
	BeforeEach(func(ctx SpecContext) {
		app := fx.New()
		Expect(app.Start(ctx)).To(Succeed())
		DeferCleanup(app.Stop, ctx) // this will always fail with "context canceled"
	})
	It("does something", func() {})
})

var _ = Describe("this succeeds #1", func() {
	BeforeEach(func(ctx SpecContext) {
		app := fx.New()
		Expect(app.Start(ctx)).To(Succeed())
		DeferCleanup(func(ctx SpecContext) error { return app.Stop(ctx) })
	})
	It("does something", func() {})
})

var _ = Describe("this succeeds #2", func() {
	var app *fx.App
	BeforeEach(func(ctx SpecContext) {
		app = fx.New()
		Expect(app.Start(ctx)).To(Succeed())
	})
	AfterEach(func(ctx SpecContext) {
		Expect(app.Stop(ctx)).To(Succeed())
	})
	It("does something", func() {})
})
will output the following:
Will run 3 of 19 specs
[Fx] PROVIDE fx.Lifecycle <= go.uber.org/fx.New.func1()
[Fx] PROVIDE fx.Shutdowner <= go.uber.org/fx.(*App).shutdowner-fm()
[Fx] PROVIDE fx.DotGraph <= go.uber.org/fx.(*App).dotGraph-fm()
[Fx] RUNNING
•SS[Fx] PROVIDE fx.Lifecycle <= go.uber.org/fx.New.func1()
[Fx] PROVIDE fx.Shutdowner <= go.uber.org/fx.(*App).shutdowner-fm()
[Fx] PROVIDE fx.DotGraph <= go.uber.org/fx.(*App).dotGraph-fm()
[Fx] RUNNING
•[Fx] PROVIDE fx.Lifecycle <= go.uber.org/fx.New.func1()
[Fx] PROVIDE fx.Shutdowner <= go.uber.org/fx.(*App).shutdowner-fm()
[Fx] PROVIDE fx.DotGraph <= go.uber.org/fx.(*App).dotGraph-fm()
[Fx] RUNNING
[Fx] ERROR Failed to stop cleanly: context canceled
------------------------------
• [FAILED] [0.000 seconds]
this fails [DeferCleanup (Each)]
/Users/adam/Documents/Projects/github.com/crewlinker/core/internal/postgres/listen_test.go:13
does something
/Users/adam/Documents/Projects/github.com/crewlinker/core/internal/postgres/listen_test.go:16
DeferCleanup callback returned error: context canceled
In [DeferCleanup (Each)] at: /Users/adam/Documents/Projects/github.com/crewlinker/core/internal/postgres/listen_test.go:13
------------------------------
SSSSSSSSSSSSSS
Summarizing 1 Failure:
[FAIL] this fails [DeferCleanup (Each)] does something
/Users/adam/Documents/Projects/github.com/crewlinker/core/internal/postgres/listen_test.go:13
Ran 3 of 19 Specs in 0.001 seconds
FAIL! -- 2 Passed | 1 Failed | 0 Pending | 16 Skipped
--- FAIL: TestPostgres (0.00s)
FAIL
Ginkgo ran 1 suite in 1.454809958s
Test Suite Failed
At first I would expect all of these to succeed equally, but I guess the DeferCleanup in the first case gets the ctx from the BeforeEach node, while in the other cases the cleanup callback receives a fresh context for its own node(?). But maybe the way I call DeferCleanup with a context is common enough to warrant an exception? Not sure.