
RFC: Ginkgo Timeout and Cleanup Proposal

onsi opened this issue on Sep 19, 2022

I'd love some feedback on the following proposal for per-spec and per-node timeout support, as well as improved cleanup behavior, for Ginkgo 2.0.

Please take a look at the google doc and comment there:

https://docs.google.com/document/d/1pxeKk43-upOPJ6uEnNS8CTA0dRCJcxmNefwjdOzAvWI/edit#

Broader discussion can happen on this issue as well.

onsi avatar Sep 19 '22 20:09 onsi

cc @pohly and @aojea - let's use this issue in lieu of #969 cc @johnSchnake as you'd opened a similar issue in #897

and cc @thediveo if you've got time and interest as a part of this will include a (fully backwards compatible) evolution of Eventually and I'd love your thoughts/input.

onsi avatar Sep 19 '22 20:09 onsi

This is now implemented on master, pretty much as documented in the google doc. The docs have been updated as well. Please take a look if you're interested. I'll plan on releasing 2.3.0 in the next few days.

The gomega work is not done yet - that will happen next.

onsi avatar Oct 05 '22 03:10 onsi

One quick thing: I think these timed-out specs might not appear correctly in the junit report. I'll take a look in the morning and fix it if need be.

onsi avatar Oct 05 '22 04:10 onsi

The JUnit reports were, in fact, not correctly capturing the timeout failure state. This is now fixed.

onsi avatar Oct 05 '22 14:10 onsi

Oh and, one last thing, the majority of the documentation for this feature set is here: https://onsi.github.io/ginkgo/#spec-timeouts-and-interruptible-nodes

onsi avatar Oct 05 '22 18:10 onsi

Looks like bunches of context.Background() need to be replaced in my unit tests with these passed-in contexts...
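
For illustration, a minimal before/after sketch of that migration (doSomething is a hypothetical helper, not part of Ginkgo):

// Before: the spec body builds its own context.
It("does something", func() {
	Expect(doSomething(context.Background())).To(Succeed())
})

// After: accept the context Ginkgo passes in, so the call is cancelled
// when the spec is interrupted or times out.
It("does something", func(ctx SpecContext) {
	Expect(doSomething(ctx)).To(Succeed())
}, SpecTimeout(time.Minute))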

thediveo avatar Oct 05 '22 18:10 thediveo

The perils of writing a testing framework 3 years before the standard library officially adopted context and then not having any time for years and years to incorporate it into the framework.

But I think I'm happy with where things are landing. Thanks to @pohly for pointing in this direction - I think things fit into canonical Go much more nicely now.

onsi avatar Oct 05 '22 20:10 onsi

OK - quick update:

  • the latest Gomega now has Eventually().WithContext() and can take a SpecContext
  • in addition, passing Eventually a SpecContext in this way allows it to register with Ginkgo so that Progress Reports (whether due to a timeout, or user-requested) will include the current state of Eventually's matcher.
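
As a rough sketch of what that looks like in a spec (client.FetchCount is a hypothetical function, not part of either library):

It("fetches the count", func(ctx SpecContext) {
	// Passing the SpecContext lets Gomega register with Ginkgo, so a
	// Progress Report will include the current state of this Eventually.
	Eventually(func() int {
		return client.FetchCount(ctx)
	}).WithContext(ctx).Should(BeNumerically(">", 3))
}, SpecTimeout(time.Second))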

onsi avatar Oct 06 '22 22:10 onsi

From the documentation:

When a SpecTimeout expires the current node is interrupted (i.e. its context is cancelled) and Ginkgo proceeds to run the clean up nodes subject to their own NodeTimeouts.

Are AfterEach nodes considered clean up nodes? Is that term defined somewhere? Even if it is, it might be clearer here to mention AfterEach and DeferCleanup nodes explicitly.

pohly avatar Oct 08 '22 12:10 pohly

Yep AfterEach and DeferCleanup are both considered cleanup nodes. I'll update the docs.
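
For example (teardownDatabase is a hypothetical helper), both of these run as cleanup nodes, each interruptible with its own timeout:

AfterEach(func(ctx SpecContext) {
	// Runs as a cleanup node with its own interruptible context.
	Expect(teardownDatabase(ctx)).To(Succeed())
}, NodeTimeout(30*time.Second))

It("uses the database", func(ctx SpecContext) {
	DeferCleanup(func(ctx SpecContext) error {
		// Also a cleanup node, subject to its own NodeTimeout.
		return teardownDatabase(ctx)
	}, NodeTimeout(30*time.Second))
	// ... spec body ...
})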

onsi avatar Oct 08 '22 12:10 onsi

From @pohly on k8s slack:

I "broke" one test by adding <-ctx.Done() at the end:

commit 3415e13b47bba954d26fff90a0e7cdb7c9ea1471
Author: Patrick Ohly <[email protected]>
Date:   Sat Oct 8 14:20:46 2022 +0200

    WIP: intentionally time out storage test

diff --git a/test/e2e/storage/testsuites/provisioning.go b/test/e2e/storage/testsuites/provisioning.go
index 14b147433ef..1cd896f0fc9 100644
--- a/test/e2e/storage/testsuites/provisioning.go
+++ b/test/e2e/storage/testsuites/provisioning.go
@@ -176,7 +176,7 @@ func (p *provisioningTestSuite) DefineTests(driver storageframework.TestDriver,
 		}
 	}
 
-	ginkgo.It("should provision storage with mount options", func(ctx context.Context) {
+	ginkgo.It("should provision storage with mount options", ginkgo.NodeTimeout(5*time.Minute), func(ctx context.Context) {
 		if dInfo.SupportedMountOption == nil {
 			e2eskipper.Skipf("Driver %q does not define supported mount option - skipping", dInfo.Name)
 		}
@@ -193,6 +193,14 @@ func (p *provisioningTestSuite) DefineTests(driver storageframework.TestDriver,
 		SetupStorageClass(ctx, l.testCase.Client, l.testCase.Class)
 
 		l.testCase.TestDynamicProvisioning(ctx)
+
+		ginkgo.DeferCleanup(ginkgo.NodeTimeout(5*time.Second), func(ctx context.Context) {
+			ginkgo.By("Waiting for timeout during cleanup")
+			<-ctx.Done()
+		})
+
+		ginkgo.By("Waiting for timeout in test")
+		<-ctx.Done()
 	})
 
 	ginkgo.It("should provision storage with snapshot data source [Feature:VolumeSnapshotDataSource]", func(ctx context.Context) {

As expected, that test now times out - see "STEP: Waiting for timeout in test" in https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/112923/pull-kubernetes-e2e-gce-csi-serial/1578722704666660864/build-log.txt. I also get the expected "STEP: Waiting for timeout during cleanup" with a 5 second delay (the default grace period?!). However, the JUnit file at https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/112923/pull-kubernetes-e2e-gce-csi-serial/1578722704666660864/artifacts/junit_01.xml only contains the timeout error that was generated by Ginkgo, which is the same as the failure message. I expected the captured Ginkgo output there.

There's no indication that the DeferCleanup node itself also timed out. I think that should be treated like the earlier timeout and trigger a progress report.

To which @onsi replied:

Currently Ginkgo only reports on the first failure for a given spec. With this more careful cleanup and timeout behavior it makes sense to track additional failures, so I'll work on that. (This is why you have no indication that DeferCleanup timed out.) I might have Ginkgo only emit these additional failures to the console when running in verbose or very-verbose mode.

I ran a simple reproduction and saw that all the GinkgoWriter and By statements do appear in the junit file; so I'm not sure what's happening in your case… is there anything downstream that processes the junit files?

The default grace period is 30s and in the case of this spec it should not be in play, as things exit right after <-ctx.Done(), so I would expect you to see the DeferCleanup without much delay (unless there is an AfterEach that is running first).
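
For reference, a small sketch of where the grace period does come into play: a node that never returns after its context is cancelled keeps running until the grace period elapses, at which point Ginkgo abandons it (leaking the goroutine) and moves on. The grace period can also be set per node:

It("ignores cancellation", func(ctx SpecContext) {
	<-ctx.Done()
	time.Sleep(time.Hour) // never honors the cancelled context
}, NodeTimeout(time.Second), GracePeriod(5*time.Second))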

This is a simple reproducer that shows the GinkgoWriter output appearing in system-err. You can add it to a suite and run it locally:

var _ = It("does things", func(ctx SpecContext) {
	GinkgoWriter.Println("Thing A")
	GinkgoWriter.Println("Thing B")

	DeferCleanup(func(ctx SpecContext) {
		By("Waiting for timeout during cleanup")
		GinkgoWriter.Println("Thing C")
		<-ctx.Done()
	}, NodeTimeout(time.Second))

	By("Waiting for timeout during test")
	<-ctx.Done()
}, NodeTimeout(time.Second))

The relevant bits of the junit output from running ginkgo --junit-report=out.junit:

<testcase name="[It] does things" classname="Foo Suite" status="timeout" time="2.005228167">
              <failure message="This spec timed out and reported the following failure after the timeout:&#xA;&#xA;My failure..." type="timeout">/Users/onsi/code/gomega/foo/foo_suite_test.go:28&#xA;This spec timed out and reported the following failure after the timeout:&#xA;&#xA;My failure...&#xA;------------------------------&#xA;This is the Progress Report generated when the timeout occurred:&#xA;  does things (Spec Runtime: 1.001s)&#xA;    /Users/onsi/code/gomega/foo/foo_suite_test.go:16&#xA;    In [It] (Node Runtime: 1.001s)&#xA;      /Users/onsi/code/gomega/foo/foo_suite_test.go:16&#xA;      At [By Step] Waiting for timeout during test (Step Runtime: 1.001s)&#xA;        /Users/onsi/code/gomega/foo/foo_suite_test.go:26&#xA;&#xA;      Spec Goroutine&#xA;      goroutine 23 [chan receive]&#xA;      &gt; github.com/onsi/gomega/foo_test.glob..func1({0x100f917c8, 0x14000093840})&#xA;          /Users/onsi/code/gomega/foo/foo_suite_test.go:27&#xA;            | &#xA;            | &#x9;By(&#34;Waiting for timeout during test&#34;)&#xA;            &gt; &#x9;&lt;-ctx.Done()&#xA;            | &#x9;Fail(&#34;My failure...&#34;)&#xA;            | }, NodeTimeout(time.Second))&#xA;        github.com/onsi/ginkgo/v2/internal.(*Suite).runNode.func2()&#xA;          /Users/onsi/code/ginkgo/internal/suite.go:750&#xA;        github.com/onsi/ginkgo/v2/internal.(*Suite).runNode&#xA;          /Users/onsi/code/ginkgo/internal/suite.go:738&#xA;------------------------------&#xA;</failure>
          <system-out>&#xA;Report Entries:&#xA;By Step&#xA;/Users/onsi/code/gomega/foo/foo_suite_test.go:26&#xA;2022-10-08T18:19:04.98692-06:00&#xA;&amp;{Text:Waiting for timeout during test Duration:0s}&#xA;--&#xA;By Step&#xA;/Users/onsi/code/gomega/foo/foo_suite_test.go:21&#xA;2022-10-08T18:19:05.990033-06:00&#xA;&amp;{Text:Waiting for timeout during cleanup Duration:0s}&#xA;</system-out>
          <system-err>Thing A&#xA;Thing B&#xA;�[1mSTEP:�[0m Waiting for timeout during test �[38;5;243m10/08/22 18:19:04.986�[0m&#xA;�[1mSTEP:�[0m Waiting for timeout during cleanup �[38;5;243m10/08/22 18:19:05.99�[0m&#xA;Thing C&#xA;</system-err>
</testcase>

onsi avatar Oct 09 '22 01:10 onsi

@pohly the latest Ginkgo on master will now include additional failures during cleanup in the console and machine-readable reports. PTAL - in your example you should now see that both the It timed out and the DeferCleanup timed out.

onsi avatar Oct 10 '22 16:10 onsi

I updated ginkgo to a87ec827fd11 in https://github.com/kubernetes/kubernetes/pull/112923. I ran a similar "broken" test locally with -focus set so that only that one spec runs.

$ _output/bin/ginkgo -v -junit-report=junit.xml -progress -focus=csi-hostpath.*Dynamic.PV.*default.*volumes.should.store.data ./test/e2e
...
  Timeout
  In [It] at: /nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:160

  This is the Progress Report generated when the timeout occurred:
    [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] volumes should store data (Spec Runtime: 3m0.018s)
      /nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:160
      In [It] (Node Runtime: 3m0.001s)
        /nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:160
        At [By Step] waiting for test timeout (Step Runtime: 2m13.887s)
          /nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:199

        Spec Goroutine
        goroutine 1642 [chan receive, 2 minutes]
        > k8s.io/kubernetes/test/e2e/storage/testsuites.(*volumesTestSuite).DefineTests.func3({0x7f64d427ba40, 0xc001f44700})
            /nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:200
              | 
              | 	ginkgo.By("waiting for test timeout")
              > 	<-ctx.Done()
...
  There were additional failures detected after the initial failure:
    [TIMEDOUT]
    Timeout
    In [DeferCleanup (Each)] at: /nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:194

    This is the Progress Report generated when the timeout occurred:
      [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] volumes should store data (Spec Runtime: 4m5.039s)
        /nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:160
        In [DeferCleanup (Each)] (Node Runtime: 1m0.001s)
          /nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:194
          At [By Step] waiting for cleanup timeout (Step Runtime: 1m0.001s)
            /nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:195

          Spec Goroutine
          goroutine 1943 [chan receive, 2 minutes]
          > k8s.io/kubernetes/test/e2e/storage/testsuites.(*volumesTestSuite).DefineTests.func3.2({0x7f64d427ba40, 0xc00252f180})
              /nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:196
                | ginkgo.DeferCleanup(ginkgo.NodeTimeout(time.Minute), func(ctx context.Context) {
                | 	ginkgo.By("waiting for cleanup timeout")
                > 	<-ctx.Done()
...
[ReportAfterSuite] PASSED [0.045 seconds]
[ReportAfterSuite] Autogenerated ReportAfterSuite for --junit-report
autogenerated by Ginkgo

  Begin Captured GinkgoWriter Output >>
    [ReportAfterSuite] TOP-LEVEL
      autogenerated by Ginkgo
  << End Captured GinkgoWriter Output
------------------------------


Summarizing 1 Failure:
  [TIMEDOUT] [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] volumes [It] should store data
  /nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:160

Ran 1 of 6597 Specs in 257.618 seconds
FAIL! -- 0 Passed | 1 Failed | 0 Pending | 6596 Skipped
--- FAIL: TestE2E (257.80s)
FAIL

I can see the system-err with proper content in this junit.xml. Reporting the timeout of the DeferCleanup also worked. I mentioned earlier that the JUnit report didn't include the GinkgoWriter output when using the Kubernetes -report-dir flag (= a custom ReportAfterSuite). That turned out to be because we filter out the output for specs if they don't have State == types.SpecStateFailed. After adding types.SpecStateTimedout it worked there as well.
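
For anyone hitting the same thing, a minimal sketch of that kind of filter (not the actual Kubernetes code; the field and constant names come from Ginkgo's types package):

import (
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"

	"github.com/onsi/ginkgo/v2/reporters"
	"github.com/onsi/ginkgo/v2/types"
)

var _ = ReportAfterSuite("filtered junit report", func(report Report) {
	for i := range report.SpecReports {
		spec := &report.SpecReports[i]
		// Keeping output only for SpecStateFailed silently drops the
		// captured output of timed-out specs; SpecStateTimedout has to
		// be included as well.
		if spec.State != types.SpecStateFailed && spec.State != types.SpecStateTimedout {
			spec.CapturedGinkgoWriterOutput = ""
		}
	}
	Expect(reporters.GenerateJUnitReport(report, "junit_01.xml")).To(Succeed())
})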

One problem that I have with system-err is that it doesn't include information about the timeout:

STEP: waiting for test timeout 10/10/22 20:39:55.541
<<here is where the test times out - nothing gets printed>>
STEP: cleaning the environment after hostpath 10/10/22 20:42:25.354
STEP: Deleting pvc 10/10/22 20:42:25.354

When reading that output, it is not clear where something failed. We solved this for Gomega failures by writing the failure first to the GinkgoWriter, then invoking ginkgo.Fail: https://github.com/kubernetes/kubernetes/blob/bb2101bac12748aa77adfa15b07b5b8d8dcd0672/test/e2e/framework/log.go#L54-L63

Ginkgo works as designed, I just wonder whether it would be worthwhile to have a "log failures to GinkgoWriter" feature that then would work for ginkgo.Fail and also timeouts detected by Ginkgo itself.
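
For concreteness, the pattern boils down to a tiny helper along these lines (a sketch, not the exact Kubernetes code):

import (
	"fmt"

	"github.com/onsi/ginkgo/v2"
)

func Failf(format string, args ...interface{}) {
	msg := fmt.Sprintf(format, args...)
	// Log first so the failure shows up in the captured GinkgoWriter
	// output (and therefore in system-err), then fail the spec.
	ginkgo.GinkgoWriter.Printf("FAIL: %s\n", msg)
	// Skip one stack frame so the failure is attributed to the caller.
	ginkgo.Fail(msg, 1)
}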

pohly avatar Oct 10 '22 19:10 pohly

See https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/112923/pull-kubernetes-e2e-kind/1579548220873248768 for test results as shown by the Kubernetes CI.

pohly avatar Oct 10 '22 20:10 pohly

ok - so you want the system-err to basically act as a temporal log of the spec? let me see what I can do. I already support this for ProgressReports and I don't think it will be too hard to support it for failures.

In the meantime the latest Gomega has a few nice new features. Eventually and Consistently can automatically forward context arguments to functions, and there's now support for argument forwarding in general.

So... you no longer have to write:

It("fetches the correct count", func(ctx SpecContext) {
    Eventually(func() int {
        return client.FetchCount(ctx, "/users")
    }, ctx).Should(BeNumerically(">=", 17))
}, SpecTimeout(time.Second))

you can do this instead:

It("fetches the correct count", func(ctx SpecContext) {
    Eventually(client.FetchCount).WithContext(ctx).WithArguments("/users").Should(BeNumerically(">=", 17))
}, SpecTimeout(time.Second))

more is documented here:

https://onsi.github.io/gomega/#making-asynchronous-assertions

onsi avatar Oct 10 '22 21:10 onsi

Are you hoping to see the entire failure in the GinkgoWriter log?

I could imagine a lot of users getting frustrated at the duplication of seeing the failure in the GinkgoWriter output and then in the report itself.

It seems the actual desire here is for a "timeline view" - which system-err in the JUnit report is close to. I'll explore what it looks like to make that notion more of a first-class citizen.

onsi avatar Oct 10 '22 22:10 onsi

Perhaps we can add a ginkgo.RegisterFailCallback similar to gomega.RegisterFailHandler? Then ginkgo.Fail could invoke such a callback before proceeding with handling the failure.

I had considered ginkgo.RegisterFailHandler, but such a handler would still need to call the original ginkgo.Fail somehow. Just hooking into failure handling seems simpler.

pohly avatar Oct 11 '22 07:10 pohly

I'm going to give the topic some thought today. There are a number of "events" and "streams" that happen over the lifecycle of a spec:

  • data is written to GinkgoWriter
  • nodes start/finish running
  • By is called
  • ReportEntries are added
  • ProgressReports are generated
  • timeouts occur
  • failures occur

these are all tracked and stored on the SpecReport independently and presented in structured ways when the test completes. But it makes sense to want a unified timeline view that shows all these interleaved and in order.

This is why I think it's more than just adding a hook to Fail. I want to step back and explore how to solve the more general problem.

To that end - I'd like to ship the timeout/interrupt/context stuff that's been done so far and then work on this timeline piece as a separate unit of work, next.

onsi avatar Oct 11 '22 12:10 onsi

Are you hoping to see the entire failure in the GinkgoWriter log?

Yes. The way I see it, analysis of failures in Kubernetes goes like this:

  • test failures are listed in a job overview with their failure message: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/112923/pull-kubernetes-e2e-kind/1579548220873248768
  • those failure messages get aggregated to list all jobs that encountered the same error: https://storage.googleapis.com/k8s-triage/index.html?date=2022-09-26&pr=1&text=timeout
  • reading the test output for the failed test is the next step
  • followed by reading the output of the entire job - but that can be very large and contains potentially unrelated failures

pohly avatar Oct 11 '22 13:10 pohly

To that end - I'd like to ship the timeout/interrupt/context stuff that's been done so far and then work on this timeline piece as a separate unit of work, next.

Makes sense.

pohly avatar Oct 11 '22 13:10 pohly

Hi all, from the changelog https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md#interruptible-nodes-and-timeouts, grace period was introduced in v2.3.0, right? But I didn't find the corresponding release.

kerthcet avatar Oct 11 '22 16:10 kerthcet

hey @kerthcet it's not out yet - sorry, I've been updating the CHANGELOG as I go but the release should come out in the next couple of days.

onsi avatar Oct 11 '22 17:10 onsi

Alrighty - one last Gomega change that's related to all this Timeout and Interrupt work.

You can now return a StopTrying("message") error or simply call StopTrying("message").Now() to signal to Gomega that it should stop polling an Eventually/Consistently. @pohly I know this was something you were saying y'all could find useful. PTAL if you're interested, it's documented here
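
A small sketch of what that can look like inside a polled function (fetchCount and errGone are hypothetical):

Eventually(func() (int, error) {
	count, err := fetchCount()
	if errors.Is(err, errGone) {
		// No point in polling any further; Eventually fails right away
		// instead of waiting out its timeout. StopTrying("...").Now()
		// does the same from anywhere via panic.
		return 0, StopTrying("the server is gone for good")
	}
	return count, err
}).Should(BeNumerically(">=", 17))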

onsi avatar Oct 11 '22 17:10 onsi

I don't see StopTrying in https://github.com/onsi/gomega/commits/master yet?

pohly avatar Oct 11 '22 17:10 pohly

Lol - sorry i thought i'd pushed to github before i left the coffee shop ;)

i'll push it out when i'm back at my computer!

onsi avatar Oct 11 '22 17:10 onsi

Alright - it's up now. Sorry about that!

onsi avatar Oct 11 '22 18:10 onsi

Ginkgo 2.3.0 is out now with support for interruptible nodes, as is Gomega 1.22.0 with support for taking a context!

onsi avatar Oct 11 '22 19:10 onsi

I wanted to call out one more thing to get thoughts/input/feedback before closing this issue and calling this feature done.

Currently the context provided by Ginkgo is not configured as a WithDeadline context, so calling ctx.Deadline() will always return (time.Time{}, false). This is the case even if the node in question has a timeout associated with it.

The reason for this is subtle but important. Ginkgo must take a Progress Report snapshot of the running spec just before the deadline expires in order to capture where the spec is potentially stuck. If we used a context.WithDeadline() context we would set up a nondeterministic race between Ginkgo's snapshot and the code responding to a cancelled context. So, instead, Ginkgo manages the cancellation of the context when the (independently tracked) deadline elapses.

If folks believe it is important that the context Ginkgo provides return a non-zero Deadline() I can look into simulating that behavior.
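
To make the current behavior concrete, a small sketch of what passes today:

It("exposes no deadline on its context", func(ctx SpecContext) {
	_, hasDeadline := ctx.Deadline()
	Expect(hasDeadline).To(BeFalse()) // no deadline, even with a NodeTimeout
	<-ctx.Done()                      // but the context is still cancelled at the timeout
}, NodeTimeout(100*time.Millisecond))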

onsi avatar Oct 17 '22 18:10 onsi

I don't think it is important.

pohly avatar Oct 17 '22 18:10 pohly

Thank you @pohly and @onsi for the active development of this proposal. I was super excited to try this out today, and it works very well!

I'm wondering though if the following is expected or should be considered a bug:

import (
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
	"go.uber.org/fx"
)

var _ = Describe("this fails", func() {
	BeforeEach(func(ctx SpecContext) {
		app := fx.New()
		Expect(app.Start(ctx)).To(Succeed())
		DeferCleanup(app.Stop, ctx) // this will always fail with "context canceled"
	})

	It("does something", func() {})
})

var _ = Describe("this succeeds #1", func() {
	BeforeEach(func(ctx SpecContext) {
		app := fx.New()
		Expect(app.Start(ctx)).To(Succeed())
		DeferCleanup(func(ctx SpecContext) error { return app.Stop(ctx) })
	})

	It("does something", func() {})
})

var _ = Describe("this succeeds #2", func() {
	var app *fx.App
	BeforeEach(func(ctx SpecContext) {
		app = fx.New()
		Expect(app.Start(ctx)).To(Succeed())
	})
	AfterEach(func(ctx SpecContext) {
		Expect(app.Stop(ctx)).To(Succeed())
	})

	It("does something", func() {})
})

will output the following:

Will run 3 of 19 specs
[Fx] PROVIDE	fx.Lifecycle <= go.uber.org/fx.New.func1()
[Fx] PROVIDE	fx.Shutdowner <= go.uber.org/fx.(*App).shutdowner-fm()
[Fx] PROVIDE	fx.DotGraph <= go.uber.org/fx.(*App).dotGraph-fm()
[Fx] RUNNING
•SS[Fx] PROVIDE	fx.Lifecycle <= go.uber.org/fx.New.func1()
[Fx] PROVIDE	fx.Shutdowner <= go.uber.org/fx.(*App).shutdowner-fm()
[Fx] PROVIDE	fx.DotGraph <= go.uber.org/fx.(*App).dotGraph-fm()
[Fx] RUNNING
•[Fx] PROVIDE	fx.Lifecycle <= go.uber.org/fx.New.func1()
[Fx] PROVIDE	fx.Shutdowner <= go.uber.org/fx.(*App).shutdowner-fm()
[Fx] PROVIDE	fx.DotGraph <= go.uber.org/fx.(*App).dotGraph-fm()
[Fx] RUNNING
[Fx] ERROR		Failed to stop cleanly: context canceled

------------------------------
• [FAILED] [0.000 seconds]
this fails [DeferCleanup (Each)]
/Users/adam/Documents/Projects/github.com/crewlinker/core/internal/postgres/listen_test.go:13
  does something
  /Users/adam/Documents/Projects/github.com/crewlinker/core/internal/postgres/listen_test.go:16

  DeferCleanup callback returned error: context canceled
  In [DeferCleanup (Each)] at: /Users/adam/Documents/Projects/github.com/crewlinker/core/internal/postgres/listen_test.go:13
------------------------------
SSSSSSSSSSSSSS


Summarizing 1 Failure:
  [FAIL] this fails [DeferCleanup (Each)] does something
  /Users/adam/Documents/Projects/github.com/crewlinker/core/internal/postgres/listen_test.go:13

Ran 3 of 19 Specs in 0.001 seconds
FAIL! -- 2 Passed | 1 Failed | 0 Pending | 16 Skipped
--- FAIL: TestPostgres (0.00s)
FAIL

Ginkgo ran 1 suite in 1.454809958s

Test Suite Failed

At first I would expect all of these to succeed equally, but I guess the DeferCleanup in the first case gets the ctx from the BeforeEach node, while in the other cases a new context is created for the cleanup node(?). But maybe the way I call DeferCleanup with a context is common enough to deserve an exception? Not sure.

advdv avatar Oct 18 '22 07:10 advdv