
Data race when adding reports from multiple goroutines

austince opened this issue

What I'm doing now, which may be completely misguided:

  • For long-running tests, I want to periodically poll some system information and add it to the spec reports.
  • I do this by spawning a goroutine in a BeforeEach, and closing it in the AfterEach
    func ReportPeriodically(interval time.Duration) {
    	mu := sync.Mutex{}
    	cancelMap := map[string]func(){}

    	ginkgo.BeforeEach(func() {
    		report := ginkgo.CurrentSpecReport()
    		name := report.FullText()

    		ctx, cancel := context.WithCancel(context.Background())

    		mu.Lock()
    		defer mu.Unlock()
    		cancelMap[name] = cancel

    		ticker := time.NewTicker(interval)
    		go func() {
    			for {
    				// the ticker case may still fire after cancellation since
    				// select has no case priority, but that should not matter
    				select {
    				case now := <-ticker.C:
    					reportName := fmt.Sprintf("%s-periodic-%d", name, now.Unix())
    					ginkgo.AddReportEntry(reportName, "some system info")
    				case <-ctx.Done():
    					ticker.Stop()
    					return
    				}
    			}
    		}()
    	})

    	ginkgo.AfterEach(func() {
    		report := ginkgo.CurrentSpecReport()
    		name := report.FullText()

    		mu.Lock()
    		defer mu.Unlock()
    		if cancel, ok := cancelMap[name]; ok {
    			cancel()
    		}
    		delete(cancelMap, name)
    	})
    }
    
  • I run the suite sequentially
  • Data Race:
    WARNING: DATA RACE
    Read at 0x00c0005824d0 by goroutine 18:
    github.com/onsi/ginkgo/v2/internal.(*Suite).AddReportEntry()
        /home/runner/go/pkg/mod/github.com/onsi/ginkgo/[email protected]/internal/suite.go:226 +0x386
    github.com/onsi/ginkgo/v2.AddReportEntry()
        /home/runner/go/pkg/mod/github.com/onsi/ginkgo/[email protected]/reporting_dsl.go:69 +0x275
    

It seems the suite doesn't lock the entries array for reads/writes: https://github.com/onsi/ginkgo/blob/aae4480d35116455d7115e71120566e8be55b4a4/internal/suite.go#L226

I'm wondering if there's a better way to accomplish this, or if you'd accept a PR that protects access to the ReportEntries with a sync.RWMutex?

austince avatar Aug 09 '22 20:08 austince

hey @austince sorry for the delay. This is a cool use case. I believe the lack of a lock is an oversight on my part. If you'd still be up for a PR that would be great. If not, I'll work on it.

One quick comment - I don't think you need the cancelMap. Only one spec ever runs at a time in a given Ginkgo process so you should be able to do:

func ReportPeriodically(interval time.Duration) {
	ginkgo.BeforeEach(func() {
		name := ginkgo.CurrentSpecReport().FullText()

		ctx, cancel := context.WithCancel(context.Background())

		ticker := time.NewTicker(interval)
		go func() {
			for {
				select {
				case now := <-ticker.C:
					reportName := fmt.Sprintf("%s-periodic-%d", name, now.Unix())
					ginkgo.AddReportEntry(reportName, "some system info")
				case <-ctx.Done():
					ticker.Stop()
					return
				}
			}
		}()

		ginkgo.DeferCleanup(cancel)
	})
}

That DeferCleanup will call cancel for you after the spec completes (or fails).

onsi avatar Aug 24 '22 18:08 onsi

Ah, DeferCleanup, genius. Thanks for that. I'll put in a PR for the lock asap :)

austince avatar Aug 24 '22 18:08 austince

I'm actually not sure if I have the knowledge to finish the PR — it seems we need to lock more than just the append to the report entries, though I'm not sure why.

austince avatar Aug 24 '22 19:08 austince

hey thanks so much for taking the time to work on it. i really appreciate it. i'll take a look soon and circle back.

onsi avatar Aug 24 '22 21:08 onsi

hey @austince - it took me a bit to get my head back in the game but I think there are a couple of things going on here. One is an issue in Ginkgo (that isn't actually at play in this scenario - but is relevant as I'll explain below). The other is a subtle issue in your code. I'll tackle the latter first.


I was able to reproduce the race you're seeing quite easily - thanks for the clear example. At first I went down the road of locking access to the report in the two user-facing functions: CurrentSpecReport and AddReportEntry...

...and it didn't help. Then I noticed that the race was between the goroutine launched by ReportPeriodically calling AddReportEntry and Ginkgo's default reporter. This confused me... the reporter should only run after your test has finished cleaning up (i.e. after the AfterEach or DeferCleanup runs). That ordering is enforced by channels within Ginkgo, so it's strict: the body of your AfterEach/DeferCleanup is guaranteed to finish before the report is passed to Ginkgo's reporter.

So why was there a race?

And then I realized that the issue is that the call to cancel() in the AfterEach/DeferCleanup returns immediately and doesn't wait until the underlying goroutine finishes up. That means the goroutine could still be running when the AfterEach/DeferCleanup exits and while the reporter is doing its thing. Moreover, it means the goroutine could still be running when the next test starts, so you could end up with a report entry from one test making its way into the reports slice of another test.

So, in this case, the race detector is telling us about a real problem in the code that needs to be fixed. If you do this instead:

func ReportPeriodically(interval time.Duration) {
	ginkgo.BeforeEach(func() {
		name := ginkgo.CurrentSpecReport().FullText()

		ctx, cancel := context.WithCancel(context.Background())

		ticker := time.NewTicker(interval)
		done := make(chan interface{})
		go func() {
			for {
				select {
				case now := <-ticker.C:
					reportName := fmt.Sprintf("%s-periodic-%d", name, now.Unix())
					ginkgo.AddReportEntry(reportName, "some system info")
				case <-ctx.Done():
					ticker.Stop()
					close(done)
					return
				}
			}
		}()

		ginkgo.DeferCleanup(func() {
			cancel()
			<-done
		})
	})
}

you should see the race disappear, even when using the same version of Ginkgo.


But. This isn't the only issue at play. I realized that your perfectly valid usecase implies a class of races in Ginkgo that need to be resolved. Consider the following:

var _ = Context("reporting races", func() {
	ReportPeriodically(100 * time.Millisecond)

	It("shouldn't race... but does", func() {
		i := 0
		for i < 10 {
			ginkgo.AddReportEntry("boom")
			time.Sleep(50 * time.Millisecond)
			i++
		}
	})
})

Now, because the current version of Ginkgo doesn't account for concurrent calls to AddReportEntry, ginkgo -race will fail: the goroutine launched by ReportPeriodically races with the call to AddReportEntry in the It. To fix this I need to add a lock in the implementations of AddReportEntry and CurrentSpecReport. Since these are the only user-facing access points that can be hit concurrently, they should be the only places that need a lock.

I'm working on that now and will push it out tomorrow/Monday once I add some meaningful tests to ensure no races occur. But, you should be able to be unblocked with the current released version of Ginkgo by adding that done channel to ReportPeriodically.

onsi avatar Aug 28 '22 03:08 onsi

I've updated master and made AddReportEntry and CurrentSpecReport thread safe. I'll be cutting a release soon. Feel free to close this if you think we've resolved this @austince - or let me know if things don't make sense/aren't working.

onsi avatar Aug 29 '22 18:08 onsi

Ah, the done channel makes sense when you break it down. I've looked at the fix and it makes sense too, once everything else is synchronized. Thanks for digging in and fixing this, will close now!

austince avatar Aug 29 '22 21:08 austince