
Stop can cause race detector errors in testing

Open eli-darkly opened this issue 3 years ago • 14 comments

The unit tests for our project are normally run with the -race option, because we do a lot of concurrency stuff and want to avoid subtle unsafe usages. When I integrated ccache into the project, I started getting race detector errors in a test scenario where the cache is shut down with Stop() at the end of the test.

It seems that this is due to what the race detector considers to be unsafe usage of the promotables channel, where there is the potential for a race between close and a previous channel send, as documented here. The race detector isn't saying that a race really did happen during the test run, but it can tell, based on the pattern of accesses to the channel, that one could happen— so it considers that to be an automatic fail.
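For context, the channel pattern the detector flags looks roughly like this (a generic channel, not ccache's; it mirrors the "unsynchronized send and close" example in the race detector documentation):

// Run with `go run -race`. The send and the close have no happens-before
// relationship, so the detector reports a race; and if close happens to win,
// the send panics - which is exactly the hazard with Stop.
func main() {
	c := make(chan bool, 1)
	go func() {
		c <- true // send from one goroutine...
	}()
	close(c) // ...close from another, with nothing ordering the two
}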

I wondered why such an issue wouldn't have shown up in ccache's own unit tests, but that's because—

  1. Those tests aren't being run in race detection mode.
  2. The tests are not calling Stop at all. Like, there's no defer cache.Stop() after creating a store (so I imagine there are a lot of orphaned goroutines being created during test runs)— and also there doesn't seem to be any test coverage of Stop itself.

When I added a single defer cache.Stop() to a test, and then ran go test -race ./... instead of go test ./..., I immediately got the same kind of error. In any codebase where concurrency is very important, like this one, this is a bit concerning. Even if this particular kind of race condition might not have significant consequences in itself, the fact that it's not possible to run tests with race detection means we can't use that tool to detect other kinds of concurrency problems.

eli-darkly avatar Mar 17 '21 23:03 eli-darkly

Btw, when I say that Stop doesn't have test coverage... I see that it is used by the test helper function checkSize, so it does get called several times during test runs. So that's not as bad as having no coverage at all; it's not like you could replace the method body with a panic and still have green CI builds. But, for whatever reason, the tests that use that helper are not triggering the race detector, whereas adding defer cache.Stop() to any of the tests that didn't already do a Stop does trigger it. I presume that this is due to some difference in the cache usage pattern of the tests.

eli-darkly avatar Mar 17 '21 23:03 eli-darkly

I've also intermittently seen another race detector error when running the tests. This one doesn't seem to involve channels and I'm not quite sure what is going on with it— based on the stacktrace it seems like maybe it's trying to print an error message that references an *Item, and fmt.Sprintf is trying to read the properties of the Item in a non-thread-safe way. In that case, the issue isn't really with the cache code itself but just the way these values are being passed around in the test. That is less of an immediate problem for me, because (unlike the Stop issue) it doesn't prevent me from running my project's tests in race detection mode, but it is still an obstacle to being able to use race detection mode to verify ccache's own concurrency behavior.

But also, this might indicate that the test is actually unstable since I wouldn't expect it to be calling Runner.Errorf unless an expectation failed.

WARNING: DATA RACE
Read at 0x00c000167f60 by goroutine 42:
  reflect.typedmemmove()
      /Users/elibishop/.gvm/gos/go1.14/src/runtime/mbarrier.go:177 +0x0
  reflect.packEface()
      /Users/elibishop/.gvm/gos/go1.14/src/reflect/value.go:119 +0x126
  reflect.valueInterface()
      /Users/elibishop/.gvm/gos/go1.14/src/reflect/value.go:1023 +0x1b9
  reflect.Value.Interface()
      /Users/elibishop/.gvm/gos/go1.14/src/reflect/value.go:993 +0x3c27
  fmt.(*pp).printValue()
      /Users/elibishop/.gvm/gos/go1.14/src/fmt/print.go:726 +0x3c28
  fmt.(*pp).printValue()
      /Users/elibishop/.gvm/gos/go1.14/src/fmt/print.go:880 +0x2709
  fmt.(*pp).printArg()
      /Users/elibishop/.gvm/gos/go1.14/src/fmt/print.go:716 +0x25a
  fmt.(*pp).doPrintf()
      /Users/elibishop/.gvm/gos/go1.14/src/fmt/print.go:1030 +0x311
  fmt.Sprintf()
      /Users/elibishop/.gvm/gos/go1.14/src/fmt/print.go:219 +0x73
  github.com/karlseguin/expect.(*Runner).Errorf()
      /Users/elibishop/.gvm/pkgsets/go1.14/global/pkg/mod/github.com/karlseguin/[email protected]/runner.go:222 +0x12f
  github.com/karlseguin/expect.glob..func5()
      /Users/elibishop/.gvm/pkgsets/go1.14/global/pkg/mod/github.com/karlseguin/[email protected]/expect.go:55 +0x7c
  github.com/karlseguin/expect.showError()
      /Users/elibishop/.gvm/pkgsets/go1.14/global/pkg/mod/github.com/karlseguin/[email protected]/expect.go:255 +0x55b
  github.com/karlseguin/expect.equal()
      /Users/elibishop/.gvm/pkgsets/go1.14/global/pkg/mod/github.com/karlseguin/[email protected]/expect.go:154 +0x1d3
  github.com/karlseguin/expect.(*ToExpectation).Equal()
      /Users/elibishop/.gvm/pkgsets/go1.14/global/pkg/mod/github.com/karlseguin/[email protected]/expect.go:111 +0x19b
  github.com/launchdarkly/ccache.LayeredCacheTests.ResizeOnTheFly()
      /Users/elibishop/work/ccache/layeredcache_test.go:250 +0xa4d
  github.com/launchdarkly/ccache.(*LayeredCacheTests).ResizeOnTheFly()
      <autogenerated>:1 +0x40
  runtime.call32()
      /Users/elibishop/.gvm/gos/go1.14/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /Users/elibishop/.gvm/gos/go1.14/src/reflect/value.go:321 +0xd3
  github.com/karlseguin/expect.Expectify.func2()
      /Users/elibishop/.gvm/pkgsets/go1.14/global/pkg/mod/github.com/karlseguin/[email protected]/runner.go:112 +0x107
  github.com/karlseguin/expect.Expectify()
      /Users/elibishop/.gvm/pkgsets/go1.14/global/pkg/mod/github.com/karlseguin/[email protected]/runner.go:127 +0xb02
  github.com/launchdarkly/ccache.Test_LayeredCache()
      /Users/elibishop/work/ccache/layeredcache_test.go:16 +0x50
  testing.tRunner()
      /Users/elibishop/.gvm/gos/go1.14/src/testing/testing.go:992 +0x1eb

Previous write at 0x00c000167f60 by goroutine 64:
  github.com/launchdarkly/ccache.(*Item).shouldPromote()
      /Users/elibishop/work/ccache/item.go:74 +0x108
  github.com/launchdarkly/ccache.(*LayeredCache).doPromote()
      /Users/elibishop/work/ccache/layeredcache.go:272 +0x9a
  github.com/launchdarkly/ccache.(*LayeredCache).worker()
      /Users/elibishop/work/ccache/layeredcache.go:231 +0xab8

eli-darkly avatar Mar 17 '21 23:03 eli-darkly

I'm happy to try to fix these.

I believe I've fixed the intermittent issue that you were seeing. One of the fundamental problems with the ccache tests is that many ccache behaviors are asynchronous, so a test needs to sleep for some arbitrary period of time and hope the behavior under test has finished executing. Go's select, unlike Erlang's message queue, provides no ordering (FIFO) guarantees - so, as far as I can tell, we have no option but to sleep.

The LayeredCache's ResizeOnTheFly test was having problems because of this. When the following assertion failed:

Expect(cache.Get("0", "a")).To.Equal(nil)

We'd have a race condition because the test runner was reading values that it shouldn't (via reflection). Making the test more robust removes this issue.

I've added defer c.Stop() to a few tests, but haven't been able to reproduce this yet.

karlseguin avatar Mar 18 '21 11:03 karlseguin

Yesterday I was able to reproduce the "Stop causes a race on the channel" behavior pretty consistently, both in my project's tests that use ccache and also in ccache's own tests. But today, it's happening only in my project's tests and the only one I can reproduce in the ccache tests is the ResizeOnTheFly problem. I expect this is very timing-dependent so maybe overall execution speed on my machine is different today due to the load from other processes.

eli-darkly avatar Mar 18 '21 18:03 eli-darkly

This is a tangent, but about the "test runner reading values that it shouldn't" thing, you might be able to address that by defining a String() method on Item; then fmt.Sprintf should use that instead of reflectively reading the struct fields. That wouldn't fix the underlying test instability, but it would prevent the confusing misreporting of what the test failure was.
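Something along these lines (a hypothetical sketch; the key field is illustrative and may not match Item's actual internals):

// Giving Item a String method means fmt.Sprintf will call it instead of
// reflecting over the struct's fields, which the worker goroutine may be
// mutating concurrently.
func (i *Item) String() string {
	return fmt.Sprintf("Item(key=%s)", i.key)
}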

eli-darkly avatar Mar 18 '21 18:03 eli-darkly

About waiting for asynchronous things in the tests, like the changing value of GetDropped() - what I've found to be more reliable than adding arbitrary sleeps in cases like this is to poll the condition.

It's not hard to write a helper method to generalize that kind of logic: you specify a boolean condition function, the maximum time to wait for the condition (this timeout can be fairly long, like multiple seconds, since it only exists to keep a failing test from hanging forever, and it's OK for a failing test run to be slow), how often to poll the condition, and a function that generates a nice failure message. Our tests currently use github.com/stretchr/testify, which already has something similar built in, but doing it from scratch would look something like this:

func waitForCondition(timeout time.Duration, pollInterval time.Duration,
		condition func() bool, failMessage func() string) {
	timeoutCh := time.After(timeout)
	pollTicker := time.NewTicker(pollInterval)
	defer pollTicker.Stop()
	for {
		select {
		case <-timeoutCh:
			// Fail comes from the expect test framework
			Fail(failMessage())
			return
		case <-pollTicker.C:
			if condition() {
				return
			}
		}
	}
}

And then for the specific case of waiting for the dropped count to change (this version only works for Cache, but you could make it usable for LayeredCache too using an interface):

func waitForDroppedCount(cache *Cache, expectedCount int) {
	var lastCount int
	waitForCondition(time.Second*2, time.Millisecond*5,
		func() bool {
			lastCount = cache.GetDropped()
			return lastCount == expectedCount
		},
		func() string {
			return fmt.Sprintf("timed out waiting for dropped count to be %d - last count was %d",
				expectedCount, lastCount)
		})
}

I don't think that is quite right - I just tried it out, and got a timeout I didn't expect, which is hard to diagnose currently because Fail() doesn't seem to actually print the failure message or the location of the failure (I probably just don't understand how to use the expect framework). But that's basically how we've avoided timing-based test flakiness for asynchronous things in the past.

eli-darkly avatar Mar 18 '21 19:03 eli-darkly

I had a thought about reproducing the Stop problem, although I don't quite know how to apply the thought yet.

The specific condition the race detector is complaining about seems to be the one that's documented here, where a close(ch) happens after there's been a send on ch but before there's been a receive on ch. Here's an abbreviated version of the traces:

WARNING: DATA RACE
Write at 0x00c000308eb0 by goroutine 45:
  runtime.closechan()
  github.com/launchdarkly/ccache.(*Cache).Stop()

Previous read at 0x00c000308eb0 by goroutine 46:
  runtime.chansend()
  github.com/launchdarkly/ccache.(*Cache).promote()
  github.com/launchdarkly/ccache.(*Cache).Get()

So I'm guessing that the failure depends on the following conditions:

  1. an existing item is updated or deleted
  2. a value is pushed onto promotables or deletables
  3. the test ends and Stop is called before the worker goroutine has a chance to wake up and receive the value from promotables or deletables

If that's correct, then I would expect to be able to reproduce the issue more reliably if worker() were made artificially slow, for instance by adding time.Sleep(time.Millisecond*20) immediately before the select on line 229. However, I can't currently test that theory, because so much of the test logic is very timing-dependent due to the use of arbitrary small sleeps (as we were just discussing), so slowing anything down just breaks everything.

If I'm correctly reading the race detector docs I linked to (though I'm far from certain about that), one solution might be to add non-blocking receives on both promotables and deletables within the Stop method, before it closes those channels. If there aren't any values in those channels, that does no harm, and it satisfies the race detector's wish for a known ordering of events. If there are values, then we're just discarding something we wouldn't have done anything with anyway, since we're about to stop the worker.
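A rough sketch of that idea (channel and field names assumed for illustration, not taken from ccache's actual Stop):

// Sketch only. A non-blocking receive on each channel before closing it
// gives the race detector a receive that is ordered after any earlier send
// and before the close.
func (c *Cache) Stop() {
	select {
	case <-c.promotables: // discard a pending promotion, if any
	default:
	}
	select {
	case <-c.deletables: // discard a pending deletion, if any
	default:
	}
	close(c.promotables)
	close(c.deletables)
}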

But of course it won't be possible to verify that that's a fix until it's possible to reproduce the failure.

eli-darkly avatar Mar 18 '21 19:03 eli-darkly

Sorry, looking at that trace again from my previous comment, I don't think that is relevant. That's a different problem, one that is specific to my project— it was trying to read from the cache on a different goroutine than the one that was closing the cache, which of course is no good since there's no guarantee that the cache wouldn't have already been closed. That part is easy to fix in my own code, but I don't think that could have been the thing that was failing yesterday in the ccache tests. I wish I had saved the stacktraces from those. For what it's worth—though I don't seem to be getting much closer to actually having a solution—I think I know why this is so much easier to reproduce in my own project's tests than in the ccache tests (though there must be some way to do it in the latter, because yesterday it was easy).

eli-darkly avatar Mar 18 '21 19:03 eli-darkly

And... I'm sorry to leave so many comments, but this has made me think more about the concurrency issues around Stop, and I'm realizing that ccache is not as convenient for my use cases as I'd hoped.

That is: ccache operations can panic if you've already called Stop. I know you've documented that fact in the comments on the Stop method, but it's a non-trivial usability issue in any situation, such as library code, where the number of active goroutines that might be calling into this code isn't under my control. Even though the cache is thread-safe in terms of concurrent gets and sets, its overall lifecycle isn't: any goroutine that calls Stop will blow up any others that are currently in a code path that touches the cache. I think this means I'll have to add an RWMutex, with a read lock around all of my calls to Get or Set and a write lock around my call to Stop. That should prevent panics, but it's an extra layer of synchronization around calls that already contain synchronization themselves, which is unfortunate - not just because of the slight overhead of an extra RLock(), but because on general principle it's nice to stick to "don't put a lock around anything that might itself need to wait on a lock" whenever possible.

I mean, the underlying issue is the possibility of sending on a closed channel, and that's something people already have to worry about if they're writing any Go code that maintains a channel— there are well-defined patterns for that. It's just that one doesn't normally think of a Get on a thread-safe cache as involving a channel send, and it's common to use caches in ways that don't go well with those patterns.

So, I'm wondering if it might be possible to rearrange anything inside ccache to let it leverage its existing synchronization logic to make this more safe. If I can think of anything, I'll submit a PR, but this may just be a futile daydream.
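In the meantime, here's roughly what that wrapper would look like on my side (the ccache calls reflect the pre-generics API as I understand it, and the type name is mine):

// SafeCache guards the cache's lifecycle with an RWMutex so that Get/Set
// can never race with Stop: reads take the shared lock, Stop takes the
// exclusive lock and marks the cache as stopped.
type SafeCache struct {
	mu      sync.RWMutex
	cache   *ccache.Cache
	stopped bool
}

func NewSafeCache() *SafeCache {
	return &SafeCache{cache: ccache.New(ccache.Configure())}
}

func (s *SafeCache) Get(key string) *ccache.Item {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if s.stopped {
		return nil // already stopped; treat as a miss
	}
	return s.cache.Get(key)
}

func (s *SafeCache) Set(key string, value interface{}, ttl time.Duration) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if s.stopped {
		return // drop writes after Stop
	}
	s.cache.Set(key, value, ttl)
}

func (s *SafeCache) Stop() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.stopped {
		s.stopped = true
		s.cache.Stop()
	}
}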

eli-darkly avatar Mar 18 '21 21:03 eli-darkly

Regarding the test sync issues, I have a solution that might be better than the polling idea - I'll submit a PR.

eli-darkly avatar Mar 18 '21 22:03 eli-darkly

https://github.com/karlseguin/ccache/pull/61

Stop no longer works by closing the promotables channel. Instead, it uses a control message (control messages are relatively new, which I guess is why I didn't use them in the first place). The major downside is that instead of panicking, concurrent operations could block forever (because the worker is no longer receiving). To compensate:

  1. Stop is delayed (by a configurable time) to let concurrent work finish
  2. The control channel is now buffered
  3. The promotable path already uses a select/default no-op

Still, I'm not sure this is acceptable. I'd rather have it panic than hang. Further, when Stop is being used, "Control commands", which will now be documented as such, should be called sparingly - although this is generally a good idea anyway.

As an aside, the tests no longer cheat. There are now public GC and Size functions, which use the control channel. This helps eliminate test-specific race issues.
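In case it helps to see the shape of it, a stripped-down sketch of the control-message approach (illustrative names only, not the actual PR code):

// The worker exits when it receives a stop message, and the promotables
// channel is never closed, so concurrent sends can't panic.
type controlStop struct{ done chan struct{} }

type Cache struct {
	promotables chan *Item
	control     chan interface{} // buffered, so Stop's send doesn't block the caller
}

func (c *Cache) Stop() {
	done := make(chan struct{})
	c.control <- controlStop{done: done}
	<-done // wait for the worker to acknowledge before returning
}

func (c *Cache) worker() {
	for {
		select {
		case item := <-c.promotables:
			_ = item // promote the item (elided)
		case msg := <-c.control:
			if stop, ok := msg.(controlStop); ok {
				close(stop.done)
				return
			}
		}
	}
}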

karlseguin avatar Mar 19 '21 03:03 karlseguin

Took another stab at this. Cleaned up the control logic some; it no longer relies on closing channels to stop. https://github.com/karlseguin/ccache/pull/63

karlseguin avatar Mar 20 '21 16:03 karlseguin

I hadn't revisited this in a long time, so I tried updating my code to use the latest release of ccache (nice to have generics now) and I found that the problem is still there. I see that you've had a PR open since last year, which I hadn't seen at the time... unfortunately it's now out of date due to upstream changes.

eli-darkly avatar Jan 03 '23 19:01 eli-darkly

I re-created the control_stop branch to be mergeable on top of the current master, and created a new PR: https://github.com/karlseguin/ccache/pull/78

Can you give them a spin?

karlseguin avatar Jan 04 '23 02:01 karlseguin