
[BUG] Routine leak after cache.Close()

Open equals215 opened this issue 10 months ago • 9 comments

Description

Even after calling Close() on a cache, goleak detects 4 goroutines still alive:

goleak: Errors on successful test run: found unexpected goroutines:
[Goroutine 37 in state select, with github.com/maypok86/otter/internal/unixtime.startTimer.func1 on top of the stack:
github.com/maypok86/otter/internal/unixtime.startTimer.func1()
        /Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/unixtime/unixtime.go:43 +0xd0
created by github.com/maypok86/otter/internal/unixtime.startTimer in goroutine 35
        /Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/unixtime/unixtime.go:39 +0xa8
 Goroutine 38 in state sleep, with time.Sleep on top of the stack:
time.Sleep(0x3b9aca00)
        /opt/homebrew/opt/go/libexec/src/runtime/time.go:300 +0xe0
github.com/maypok86/otter/internal/core.(*Cache[...]).cleanup(0x0)
        /Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/core/cache.go:391 +0x28
created by github.com/maypok86/otter/internal/core.NewCache[...] in goroutine 35
        /Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/core/cache.go:188 +0x694
 Goroutine 39 in state sync.Cond.Wait, with sync.runtime_notifyListWait on top of the stack:
sync.runtime_notifyListWait(0x14000246018, 0x0)
        /opt/homebrew/opt/go/libexec/src/runtime/sema.go:587 +0x154
sync.(*Cond).Wait(0x14000246008)
        /opt/homebrew/opt/go/libexec/src/sync/cond.go:71 +0xcc
github.com/maypok86/otter/internal/queue.(*Growable[...]).Pop(0x100e69d60)
        /Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/queue/growable.go:71 +0x9c
github.com/maypok86/otter/internal/core.(*Cache[...]).process(0x100e6e240)
        /Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/core/cache.go:459 +0x44
created by github.com/maypok86/otter/internal/core.NewCache[...] in goroutine 35
        /Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/core/cache.go:191 +0x6f0
 Goroutine 40 in state sleep, with time.Sleep on top of the stack:
time.Sleep(0x3b9aca00)
        /opt/homebrew/opt/go/libexec/src/runtime/time.go:300 +0xe0
github.com/maypok86/otter/internal/core.(*Cache[...]).cleanup(0x0)
        /Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/core/cache.go:391 +0x28
created by github.com/maypok86/otter/internal/core.NewCache[...] in goroutine 35
        /Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/core/cache.go:188 +0x694
]

This is caused by the lack of synchronization (or waiting) when closing the different goroutines.

unixtime/startTimer() a.k.a Goroutine 37

func startTimer() {
	done = make(chan struct{})
	atomic.StoreInt64(&startTime, time.Now().Unix())
	atomic.StoreUint32(&now, uint32(0))

	go func() {
		ticker := time.NewTicker(time.Second)
		defer ticker.Stop()
		for {
			select {
			case t := <-ticker.C:
				//nolint:gosec // there will never be an overflow
				atomic.StoreUint32(&now, uint32(t.Unix()-StartTime()))
			case <-done:
				return
			}
		}
	}()
}

A struct{}{} is sent to the channel to signal the goroutine to stop, but the caller can neither check nor wait for the goroutine to actually exit.
Also, the case <-done should always come first in such a for-select loop to ensure priority for termination, avoid unnecessary work, and prevent resource leaks through consistency and predictability (returning before doing anything else).
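
For illustration only (this is not otter's code, and timerWG/stopTimer are hypothetical names): a minimal sketch of how the caller could actually wait for the timer goroutine to exit, using a package-level sync.WaitGroup:

package unixtime // sketch only; mirrors the shape of otter's internal/unixtime

import (
	"sync"
	"sync/atomic"
	"time"
)

var (
	done      chan struct{}
	startTime int64
	now       uint32
	timerWG   sync.WaitGroup // hypothetical addition: lets the caller wait for exit
)

func StartTime() int64 { return atomic.LoadInt64(&startTime) }

func startTimer() {
	done = make(chan struct{})
	atomic.StoreInt64(&startTime, time.Now().Unix())
	atomic.StoreUint32(&now, 0)

	timerWG.Add(1)
	go func() {
		defer timerWG.Done() // the goroutine announces its own exit
		ticker := time.NewTicker(time.Second)
		defer ticker.Stop()
		for {
			select {
			case t := <-ticker.C:
				atomic.StoreUint32(&now, uint32(t.Unix()-StartTime()))
			case <-done:
				return
			}
		}
	}()
}

// stopTimer is a hypothetical counterpart: it signals the goroutine and
// blocks until it has actually returned.
func stopTimer() {
	close(done)
	timerWG.Wait()
}

With something like this, Stop() could call stopTimer() and be sure the goroutine is gone before returning.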

core/cleanup() a.k.a Goroutine 38 && Goroutine 40

func (c *Cache[K, V]) cleanup() {
	for {
		time.Sleep(time.Second)

		c.evictionMutex.Lock()
		if c.isClosed {
			c.evictionMutex.Unlock()
			return
		}

		c.expiryPolicy.DeleteExpired()

		c.evictionMutex.Unlock()
	}
}

This goroutine, started in NewCache() via go cache.cleanup(), can still be sleeping when Close() is called and the program exits.
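
For illustration, one way to let Close() interrupt that sleep would be to drive the loop with a ticker and a done channel. This is only a sketch: it reuses the fields from the snippet above and assumes a hypothetical doneClear channel that Close() would close (not an actual otter field):

func (c *Cache[K, V]) cleanup() {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-c.doneClear: // hypothetical channel closed by Close()
			return
		case <-ticker.C:
			c.evictionMutex.Lock()
			if c.isClosed {
				c.evictionMutex.Unlock()
				return
			}
			c.expiryPolicy.DeleteExpired()
			c.evictionMutex.Unlock()
		}
	}
}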

queue/Pop() a.k.a Goroutine 39

func (g *Growable[T]) Pop() T {
	g.mutex.Lock()
	for g.count == 0 {
		g.notEmpty.Wait()
	}
	item := g.pop()
	g.mutex.Unlock()
	return item
}

It is stuck waiting for a sync.Cond signal.
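
For reference, the usual way to make a Cond-based Pop interruptible on shutdown is a closed flag plus Broadcast. The following is a generic, self-contained sketch of that pattern (it is not otter's Growable queue):

package queuesketch

import "sync"

// BlockingQueue is a generic sketch showing how a Cond-based Pop
// can be unblocked when the queue is closed.
type BlockingQueue[T any] struct {
	mu       sync.Mutex
	notEmpty *sync.Cond
	items    []T
	closed   bool
}

func NewBlockingQueue[T any]() *BlockingQueue[T] {
	q := &BlockingQueue[T]{}
	q.notEmpty = sync.NewCond(&q.mu)
	return q
}

// Push adds an item and wakes one waiting Pop.
func (q *BlockingQueue[T]) Push(item T) {
	q.mu.Lock()
	q.items = append(q.items, item)
	q.mu.Unlock()
	q.notEmpty.Signal()
}

// Pop blocks until an item is available or the queue is closed;
// the boolean is false once the queue is closed and drained.
func (q *BlockingQueue[T]) Pop() (T, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 && !q.closed {
		q.notEmpty.Wait()
	}
	if len(q.items) == 0 {
		var zero T
		return zero, false
	}
	item := q.items[0]
	q.items = q.items[1:]
	return item, true
}

// Close wakes every goroutine blocked in Pop so it can return.
func (q *BlockingQueue[T]) Close() {
	q.mu.Lock()
	q.closed = true
	q.mu.Unlock()
	q.notEmpty.Broadcast()
}

A consumer goroutine would then exit its loop as soon as Pop reports that the queue is closed.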

To Reproduce

Steps to reproduce the behavior:

cache, err := otter.MustBuilder[string, net.IP](1000).
	WithTTL(1 * time.Hour).
	Build()
if err != nil {
	t.Fatal(err)
}

cache.Close()

This will make only the cleanup() leak show up. In the package I maintain, the cache is held in a struct and closed once all access to it is done, and it leaks on every unit test.
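
For reference, the actual leak check is just the snippet above wrapped in a goleak assertion; a minimal, self-contained version (the test and package names are made up) looks like this:

package dnscache_test // hypothetical test package name

import (
	"net"
	"testing"
	"time"

	"github.com/maypok86/otter"
	"go.uber.org/goleak"
)

func TestCacheClose(t *testing.T) {
	// Fails the test if any unexpected goroutine is still running at the end.
	defer goleak.VerifyNone(t)

	cache, err := otter.MustBuilder[string, net.IP](1000).
		WithTTL(1 * time.Hour).
		Build()
	if err != nil {
		t.Fatal(err)
	}

	cache.Close()
}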

Expected behavior

No goroutine leaks.

Environment

Personal dev setup

  • OS: MacOS 14.6.1 (23G93)
  • ARCH: arm64
  • CPU: Apple M1 Ultra

and

Github Actions environment

  • OS: Ubuntu 24.04
  • ARCH: amd64
  • CPU: AMD EPYC 7763 64-Core Processor

equals215 · Feb 17 '25 22:02

Hi, I know about one of the cases, but I don't know how to reproduce the other two (and they look extremely strange).

But let's take it in order.

unixtime/startTimer() a.k.a Goroutine 37

A struct{}{} is sent to the channel to signal the goroutine to stop, but the caller can neither check nor wait for the goroutine to actually exit.

I disagree, the caller is doing literally this.

Also, the case <-done should always come first in such a for-select loop to ensure priority for termination, avoid unnecessary work, and prevent resource leaks through consistency and predictability (returning before doing anything else).

To be honest, I've never seen anyone do this, as it just complicates the code without any real benefit.

It is also worth bearing in mind that the timer is shared between multiple cache instances.

core/cleanup() a.k.a Goroutine 38 && Goroutine 40

This one looks plausible. It happens because a plain time.Sleep is used, so the goroutine exits a bit later than Close(). You can check this by simply adding a delay before checking for leaks. I discussed this once before, and in the end we agreed that there is no fundamental benefit to more precise synchronization, since the GC works lazily, usually starting every 2 minutes. If this is a problem for you, then I can add more precise synchronization.
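
Concretely, I mean something like this at the end of the test (just a sketch; the two-second pause is arbitrary, slightly longer than cleanup's one-second sleep):

cache.Close()

// Give the time.Sleep-based cleanup goroutine a chance to observe isClosed
// and return before the leak detector inspects the remaining goroutines.
time.Sleep(2 * time.Second)

goleak.VerifyNone(t)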

queue/Pop() a.k.a Goroutine 39

This is not possible at all, since this goroutine is completed before the end of the Close operation.

P.S. I've looked at your code a bit and it seems you just don't call Close anywhere.

maypok86 · Feb 18 '25 18:02

unixtime/startTimer()

I disagree, the caller is doing literally this

// Stop should be called when closing and stopping the cache instance to stop the timer.
func Stop() {
	mutex.Lock()
	defer mutex.Unlock()

	countInstance--
	if countInstance == 0 {
		done <- struct{}{}
		close(done)
	}
}

I can't see where Stop() checks that startTimer()'s anonymous goroutine actually returned. I might be missing something.

It is also worth bearing in mind that the timer is shared between multiple cache instances.

Wasn't aware of that.

core/cleanup()

You can check this by simply adding a delay before checking for leaks.

I tried with a 5s sleep after closing the cache and before returning from the test; the leaks still happen:

goleak: Errors on successful test run: found unexpected goroutines:
[Goroutine 21 in state select, with github.com/maypok86/otter/internal/unixtime.startTimer.func1 on top of the stack:
github.com/maypok86/otter/internal/unixtime.startTimer.func1()
        /Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/unixtime/unixtime.go:43 +0xd0
created by github.com/maypok86/otter/internal/unixtime.startTimer in goroutine 19
        /Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/unixtime/unixtime.go:39 +0xa8
 Goroutine 24 in state sleep, with time.Sleep on top of the stack:
time.Sleep(0x3b9aca00)
        /opt/homebrew/opt/go/libexec/src/runtime/time.go:300 +0xe0
github.com/maypok86/otter/internal/core.(*Cache[...]).cleanup(0x0)
        /Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/core/cache.go:391 +0x28
created by github.com/maypok86/otter/internal/core.NewCache[...] in goroutine 19
        /Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/core/cache.go:188 +0x694
 Goroutine 25 in state sync.Cond.Wait, with sync.runtime_notifyListWait on top of the stack:
sync.runtime_notifyListWait(0x140002320d8, 0x0)
        /opt/homebrew/opt/go/libexec/src/runtime/sema.go:587 +0x154
sync.(*Cond).Wait(0x140002320c8)
        /opt/homebrew/opt/go/libexec/src/sync/cond.go:71 +0xcc
github.com/maypok86/otter/internal/queue.(*Growable[...]).Pop(0x100721e20)
        /Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/queue/growable.go:71 +0x9c
github.com/maypok86/otter/internal/core.(*Cache[...]).process(0x100726300)
        /Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/core/cache.go:459 +0x44
created by github.com/maypok86/otter/internal/core.NewCache[...] in goroutine 19
        /Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/core/cache.go:191 +0x6f0
]

If this is a problem for you, then I can add more precise synchronization.

Well, I'm just extra cautious about goroutine leaks and so are my CI tests; the cache is ringing all the bells :/ If it's not too time-consuming for you, I'm all for it, or I can work on a PR to synchronize the goroutines that leak.

queue/Pop()

This is not possible at all, since this goroutine is completed before the end of the Close operation.

It happens in my unit tests, which:

  • call 1 Get() then 1 Set()
  • call 1 Get() and no Set()
  • call no Get() or Set()

It suggests that maybe the goroutine stays stuck waiting on the sync.Cond signal if no activity happens OR if the test runs too briefly.

P.S. I've looked at your code a bit and it seems you just don't call Close anywhere.

The architecture of this library is literally hell on earth, but FWIW here's how the cache is closed:

  1. CustomHTTPClient.Close() is called either by the client or in the dns_tests.go by the cleanup() function returned by setup()
  2. c.closeDNSCache() is called at the end of CustomHTTPClient.Close()
  3. customDialer.DNSRecords.Close() (Cache.Close()) is then called by c.closeDNSCache()

c.closeDNSCache() is a function variable created in NewWARCWritingHTTPClient() that calls Cache.Close(). It was made that way because the cache is held in a custom dialer to act as a DNS cache for the library; since it is held in the dialer, we lose its reference after the dialer is used to build a custom transport that is then held in the CustomHTTPClient. It's chaotic, but using a debugger with a breakpoint on c.closeDNSCache() I can see the call being made; that's how I first troubleshot the issue.
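
To make that more concrete, here is a heavily simplified, partly hypothetical sketch of the wiring described above (the type and function names come from the comment, but the structure and the dnsRecords parameter are made up for illustration):

package warc // hypothetical package name; heavily simplified sketch

import (
	"net"
	"net/http"

	"github.com/maypok86/otter"
)

// customDialer holds the otter cache used as a DNS cache.
type customDialer struct {
	net.Dialer
	DNSRecords otter.Cache[string, net.IP]
}

// CustomHTTPClient loses direct access to the dialer, so it keeps a closure
// that can still reach the cache's Close.
type CustomHTTPClient struct {
	*http.Client
	closeDNSCache func()
}

func NewWARCWritingHTTPClient(dnsRecords otter.Cache[string, net.IP]) *CustomHTTPClient {
	dialer := &customDialer{DNSRecords: dnsRecords}
	transport := &http.Transport{DialContext: dialer.DialContext}
	return &CustomHTTPClient{
		Client: &http.Client{Transport: transport},
		// Capture the cache here, because after this point only the transport
		// (not the dialer) is referenced by the client.
		closeDNSCache: func() { dialer.DNSRecords.Close() },
	}
}

func (c *CustomHTTPClient) Close() {
	// ... other teardown ...
	c.closeDNSCache()
}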

equals215 · Feb 18 '25 19:02

Can you reproduce the 2 cases you didn't know about with the following?

cache, err := otter.MustBuilder[string, net.IP](1000).
	WithTTL(1 * time.Hour).
	Build()
if err != nil {
	t.Fatal(err)
}

cache.Close()

equals215 · Feb 18 '25 19:02

It suggests that maybe the goroutine stays stuck waiting on the sync.Cond signal if no activity happens OR if the test runs too briefly.

The problem is that if Close has completed, then queue.Pop simply cannot still be running. This means that the cache has continued to live...

Can you reproduce the 2 cases you didn't know about with the following?

Unfortunately, no.

maypok86 · Feb 18 '25 20:02

@equals215 Lol, you seem to be creating two cache instances.

maypok86 · Feb 18 '25 20:02

Didn't realize that... I just plugged my stuff into the existing tests, but wtf are those 😅

equals215 · Feb 18 '25 20:02

Yep, so the 2 cache instances (of which only 1 is closed) are responsible for the 2 "unknown leaks", which aren't actually leaks. That only leaves the unixtime/startTimer() case that you already know about. Sorry for wasting your time.

equals215 · Feb 18 '25 20:02

@equals215

Also, the case <-done should always come first in such a for-select loop to ensure priority for termination, avoid unnecessary work, and prevent resource leaks through consistency and predictability (returning before doing anything else).

This is not correct, from go spec: https://go.dev/ref/spec#Select_statements

If one or more of the communications can proceed, a single one that can proceed is chosen via a uniform pseudo-random selection. Otherwise, if there is a default case, that case is chosen. If there is no default case, the "select" statement blocks until at least one of the communications can proceed.

So the case is chosen randomly.
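
A tiny self-contained demonstration, with both cases permanently ready:

package main

import "fmt"

func main() {
	a := make(chan struct{})
	b := make(chan struct{})
	close(a) // receiving from a closed channel is always ready
	close(b)

	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		select {
		case <-a:
			counts["a"]++
		case <-b:
			counts["b"]++
		}
	}
	// Both cases win roughly half the time; listing <-a first gives it no priority.
	fmt.Println(counts)
}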

mikhailv · Feb 26 '25 13:02

I stand corrected; I checked what I assumed to be true shortly after and saw that it is pseudo-random and not ordered/prioritized.

equals215 · Feb 26 '25 13:02

I believe this issue is resolved. If that's not the case, please let me know.

maypok86 · Jun 18 '25 09:06