[BUG] Goroutine leak after cache.Close()
Description
Even after calling Close() on a cache, goleak detects 4 goroutines still alive:
goleak: Errors on successful test run: found unexpected goroutines:
[Goroutine 37 in state select, with github.com/maypok86/otter/internal/unixtime.startTimer.func1 on top of the stack:
github.com/maypok86/otter/internal/unixtime.startTimer.func1()
/Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/unixtime/unixtime.go:43 +0xd0
created by github.com/maypok86/otter/internal/unixtime.startTimer in goroutine 35
/Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/unixtime/unixtime.go:39 +0xa8
Goroutine 38 in state sleep, with time.Sleep on top of the stack:
time.Sleep(0x3b9aca00)
/opt/homebrew/opt/go/libexec/src/runtime/time.go:300 +0xe0
github.com/maypok86/otter/internal/core.(*Cache[...]).cleanup(0x0)
/Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/core/cache.go:391 +0x28
created by github.com/maypok86/otter/internal/core.NewCache[...] in goroutine 35
/Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/core/cache.go:188 +0x694
Goroutine 39 in state sync.Cond.Wait, with sync.runtime_notifyListWait on top of the stack:
sync.runtime_notifyListWait(0x14000246018, 0x0)
/opt/homebrew/opt/go/libexec/src/runtime/sema.go:587 +0x154
sync.(*Cond).Wait(0x14000246008)
/opt/homebrew/opt/go/libexec/src/sync/cond.go:71 +0xcc
github.com/maypok86/otter/internal/queue.(*Growable[...]).Pop(0x100e69d60)
/Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/queue/growable.go:71 +0x9c
github.com/maypok86/otter/internal/core.(*Cache[...]).process(0x100e6e240)
/Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/core/cache.go:459 +0x44
created by github.com/maypok86/otter/internal/core.NewCache[...] in goroutine 35
/Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/core/cache.go:191 +0x6f0
Goroutine 40 in state sleep, with time.Sleep on top of the stack:
time.Sleep(0x3b9aca00)
/opt/homebrew/opt/go/libexec/src/runtime/time.go:300 +0xe0
github.com/maypok86/otter/internal/core.(*Cache[...]).cleanup(0x0)
/Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/core/cache.go:391 +0x28
created by github.com/maypok86/otter/internal/core.NewCache[...] in goroutine 35
/Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/core/cache.go:188 +0x694
]
This is caused by the lack of synchronization (or waiting) when closing the different goroutines.
unixtime/startTimer() a.k.a Goroutine 37
func startTimer() {
	done = make(chan struct{})
	atomic.StoreInt64(&startTime, time.Now().Unix())
	atomic.StoreUint32(&now, uint32(0))
	go func() {
		ticker := time.NewTicker(time.Second)
		defer ticker.Stop()
		for {
			select {
			case t := <-ticker.C:
				//nolint:gosec // there will never be an overflow
				atomic.StoreUint32(&now, uint32(t.Unix()-StartTime()))
			case <-done:
				return
			}
		}
	}()
}
A struct{}{} is sent to the channel to signal the goroutine to stop, but the caller can neither check nor wait for the goroutine to actually exit.
Also, the case <-done should always come first in such a for-select loop, to give priority to termination, avoid unnecessary work, and prevent resource leaks through consistency and predictability (returning before doing anything else).
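For illustration, here is a minimal sketch (not otter's actual code; the stopped channel and the names are mine, and the start-time subtraction is omitted) of a shutdown where the caller can wait until the goroutine has actually exited:
package unixtimesketch

import (
	"sync/atomic"
	"time"
)

var (
	now     uint32
	done    chan struct{}
	stopped chan struct{}
)

// startTimer launches the ticker goroutine and keeps a second channel that the
// goroutine closes on exit, so a caller can block until it has really returned.
func startTimer() {
	done = make(chan struct{})
	stopped = make(chan struct{})
	go func() {
		defer close(stopped) // signals "the goroutine has returned"
		ticker := time.NewTicker(time.Second)
		defer ticker.Stop()
		for {
			select {
			case t := <-ticker.C:
				// the real code stores t.Unix() minus the start time
				atomic.StoreUint32(&now, uint32(t.Unix()))
			case <-done:
				return
			}
		}
	}()
}

// stopTimer asks the goroutine to stop and waits for confirmation.
func stopTimer() {
	close(done)
	<-stopped // returns only once the goroutine is gone
}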
core/cleanup() a.k.a Goroutine 38 && Goroutine 40
func (c *Cache[K, V]) cleanup() {
	for {
		time.Sleep(time.Second)
		c.evictionMutex.Lock()
		if c.isClosed {
			c.evictionMutex.Unlock()
			return
		}
		c.expiryPolicy.DeleteExpired()
		c.evictionMutex.Unlock()
	}
}
This goroutine, started in NewCache() via go cache.cleanup(), can still be sleeping when Close() is called and the program exits.
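For illustration, a minimal sketch (my assumptions: a done channel owned by the cache and closed by Close(), plus a WaitGroup; not otter's actual code) of how the sleep could be made interruptible so Close() doesn't leave the goroutine sleeping:
package cachesketch

import (
	"sync"
	"time"
)

// Hypothetical simplified cache, not otter's implementation.
type cache struct {
	mu   sync.Mutex
	done chan struct{}  // closed by Close()
	wg   sync.WaitGroup // lets Close() wait for cleanup() to exit
}

func newCache() *cache {
	c := &cache{done: make(chan struct{})}
	c.wg.Add(1)
	go c.cleanup()
	return c
}

func (c *cache) cleanup() {
	defer c.wg.Done()
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-c.done:
			return // woken immediately when Close() is called
		case <-ticker.C:
			c.mu.Lock()
			// the real cache would call expiryPolicy.DeleteExpired() here
			c.mu.Unlock()
		}
	}
}

// Close wakes the cleanup goroutine and waits for it to exit.
func (c *cache) Close() {
	close(c.done)
	c.wg.Wait()
}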
queue/Pop() a.k.a Goroutine 39
func (g *Growable[T]) Pop() T {
	g.mutex.Lock()
	for g.count == 0 {
		g.notEmpty.Wait()
	}
	item := g.pop()
	g.mutex.Unlock()
	return item
}
It is stuck waiting for a sync.Cond signal.
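A minimal sketch (not the actual Growable implementation; the closed flag and the ok return value are mine) of how a Cond-based Pop can be unblocked when the queue is closed:
package queuesketch

import "sync"

// Hypothetical blocking queue: Close() sets a flag and broadcasts, so a goroutine
// parked in Pop() wakes up instead of waiting forever on the Cond.
type queue[T any] struct {
	mu       sync.Mutex
	notEmpty *sync.Cond
	items    []T
	closed   bool
}

func newQueue[T any]() *queue[T] {
	q := &queue[T]{}
	q.notEmpty = sync.NewCond(&q.mu)
	return q
}

func (q *queue[T]) Push(item T) {
	q.mu.Lock()
	q.items = append(q.items, item)
	q.mu.Unlock()
	q.notEmpty.Signal()
}

// Pop blocks until an item is available or the queue is closed.
func (q *queue[T]) Pop() (T, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 && !q.closed {
		q.notEmpty.Wait()
	}
	var zero T
	if len(q.items) == 0 { // woken by Close with nothing left to pop
		return zero, false
	}
	item := q.items[0]
	q.items = q.items[1:]
	return item, true
}

func (q *queue[T]) Close() {
	q.mu.Lock()
	q.closed = true
	q.mu.Unlock()
	q.notEmpty.Broadcast() // wake every goroutine blocked in Pop
}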
To Reproduce
Steps to reproduce the behavior:
cache, err := otter.MustBuilder[string, net.IP](1000).
	WithTTL(1 * time.Hour).
	Build()
if err != nil {
	t.Fatal(err)
}
cache.Close()
This will make only the cleanup() leak show up.
In the package I maintain, the cache is held in a struct and closed after all access to it is done, and it leaks on every unit test.
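Roughly, the leak check in those tests is wired like this (a sketch; the test name and package are illustrative, the check uses go.uber.org/goleak):
package dnscache_test

import (
	"net"
	"testing"
	"time"

	"github.com/maypok86/otter"
	"go.uber.org/goleak"
)

// Mirrors the repro above: build a cache, close it, then assert no goroutines leaked.
func TestDNSCacheClose(t *testing.T) {
	defer goleak.VerifyNone(t)

	cache, err := otter.MustBuilder[string, net.IP](1000).
		WithTTL(1 * time.Hour).
		Build()
	if err != nil {
		t.Fatal(err)
	}
	cache.Close()
}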
Expected behavior
No goroutine leaks.
Environment
Personal dev setup
- OS: MacOS 14.6.1 (23G93)
- ARCH: arm64
- CPU: Apple M1 Ultra
and
Github Actions environment
- OS: Ubuntu 24.04
- ARCH: amd64
- CPU: AMD EPYC 7763 64-Core Processor
Hi, I know about one of the cases, but I don't know how to reproduce the other two (and they look extremely strange).
But let's take it in order.
unixtime/startTimer() a.k.a Goroutine 37
A struct{}{} is sent to the channel to signal the goroutine to stop, but the caller can neither check nor wait for the goroutine to actually exit.
I disagree, the caller is doing literally this.
Also, the case <-done should always come first in such a for-select loop, to give priority to termination, avoid unnecessary work, and prevent resource leaks through consistency and predictability (returning before doing anything else).
To be honest, I've never seen anyone do this, as it just complicates the code without any real benefit.
It is also worth bearing in mind that the timer is shared between multiple cache instances.
core/cleanup() a.k.a Goroutine 38 && Goroutine 40
This one already looks plausible. It happens because a simple time.Sleep is used and the goroutine exits a bit later. You can check this by simply adding a delay before checking for leaks. I discussed this once, and in the end we agreed that there is no fundamental benefit from adding more accurate synchronization, since the GC works lazily, usually starting every 2 minutes. If this is a problem for you, then I can add more precise synchronization.
queue/Pop() a.k.a Goroutine 39
This is not possible at all, since this goroutine is completed before the end of the Close operation.
P.S. I've looked at your code a bit and it seems you just don't call Close anywhere.
unixtime/startTimer()
I disagree, the caller is doing literally this
// Stop should be called when closing and stopping the cache instance to stop the timer.
func Stop() {
mutex.Lock()
defer mutex.Unlock()
countInstance--
if countInstance == 0 {
done <- struct{}{}
close(done)
}
}
I can't see where Stop() checks that startTimer()'s anonymous goroutine successfully returned. I might be missing something.
It is also worth bearing in mind that the timer is shared between multiple cache instances.
Wasn't aware of that.
core/cleanup()
You can check this by simply adding a delay before checking for leaks.
I tried with a 5s sleep after closing the cache and before returning from the test; the leaks still happen:
goleak: Errors on successful test run: found unexpected goroutines:
[Goroutine 21 in state select, with github.com/maypok86/otter/internal/unixtime.startTimer.func1 on top of the stack:
github.com/maypok86/otter/internal/unixtime.startTimer.func1()
/Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/unixtime/unixtime.go:43 +0xd0
created by github.com/maypok86/otter/internal/unixtime.startTimer in goroutine 19
/Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/unixtime/unixtime.go:39 +0xa8
Goroutine 24 in state sleep, with time.Sleep on top of the stack:
time.Sleep(0x3b9aca00)
/opt/homebrew/opt/go/libexec/src/runtime/time.go:300 +0xe0
github.com/maypok86/otter/internal/core.(*Cache[...]).cleanup(0x0)
/Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/core/cache.go:391 +0x28
created by github.com/maypok86/otter/internal/core.NewCache[...] in goroutine 19
/Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/core/cache.go:188 +0x694
Goroutine 25 in state sync.Cond.Wait, with sync.runtime_notifyListWait on top of the stack:
sync.runtime_notifyListWait(0x140002320d8, 0x0)
/opt/homebrew/opt/go/libexec/src/runtime/sema.go:587 +0x154
sync.(*Cond).Wait(0x140002320c8)
/opt/homebrew/opt/go/libexec/src/sync/cond.go:71 +0xcc
github.com/maypok86/otter/internal/queue.(*Growable[...]).Pop(0x100721e20)
/Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/queue/growable.go:71 +0x9c
github.com/maypok86/otter/internal/core.(*Cache[...]).process(0x100726300)
/Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/core/cache.go:459 +0x44
created by github.com/maypok86/otter/internal/core.NewCache[...] in goroutine 19
/Users/thomas/go/pkg/mod/github.com/maypok86/[email protected]/internal/core/cache.go:191 +0x6f0
]
If this is a problem for you, then I can add more precise synchronization.
Well, I'm just extra cautious with goroutine leaks and so are my CI tests; the cache is ringing all the bells :/ If it's not too time-consuming for you, I'm all for it, or I can work on a PR to synchronize the goroutines that leak.
queue/Pop()
This is not possible at all, since this goroutine is completed before the end of the Close operation.
It happens on my unit tests which:
- call 1 Get() then 1 Set()
- call 1 Get() and no Set()
- no Get() or Set()
It suggests that maybe the goroutine will stay stuck waiting on a sync.Cond signal if no activity is done OR if the duration of the test is too short.
P.S. I've looked at your code a bit and it seems you just don't call Close anywhere.
The architecture of this library is literally hell on earth, but FWIW here's how the cache is closed:
- CustomHTTPClient.Close() is called either by the client or in the dns_tests.go by the cleanup() function returned by setup()
- c.closeDNSCache() is called at the end of CustomHTTPClient.Close()
- customDialer.DNSRecords.Close() (Cache.Close()) is then called by c.closeDNSCache()
c.closeDNSCache() is a function variable created in NewWARCWritingHTTPClient() that calls Cache.Close().
It has been made that way because the Cache is held in a custom dialer to act as a DNS cache for the library; being held in the dialer, we lose its reference after the dialer is used to make a custom transport that is then held in the CustomHTTPClient.
It's chaotic, but using a debugger with a breakpoint on c.closeDNSCache() I can see the call being made; that's how I first troubleshot the issue.
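To make that chain concrete, a rough sketch (the names follow the description above; everything else is simplified and illustrative, and the customDialer holding DNSRecords is collapsed into a closure here):
package clientsketch

import (
	"context"
	"net"
	"net/http"
	"time"

	"github.com/maypok86/otter"
)

type CustomHTTPClient struct {
	http.Client
	closeDNSCache func() // function variable set in NewWARCWritingHTTPClient()
}

func NewWARCWritingHTTPClient() (*CustomHTTPClient, error) {
	// the otter cache acting as a DNS cache, normally held by the custom dialer
	dnsRecords, err := otter.MustBuilder[string, net.IP](1000).
		WithTTL(1 * time.Hour).
		Build()
	if err != nil {
		return nil, err
	}
	dialer := &net.Dialer{}
	c := &CustomHTTPClient{}
	// capture a way to close the cache before the dialer is buried in the transport
	c.closeDNSCache = func() { dnsRecords.Close() }
	c.Transport = &http.Transport{
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			// the real dialer consults the DNS cache here before dialing
			return dialer.DialContext(ctx, network, addr)
		},
	}
	return c, nil
}

func (c *CustomHTTPClient) Close() {
	c.CloseIdleConnections()
	c.closeDNSCache() // ends up calling Cache.Close()
}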
Can you reproduce the 2 cases you didn't know about with this?
cache, err := otter.MustBuilder[string, net.IP](1000).
	WithTTL(1 * time.Hour).
	Build()
if err != nil {
	t.Fatal(err)
}
cache.Close()
It suggests that maybe the goroutine will stay stuck waiting on a sync.Cond signal if no activity is done OR if the duration of the test is too short.
The problem is that if Close completes, then queue.Pop simply cannot be executed. This means that the cache has continued to live...
Can you reproduce the 2 cases you didn't know about with this?
Unfortunately, no.
@equals215 Lol, you seem to be creating two cache instances.
Didn't realize that... I just plugged my stuff into the existing tests, but wtf are those 😅
Yep, so the 2 cache instances (of which only 1 is closed) are responsible for the 2 "unknown leaks" — which aren't leaks.
That only leaves us with the unixtime/startTimer() leak that you know about.
Sorry for taking up your time.
@equals215
Also, the case <-done should always come first in such a for-select loop, to give priority to termination, avoid unnecessary work, and prevent resource leaks through consistency and predictability (returning before doing anything else).
This is not correct; from the Go spec: https://go.dev/ref/spec#Select_statements
If one or more of the communications can proceed, a single one that can proceed is chosen via a uniform pseudo-random selection. Otherwise, if there is a default case, that case is chosen. If there is no default case, the "select" statement blocks until at least one of the communications can proceed.
So the case is chosen randomly.
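A minimal, self-contained illustration (not from otter) that both ready cases get picked, so ordering the cases gives no priority:
package main

import "fmt"

func main() {
	done := make(chan struct{})
	close(done) // always ready
	work := make(chan int, 1)
	work <- 1 // also ready

	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		select {
		case <-done:
			counts["done"]++
		case <-work:
			counts["work"]++
			work <- 1 // refill so this case stays ready
		}
	}
	fmt.Println(counts) // both keys appear, roughly evenly
}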
I stand corrected; I checked what I assumed to be true shortly after and saw that it is pseudo-random, not ordered/prioritized.
I believe this issue is resolved. If that's not the case, please let me know.