[BUG] Loader of the same key is called multiple times
Description
Multiple concurrent calls to the Get method for the same key can sometimes result in multiple loader calls.
To Reproduce
Run the following code:
package main

import (
	"context"
	"sync"
	"sync/atomic"
	"time"

	"github.com/maypok86/otter/v2"
)

func main() {
	cache := otter.Must(&otter.Options[string, int]{
		MaximumSize:      1000,
		ExpiryCalculator: otter.ExpiryWriting[string, int](5 * time.Minute),
		OnAtomicDeletion: func(e otter.DeletionEvent[string, int]) {
			println(e.Cause.String())
		},
	})
	loader := otter.LoaderFunc[string, int](func(ctx context.Context, key string) (int, error) {
		defer println("Loaded")
		_, ok := cache.GetIfPresent(key)
		println("Key exists: ", ok)
		println("Loading...")
		return 1, nil
	})

	// Simulate concurrent load.
	ctx := context.Background()
	var wg sync.WaitGroup
	var sum atomic.Int64
	runs := 100_000
	for i := 0; i < runs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			i, err := cache.Get(ctx, "key", loader)
			if err != nil {
				panic("wtf?")
			}
			sum.Add(int64(i))
		}()
	}
	wg.Wait()
	if sum.Load() != int64(runs) {
		panic("wtf???")
	}

	// Expect "Loaded" to be printed once; however, sometimes it is printed multiple times:
	// Key exists: false
	// Loading...
	// Loaded
	// Key exists: false
	// Loading...
	// Loaded
	// Replacement
	// Key exists: true
	// Loading...
	// Loaded
	// Replacement
	// Key exists: true
	// Loading...
	// Loaded
	// Replacement
}
Expected behavior
Multiple concurrent calls to the Get method for the same key should load the key only once.
Environment
- OS: Win 11
- Go 1.25.1
Additional context
Increasing the number of goroutines (runs) makes the bug happen more frequently.
Hi, it seems like this is quite expected behavior. The thing is, Get has similar semantics to singleflight, and since your example launches a lot of goroutines that perform a very small amount of work, some might start their execution at the moment when other goroutines have already finished. As a result, Get has no way of knowing that the goroutines were launched "simultaneously".
Here's your example, for instance, where otter is replaced with singleflight:
package main

import (
	"sync"
	"sync/atomic"

	"golang.org/x/sync/singleflight"
)

func main() {
	var (
		s        singleflight.Group
		mutex    sync.Mutex
		numCalls atomic.Int64
		m        = make(map[string]any)
		key      = "key"
		value    = 1
	)
	loader := func() (any, error) {
		defer println("Loaded")
		mutex.Lock()
		_, ok := m[key]
		if !ok {
			m[key] = value
		}
		mutex.Unlock()
		println("Key exists: ", ok)
		println("Loading...")
		numCalls.Add(1)
		return value, nil
	}

	// Simulate concurrent load.
	var wg sync.WaitGroup
	var sum atomic.Int64
	runs := 100_000
	for i := 0; i < runs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			i, err, _ := s.Do(key, loader)
			if err != nil {
				panic("wtf?")
			}
			sum.Add(int64(i.(int)))
		}()
	}
	wg.Wait()
	if sum.Load() != int64(runs) {
		panic("wtf???")
	}
	println(numCalls.Load(), sum.Load())
}
If you add a small delay (time.Sleep) to simulate a long computation, everything will work as you expect.
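For instance, replacing the loader in your reproduction with a variant like the one below (the 100ms value is arbitrary) should keep the load in flight long enough for the later goroutines to join the existing call:

loader := otter.LoaderFunc[string, int](func(ctx context.Context, key string) (int, error) {
	// Keep the load in flight so that goroutines starting later join
	// the existing call instead of beginning a new one.
	time.Sleep(100 * time.Millisecond)
	return 1, nil
})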
Thanks for your detailed explanation.
That behavior is expected in the case of singleflight.
But in the case of caching, I would expect the goroutine that performs the loading to set the value in the cache before it returns, so that other goroutines calling Get do not need to perform the load again.
Get calls during the load will wait for the load to finish. Get calls after the load finishes should already see the result in the cache, so they never need to touch the loader.
In fact, if you replace Get with Compute for the loading, everything works as expected:
// ...
wg.Add(1)
go func() {
	defer wg.Done()
	i, _ := cache.Compute("key", func(oldValue int, found bool) (newValue int, op otter.ComputeOp) {
		if !found {
			// The key does not exist yet, so execute the loader.
			v, _ := loader(ctx, "key")
			return v, otter.WriteOp
		}
		return oldValue, otter.CancelOp
	})
	sum.Add(int64(i))
}()
// ...
About the delay: in our internal test where I discovered this behavior, the loader actually has a 100ms sleep and still gets called multiple times.
Actually, I don't know of any cache in Go that has cache stampede protection in the exact form you want. And the reason is quite simple.
Typically, cache.Get(ctx, key, loader) is implemented roughly as follows:
type Cache[K comparable, V any] struct {
	data         sync.Map
	singleflight singleflight.Group
}

// ...

func (c *Cache[K, V]) Get(ctx context.Context, key K, loader otter.Loader[K, V]) (V, error) {
	value, ok := c.data.Load(key)
	if ok {
		return value.(V), nil
	}
	// Some goroutine might reach here after singleflight.Do has already been
	// executed for the required key, and in that case the loader will be
	// called multiple times.
	value, err, _ := c.singleflight.Do(fmt.Sprint(key), func() (any, error) { // singleflight keys are strings
		v, err := loader.Load(ctx, key)
		if err != nil {
			return nil, err
		}
		c.data.Store(key, v)
		return v, nil
	})
	if err != nil {
		var zero V
		return zero, err
	}
	return value.(V), nil
}
Why is it done this way and can it be fixed? To be honest, it's complicated.
In theory, one could check for the presence of the item in the cache before calling the loader (similar to how sync.Once works), but this leads to the problem of needing to acquire a lock on the entry for that specific key; otherwise, the loader could still be called twice.
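For concreteness, a rough sketch of that per-key locking with a double-check might look like the following. The types and names here are illustrative (this is not otter's implementation), and the lock map is never cleaned up, which is part of what makes the real thing hard:

type keyedCache[K comparable, V any] struct {
	mu    sync.Mutex
	locks map[K]*sync.Mutex // per-key locks; never cleaned up in this sketch
	data  map[K]V
}

func newKeyedCache[K comparable, V any]() *keyedCache[K, V] {
	return &keyedCache[K, V]{locks: make(map[K]*sync.Mutex), data: make(map[K]V)}
}

func (c *keyedCache[K, V]) Get(ctx context.Context, key K, load func(context.Context, K) (V, error)) (V, error) {
	// Fast path: the value is already cached.
	c.mu.Lock()
	if v, ok := c.data[key]; ok {
		c.mu.Unlock()
		return v, nil
	}
	// Slow path: take (or create) the lock for this specific key.
	l, ok := c.locks[key]
	if !ok {
		l = &sync.Mutex{}
		c.locks[key] = l
	}
	c.mu.Unlock()

	l.Lock()
	defer l.Unlock()
	// Double-check under the per-key lock: another goroutine may have
	// finished loading while we were waiting for it.
	c.mu.Lock()
	if v, ok := c.data[key]; ok {
		c.mu.Unlock()
		return v, nil
	}
	c.mu.Unlock()

	v, err := load(ctx, key)
	if err != nil {
		var zero V
		return zero, err
	}
	c.mu.Lock()
	c.data[key] = v
	c.mu.Unlock()
	return v, nil
}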
One could use a Compute method, as you have done, but this also has its issues:
- Compute always locks the entire bucket, which slows down retrievals for cache hits.
- Since it locks a hash table bucket, performance issues can arise if several cache misses happen to fall into the same bucket simultaneously. Yes, only a single bucket is locked, not an entire large shard as would be the case in other implementations. However, there's still a chance that a bucket contains, say, 4 entries, and if two cache misses for keys in that bucket occur at the same time, their loaders will be executed sequentially.
In the case of BulkLoad, solving such problems would require building an orchestrator of extreme complexity.
Well, to be honest, it's quite difficult to actually encounter the problems with singleflight or Compute in real life.
fwiw, Caffeine’s get(key, loader) does an optimistic getIfPresent that falls back to a locking compute. That has the unfortunate collision behavior that you mentioned, so long loads can halt eviction or delay other loads.
The simplest way to minimize that is to set a higher initialCapacity for the hash table (unrelated to the max size) to increase the hash bins used for locking. Alternatively, the async cache stores the future as the cached value, which allows for per entry locking and stampede protection of bulk loads. It’s a messy set of tradeoffs that surprises users and there’s no perfect solution.
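In Go terms, that future-based approach might look roughly like the sketch below. It is illustrative only (not otter's or Caffeine's actual code) and skips eviction, expiration, and capacity entirely; the point is that the map lock is held only while the future is published, so loads block per entry rather than per bucket:

// entry is a future-like cell: done is closed once val and err are resolved.
type entry[V any] struct {
	done chan struct{}
	val  V
	err  error
}

type futureCache[K comparable, V any] struct {
	mu   sync.Mutex
	data map[K]*entry[V]
}

func (c *futureCache[K, V]) Get(ctx context.Context, key K, load func(context.Context, K) (V, error)) (V, error) {
	c.mu.Lock()
	e, ok := c.data[key]
	if ok {
		c.mu.Unlock()
		<-e.done // wait for the in-flight (or already finished) load
		return e.val, e.err
	}
	// Publish the future before loading so that concurrent callers wait
	// on it instead of starting their own load; the map lock is held
	// only for this brief publication step.
	e = &entry[V]{done: make(chan struct{})}
	c.data[key] = e
	c.mu.Unlock()

	e.val, e.err = load(ctx, key)
	if e.err != nil {
		// Drop failed entries so that a later Get can retry the load.
		c.mu.Lock()
		delete(c.data, key)
		c.mu.Unlock()
	}
	close(e.done)
	return e.val, e.err
}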
Another quirk is when users think computes are transactions so they update other mappings as well. That can lead to deadlocks, livelocks, and hash table corruption. Detecting this at runtime is often too expensive and you’re back to hoping to make the common case good enough and warning about exceptional usages.
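A contrived sketch of that anti-pattern, using the Compute signature from the snippet above; whether it actually deadlocks depends on timing and on how the implementation locks buckets, so treat it as hypothetical:

// Anti-pattern: treating computes as transactions across multiple keys.
// Goroutine 1 updates "b" while holding the lock guarding "a"; goroutine 2
// does the reverse. If the two entries are guarded by different locks, the
// locks can be acquired in opposite orders and both goroutines block forever.
go cache.Compute("a", func(oldValue int, found bool) (newValue int, op otter.ComputeOp) {
	cache.Set("b", oldValue+1) // touching another mapping inside a compute
	return oldValue + 1, otter.WriteOp
})
go cache.Compute("b", func(oldValue int, found bool) (newValue int, op otter.ComputeOp) {
	cache.Set("a", oldValue+1)
	return oldValue + 1, otter.WriteOp
})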
If I understand correctly, singleflight should already act as a lock on the key, so the Compute approach can be replaced with this (group here is a shared singleflight.Group):

key := "key"
if value, ok := cache.GetIfPresent(key); ok {
	// Use the value if it is already in the cache.
	sum.Add(int64(value))
	return
}
value, err, _ := group.Do(key, func() (any, error) {
	// Someone else loaded the value.
	if value, ok := cache.GetIfPresent(key); ok {
		return value, nil
	}
	// Load and set.
	value, err := loader.Load(ctx, key)
	if err != nil {
		return value, err
	}
	cache.Set(key, value)
	return value, nil
})
if err != nil {
	panic("wtf?")
}
// Use the value.
sum.Add(int64(value.(int)))
Tested. This also works, not sure about the performance though.
It seems this code might lead to race conditions, as a user could call Get and Set simultaneously or call Delete during a Get operation.
Although... otter already knows how to handle these issues. It might actually work. I'll try testing it out on Monday.
Sorry I didn't reply right away. Unfortunately, there are some serious issues: the mechanism otter uses to prevent races with singleflight really gets in the way of strengthening the loader call guarantees :(. I'll make a few more attempts at a solution this week, but it's all very tricky at the moment.