
[BUG] Loader of the same key is called multiple times

Open mawngo opened this issue 4 months ago • 9 comments

Description

Multiple concurrent calls to the Get method for the same key can sometimes result in multiple load calls.

To Reproduce

Run the following code:

package main

import (
	"context"
	"github.com/maypok86/otter/v2"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	cache := otter.Must(&otter.Options[string, int]{
		MaximumSize:      1000,
		ExpiryCalculator: otter.ExpiryWriting[string, int](5 * time.Minute),
		OnAtomicDeletion: func(e otter.DeletionEvent[string, int]) {
			println(e.Cause.String())
		},
	})

	loader := otter.LoaderFunc[string, int](func(ctx context.Context, key string) (int, error) {
		defer println("Loaded")
		_, ok := cache.GetIfPresent(key)
		println("Key exists: ", ok)
		println("Loading...")
		return 1, nil
	})

	// Simulate concurrent load.
	ctx := context.Background()
	var wg sync.WaitGroup
	var sum atomic.Int64
	runs := 100_000
	for i := 0; i < runs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			i, err := cache.Get(ctx, "key", loader)
			if err != nil {
				panic("wtf?")
			}
			sum.Add(int64(i))
		}()
	}
	wg.Wait()

	if sum.Load() != int64(runs) {
		panic("wtf???")
	}

	// Expect "Loaded" to be printed once, however, sometimes it is printed multiple times:
	// Key exists:  false
	// Loading...
	// Loaded
	// Key exists:  false
	// Loading...
	// Loaded
	// Replacement
	// Key exists:  true
	// Loading...
	// Loaded
	// Replacement
	// Key exists:  true
	// Loading...
	// Loaded
	// Replacement
}

Expected behavior

Multiple concurrent calls to the Get method for the same key should load the key only once.

Environment

  • OS: Win 11
  • Go 1.25.1

Additional context

Increasing the number of goroutines (runs) makes the bug happen more frequently.

mawngo avatar Sep 06 '25 15:09 mawngo

Hi, it seems like this is quite expected behavior. The thing is, Get has similar semantics to singleflight, and since your example launches a lot of goroutines that perform a very small amount of work, some might start their execution at the moment when other goroutines have already finished. As a result, Get has no way of knowing that the goroutines were launched "simultaneously".

Here's your example, for instance, where otter is replaced with singleflight:

package main

import (
	"golang.org/x/sync/singleflight"
	"sync"
	"sync/atomic"
)

func main() {
	var (
		s        singleflight.Group
		mutex    sync.Mutex
		numCalls atomic.Int64
		m        = make(map[string]any)
		key      = "key"
		value    = 1
	)

	loader := func() (any, error) {
		defer println("Loaded")

		mutex.Lock()
		_, ok := m[key]
		if !ok {
			m[key] = value
		}
		mutex.Unlock()
		println("Key exists: ", ok)
		println("Loading...")
		numCalls.Add(1)
		return value, nil
	}

	// Simulate concurrent load.
	var wg sync.WaitGroup
	var sum atomic.Int64
	runs := 100_000
	for i := 0; i < runs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			i, err, _ := s.Do(key, loader)
			if err != nil {
				panic("wtf?")
			}
			sum.Add(int64(i.(int)))
		}()
	}
	wg.Wait()

	if sum.Load() != int64(runs) {
		panic("wtf???")
	}

	println(numCalls.Load(), sum.Load())
}

maypok86 avatar Sep 07 '25 06:09 maypok86

If you add a small delay (time.Sleep) to simulate a long computation, everything will work as you expect.

maypok86 avatar Sep 07 '25 06:09 maypok86

Thanks for your detailed explanation.

That behavior is expected in the case of singleflight.

But in the case of caching, I would expect the goroutine that performs the loading to set the value in the cache before returning, so that other goroutines calling Get do not need to perform the loading again.

  • Get calls during the load will wait for the load to finish.
  • Get calls after the load finished should already see the result in the cache so it never needs to touch the loader.

In fact, if you replace the Get with Compute for loading, then everything will work as expected.

    //...
    wg.Add(1)
    go func() {
	    defer wg.Done()
	    i, _ := cache.Compute("key", func(oldValue int, found bool) (newValue int, op otter.ComputeOp) {
		    if !found {
				// If the key does not exist, execute the loader.
			    v, _ := loader(ctx, "key")
			    return v, otter.WriteOp
		    }
		    return oldValue, otter.CancelOp
	    })
	    sum.Add(int64(i))
    }()
    //...

About the delay: in the internal test where I discovered this behavior, the loader actually has a 100ms sleep and still gets called multiple times.

mawngo avatar Sep 07 '25 08:09 mawngo

Actually, I don't know of any cache in Go that has cache stampede protection in the exact form you want. And the reason is quite simple.

Typically, cache.Get(ctx, key, loader) is implemented roughly as follows:

type Cache[K comparable, V any] struct {
	data         sync.Map
	singleflight singleflight.Group
}

// ...

func (c *Cache[K, V]) Get(ctx context.Context, key K, loader otter.Loader[K, V]) (V, error) {
	value, ok := c.data.Load(key)
	if ok {
		return value.(V), nil
	}

	// Some goroutine might reach here after singleflight.Do has already been executed for the required key, and in that case the loader will be called multiple times.
	value, err, _ := c.singleflight.Do(fmt.Sprint(key), func() (any, error) { // singleflight keys must be strings
		return loader.Load(ctx, key)
	})
	return value.(V), err
}

Why is it done this way and can it be fixed? To be honest, it's complicated.

In theory, one could try to check for the presence of the item in the cache before calling the loader (similar to how sync.Once works), but this leads to the problem of needing to acquire a lock on the entry for that specific key. Otherwise, the situation where the loader gets called twice would still be possible.

One could use a Compute method, as you have done, but this also has its issues:

  1. Compute always locks the entire bucket, which slows down retrievals for cache hits.
  2. Since it locks a hash table bucket, performance issues can arise if several cache misses happen to fall into the same bucket simultaneously. Yes, only a single bucket is locked, not an entire large shard as would be the case in other implementations. However, there's still a chance that a bucket contains, say, 4 entries, and if two cache misses for keys in that bucket occur at the same time, their loaders will be executed sequentially.

In the case of BulkLoad, solving such problems would require building an orchestrator of extreme complexity.

maypok86 avatar Sep 07 '25 09:09 maypok86

Well, to be honest, it's quite difficult to actually encounter the problems with singleflight or Compute in real life.

maypok86 avatar Sep 07 '25 09:09 maypok86

fwiw, Caffeine’s get(key, loader) does an optimistic getIfPresent that falls back to a locking compute. That has the unfortunate collision behavior that you mentioned, so long loads can halt eviction or delay other loads.

The simplest way to minimize that is to set a higher initialCapacity for the hash table (unrelated to the max size) to increase the hash bins used for locking. Alternatively, the async cache stores the future as the cached value, which allows for per entry locking and stampede protection of bulk loads. It’s a messy set of tradeoffs that surprises users and there’s no perfect solution.

Another quirk is when users think computes are transactions so they update other mappings as well. That can lead to deadlocks, livelocks, and hash table corruption. Detecting this at runtime is often too expensive and you’re back to hoping to make the common case good enough and warning about exceptional usages.

ben-manes avatar Sep 07 '25 10:09 ben-manes

If I understand correctly, singleflight already acts as a lock on the key, so the Compute approach can be replaced with this:

key := "key"
if value, ok := cache.GetIfPresent(key); ok {
	// Use the value if it is already in the cache.
	sum.Add(int64(value))
	return
}

value, err, _ := group.Do(key, func() (any, error) { // group is a shared singleflight.Group
	// Someone else loaded the value.
	if value, ok := cache.GetIfPresent(key); ok {
		return value, nil
	}
	// Load and set.
	value, err := loader.Load(ctx, key)
	if err != nil {
		return value, err
	}
	cache.Set(key, value)
	return value, nil
})

if err != nil {
	panic("wtf?")
}
// Use the value.
sum.Add(int64(value.(int)))

Tested. This also works, not sure about the performance though.

mawngo avatar Sep 07 '25 10:09 mawngo

It seems this code might lead to race conditions, as a user could call Get and Set simultaneously or call Delete during a Get operation.

Although... otter already knows how to handle these issues. It might actually work. I'll try testing it out on Monday.

maypok86 avatar Sep 07 '25 10:09 maypok86

Sorry I didn't reply right away. Unfortunately, there are some serious issues: the mechanism otter uses to prevent races with singleflight really gets in the way of strengthening the loader-call guarantees :(. I'll make a few more attempts at a solution this week, but it's all very tricky at the moment.

maypok86 avatar Sep 12 '25 07:09 maypok86