proposal: sync: add PLocalCache
Proposal Details
The issue
High-performance code that scales linearly with the number of CPU cores usually needs per-CPU caches for holding some per-CPU state, in order to avoid costly inter-CPU synchronization. The state can be re-computed at any time, but the computation may take additional CPU time and other resources, so it is more efficient to cache the computed state for each CPU core and re-use it.
sync.Pool can be used as a per-CPU cache, but it has the following issues in this context (illustrated by the sketch after this list):
- sync.Pool.Get() tries stealing an object from other Ps if the object is missing in the current P. This leads to costly inter-CPU synchronization, and the cost of this synchronization increases with the number of available CPU cores.
- sync.Pool.Put() may store multiple objects at the same P. This leads to excess memory usage when at most one object is needed per P. sync.Pool.Put() also triggers expensive inter-CPU synchronization if the P already contains an object.
- sync.Pool may drop cached objects at every GC cycle, so the caller needs to spend additional CPU time on re-creating the object.
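For context, this is roughly the sync.Pool-based pattern that runs into the issues above; parserState, newParserState, parse and result are hypothetical placeholders, and the snippet assumes the sync import:

```go
var statePool = sync.Pool{
	// New runs on every cache miss, including misses caused by the
	// pool being emptied at a GC cycle.
	New: func() any { return newParserState() },
}

func parse(s string) result {
	// Get may steal an object from another P when the local one is
	// missing, triggering cross-CPU synchronization.
	ps := statePool.Get().(*parserState)
	r := ps.parse(s)
	// Put may accumulate several objects on the same P even though
	// at most one is ever needed here.
	statePool.Put(ps)
	return r
}
```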
The solution
Add a sync.PLocalCache struct with the following API:
// PLocalCache caches per-P objects.
//
// Every P may have at most one object at any moment.
//
// PLocalCache is useful for implementing linearly scalable
// CPU-bound algorithms on multi-CPU architectures.
type PLocalCache struct { ... }
// Get returns the P-local object from c.
//
// nil is returned if there is no cached object for the current P.
//
// It is guaranteed that the returned object can be accessed only by the current goroutine,
// so there is no need to synchronize access to it.
//
// Return the object to the cache via Put() if it needs to be re-used next time.
func (c *PLocalCache) Get() any { ... }
// Put puts v into the P-local storage of c.
//
// v mustn't be used after the Put() call.
//
// There is no guarantee that the subsequent Get() call returns v.
func (c *PLocalCache) Put(v any) { ... }
Implementation details
sync.PLocalCache may be implemented in a way similar to sync.Pool, but without the following abilities (a minimal sketch follows this list):
- Stealing objects from other Ps. If the object is missing in the P-local storage, just return nil.
- Putting multiple objects into a single P-local storage. If Put() is called on a storage that already holds a P-local object, just ignore the new object.
- Periodic cleanup on every GC cycle. The number of cached objects in sync.PLocalCache is guaranteed not to exceed GOMAXPROCS, i.e. it is bounded, and it is expected that the user continuously accesses the cached objects, so there is little sense in periodically cleaning up the cache. All the cached objects will be removed after the corresponding sync.PLocalCache is destroyed by the garbage collector.
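Below is a minimal sketch of this design, assuming access to the runtime's internal goroutine pinning (sync.Pool reaches the same mechanism via the linknamed runtime_procPin/runtime_procUnpin). Everything here is illustrative: the pin stubs are not available to user code, and a real implementation would size its slots lazily so that the zero value stays usable:

```go
package plocal

import "runtime"

// procPin pins the calling goroutine to its P and returns the P's id;
// procUnpin releases the pin. Illustrative stubs for runtime internals.
func procPin() int { panic("illustrative stub for a runtime internal") }
func procUnpin()   { panic("illustrative stub for a runtime internal") }

// PLocalCache holds at most one object per P. Each slot is accessed
// only by a goroutine pinned to that P, so Get and Put never
// synchronize with other CPUs.
type PLocalCache struct {
	slots []any // one slot per P
}

func NewPLocalCache() *PLocalCache {
	return &PLocalCache{slots: make([]any, runtime.GOMAXPROCS(0))}
}

// Get returns the object cached for the current P, or nil; there is
// no stealing from other Ps.
func (c *PLocalCache) Get() any {
	pid := procPin()
	var v any
	if pid < len(c.slots) {
		v, c.slots[pid] = c.slots[pid], nil
	}
	procUnpin()
	return v
}

// Put caches v for the current P. It drops v if the slot is already
// occupied (at most one object per P) or if GOMAXPROCS has grown past
// the original slot count.
func (c *PLocalCache) Put(v any) {
	pid := procPin()
	if pid < len(c.slots) && c.slots[pid] == nil {
		c.slots[pid] = v
	}
	procUnpin()
}
```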
The property of having at most one P-local object in the cache narrows the applicability of the Get() ... Put() pattern to CPU-bound code without context switches (i.e. without IO, expensive syscalls or CGO calls). This minimizes the chance of a context switch between Get() and Put(), so the cached objects will be successfully re-used by this code. For example, sync.PLocalCache is a great fit for a scalable random number generator with per-P (aka per-CPU) state. It is also a great fit for various CPU-bound parsers, encoders and compressors with non-trivial state that can be cached per P.
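As a hedged sketch of the first use-case, here is a per-P random number generator built on the proposed API; rngState, newRNGState and the xorshift step are illustrative placeholders, and the snippet assumes the sync and time imports:

```go
type rngState struct{ x uint64 }

func newRNGState() *rngState {
	// Seed from the current time; the |1 keeps the xorshift state non-zero.
	return &rngState{x: uint64(time.Now().UnixNano()) | 1}
}

// next advances a simple xorshift generator; illustrative only.
func (r *rngState) next() uint64 {
	r.x ^= r.x << 13
	r.x ^= r.x >> 7
	r.x ^= r.x << 17
	return r.x
}

var rngCache sync.PLocalCache // the proposed API

// Uint64 returns a pseudo-random number using per-P state, so
// concurrent callers on different Ps never contend with each other.
func Uint64() uint64 {
	v := rngCache.Get()
	if v == nil {
		v = newRNGState() // cache miss: build fresh state
	}
	r := v.(*rngState)
	n := r.next()
	rngCache.Put(v)
	return n
}
```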
On the other hand, if the chance of a context switch between the Get() and Put() calls is high, then Get() will return nil most of the time. This forces the user's code to spend additional CPU time on object re-creation, and the re-created object will usually be dropped on the Put() call, since there is a high chance that another P-local object has already been put in the cache by concurrently running goroutines. In such cases it is better to use sync.Pool instead of sync.PLocalCache.
Example usage
type ScalableParser struct {
	state sync.PLocalCache
}

func (p *ScalableParser) Parse(s string) result {
	// Try to re-use parser state cached for the current P.
	v := p.state.Get()
	if v == nil {
		// Cache miss: construct fresh state.
		v = newScalableParserState()
	}
	ps := v.(*scalableParserState)
	r := ps.Parse(s)
	// Return the state to the cache so subsequent calls can re-use it.
	p.state.Put(v)
	return r
}
See also https://github.com/golang/go/issues/65104. Now I think it is better to provide a separate entity with clear semantics than to complicate the semantics of sync.Pool and/or try to cover multiple different cases efficiently with sync.Pool.
Generics
It may be good to provide a generics-based sync.PLocalCache[T any], but it is also OK to provide a non-generic implementation, to be consistent with sync.Pool.
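A sketch of what the generic signatures might look like; note that since T may be a non-nilable value type, Get cannot use nil to signal a miss and would need a second boolean result instead:

```go
type PLocalCache[T any] struct { ... }

// Get reports ok == false on a cache miss instead of returning nil.
func (c *PLocalCache[T]) Get() (v T, ok bool) { ... }

func (c *PLocalCache[T]) Put(v T) { ... }
```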
Related Issues and Documentation
- proposal: sync: add GetLocal() method to `sync.Pool` in order to return only P-local objects #65104
- sync: per-cpu storage #8281
Sounds like a dup of #8281.
This idea is interesting, but I think it stops just short of being able to cover many more use-cases, without much of an increase in complexity.
For example, with a hook into where two values are in conflict (similar to the Merge method here: https://github.com/golang/go/issues/18802#issuecomment-1884012504), and iteration over all the values, we can support additional use-cases like scalable counters and logging (locally cache buffers, and on conflict, just flush one of them).
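For instance, a scalable counter might look roughly like this, where LocalCache, its Merge field and the All iterator are all hypothetical extensions rather than anything proposed above (assumes the sync/atomic import):

```go
type counter struct{ n atomic.Uint64 }

var counters = LocalCache[*counter]{
	// Merge is the hypothetical conflict hook: fold one colliding
	// value into the other instead of dropping it.
	Merge: func(dst, extra *counter) *counter {
		dst.n.Add(extra.n.Load())
		return dst
	},
}

func Inc() {
	c, ok := counters.Get()
	if !ok {
		c = new(counter)
	}
	c.n.Add(1)
	counters.Put(c)
}

// Total sums every cached value via the hypothetical All iterator;
// values currently checked out between Get and Put are not counted.
func Total() (sum uint64) {
	for c := range counters.All() {
		sum += c.n.Load()
	}
	return sum
}
```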
Some other notes:
- I don't think we should make the concept of a `P` part of any public API; it's a concept that's internal to the runtime. I don't have a better name off the top of my head, but I'm sure there's a good one out there.
- I don't think there's a reason not to use generics for new APIs where it's appropriate (and I do think it's appropriate here). You can always make `any` your type parameter if that's what you want.
I think this proposal is closely related to #18802 actually, and not far from where I wanted to explore next with being able to access local values without synchronization.
> For example, with a hook into where two values are in conflict (similar to the Merge method here: https://github.com/golang/go/issues/18802#issuecomment-1884012504), and iteration over all the values, we can support additional use-cases like scalable counters and logging (locally cache buffers, and on conflict, just flush one of them).
I'd prefer leaving the P-local cache as simple as possible, so it solves the original issue described above: to provide an efficient and CPU-scalable mechanism for caching the state of various CPU-bound parsers and encoders. It is expected that the state may be lost at any time (for example, when GOMAXPROCS changes, when the goroutine is re-scheduled to another P, or when somebody forgets to return the state to the cache), so it can be re-constructed when needed.
Other use cases like https://github.com/golang/go/issues/8281 and https://github.com/golang/go/issues/18802 should be covered separately, since they are more complicated and have no clear solution yet.
This is one of those proposals that's tricky because it sits uncomfortably on the fence between specification and implementation detail.
As mentioned above, the concept of "P" is an implementation detail of the current runtime. Future versions of the runtime or alternative implementations of Go may not have any such concept, or may define it in a different way that makes this specific API harder to support.
In principle it could be defined more generally as a best-effort cache, where Get could return any value that was previously passed to Put and hasn't already been returned from another call, or could return nothing at all. That would give more room for the implementation details to evolve or change completely in the future. That definition also avoids implying to a naive reader that this could be used for anything beyond best-effort caching, such as the adjacent use-cases https://github.com/golang/go/issues/8281, https://github.com/golang/go/issues/73667, https://github.com/golang/go/issues/22476, etc., by avoiding specifying anything about how Get decides which of the previously-put values (if any) to return.
However, since the motivation for this is performance it doesn't seem very satisfying to be too loose in what guarantees are being made here. If someone relies on this to perform well in a specific situation and a future version begins to perform worse in that specific situation then they would consider it to be a regression regardless of what the documentation says.
With all of that said it does seem likely to me that any future implementation of the Go runtime that supports multiple CPUs[^1] would have some idea of spreading work efficiently across those CPUs, so perhaps it would be sufficient to go with a generic definition like I wrote above along with some additional commentary that it's intended to maximize CPU locality somehow, and then future changes to the implementation can be evaluated by whether they still do a good job of CPU locality even if the exact details of what that means need to change.
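Concretely, that looser contract might be documented along these lines (a sketch of wording, not proposed text):

```go
// Cache is a best-effort, CPU-local cache of values.
//
// Get returns some value that was previously passed to Put and has
// not yet been returned by another Get, or reports ok == false if no
// such value is readily available. The implementation chooses which
// value (if any) to return; it is intended to maximize CPU locality,
// but no particular placement or hit rate is guaranteed, and cached
// values may be dropped at any time.
type Cache[T any] struct { ... }

func (c *Cache[T]) Get() (v T, ok bool) { ... }

func (c *Cache[T]) Put(v T) { ... }
```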
Overall I think I like the idea of it and my concern is mainly in carefully hedging what exactly the name and documentation imply about how this ought to be used and how its behavior might evolve in future. It does seem a little esoteric in the sense that I expect most Go programs don't need to micromanage things to such an extent, but this also seems like something hard to implement outside of stdlib due to its close interaction with current runtime implementation details. 🤔
[^1]: by which I mean: not one of the more specialized implementations, like TinyGo, which might have quite different goals and use-cases.