
internal/bufferpool has no restrictions on growth, causing un-preventable OOMs

Open Groxx opened this issue 9 months ago • 11 comments

ToC:

  • What
  • Why
  • What do we do now

# What

Cadence has some somewhat-uncommon call patterns that seem to be causing major memory issues via yarpc when enough planets align:

  • We have many thousands of concurrent requests doing long-polls to acquire work (~1 minute each before it gives up and polls again)
    • These are mostly "small" responses, <0.5MB, and the graphs I'm looking at right now are averaging ~5KB.
  • We have occasional very large requests
    • These are essentially unbounded, but e.g. downloading a large workflow's history can very reasonably exceed tens of megabytes, and our replication-related requests are largely unlimited (they're tuned for throughput, e.g. hundreds of MB might be reasonable).
    • I'm not sure if we have any concrete numbers here, but I've seen single GC cycles release >10GB of memory. "large" can be safely assumed.

Internally, yarpc uses a shared global sync.Pool wrapper in internal/bufferpool: https://github.com/yarpc/yarpc-go/blob/7b06a1133668f3202bd63725db3c4254f0ce3c62/internal/bufferpool/bufferpool.go#L30
and it holds items for the full length of a unary request: https://github.com/yarpc/yarpc-go/blob/42ee4791e3485911edf07cc584d6dfd9ba5f847a/transport/tchannel/handler.go#L199-L200 and https://github.com/yarpc/yarpc-go/blob/7b06a1133668f3202bd63725db3c4254f0ce3c62/transport/grpc/handler.go#L176-L177
(Acquired when a call is received, defer-released after it finishes writing the response)
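Roughly, the lifecycle looks like this (a simplified sketch of the pattern linked above, not the actual yarpc handler code):

```go
package sketch

import (
	"bytes"
	"sync"
)

var pool = sync.Pool{New: func() interface{} { return &bytes.Buffer{} }}

// handle approximates the unary-handler pattern: the buffer is taken from the
// shared pool when the call arrives and only returned after the response has
// been written.
func handle(writeResponse func(*bytes.Buffer)) {
	buf := pool.Get().(*bytes.Buffer)
	buf.Reset()
	defer pool.Put(buf) // held for the whole request, ~1 minute for our long-polls

	// ... handler runs, possibly blocking for a long time ...
	writeResponse(buf) // one occasional large response grows buf "permanently"
}
```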

(Semi-related: I am curious how this behaves with streaming requests, as we are planning to switch to gRPC streaming eventually)

From pprof output, this bufferpool sometimes ends up dominating our inuse heap, and 48GB+ hosts can hit OOMs within about an hour of starting up from a single abnormal caller triggering a bad combination of calls.
A sample pprof during one of these instances shows ~38GB of inuse memory due to this pool, with many byte-slices that are well over 50MB (of the most-commonly-sampled ranges):

[pprof screenshot]

(Both in-bound arrows are from call stacks that pass through this bufferpool, not from other stdlib-bytes.Buffer usage. They're just not included in the screenshot because they are rendered very far away)


# Why

For Cadence, this buffer-holding behavior means thousands of buffers held for ~1 minute each, and roughly never having "idle" periods with few requests during which the pool could release these objects. Poll requests restart themselves essentially immediately in nearly all cases, so our concurrently-running request count has a roughly constant baseline in the tens of thousands.
When combined with our occasional large requests, I believe what's happening is:

  • A small buffer is created to handle a poll
  • This ~never gets GC'd, so when it handles a random "large" request it grows to match (several megabytes)
  • ... because it never gets GC'd, it keeps holding megabytes when it usually handles only a few kilobytes, essentially for the rest of the process's lifetime
  • ... and eventually all long-poll-using buffers go through this cycle, gradually growing them all (randomly).

Some may be freed after a burst of other requests complete of course, but this isn't something we can control and it doesn't seem to be able to keep up in some situations.
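To make the growth step concrete, here's a tiny standalone illustration (not yarpc code): once a buffer has served a single large response, Reset keeps its capacity, so it pins that memory whether it's checked out or sitting in the pool.

```go
package sketch

import (
	"bytes"
	"fmt"
)

func growthDemo() {
	var buf bytes.Buffer

	buf.Write(make([]byte, 50<<20)) // one 50MB response...
	buf.Reset()
	fmt.Println(buf.Cap()) // ...and the ~50MB backing array is still retained

	buf.Write([]byte("a ~5KB poll response would go here"))
	fmt.Println(buf.Cap()) // still ~50MB, even though only a few bytes are in use
}
```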

This is all probably made worse by sync.Pool's "it takes 2 GC cycles of no use to actually free memory" design, combined with infrequent GCs and lots and lots of requests.

Currently our only mitigation is to restart more frequently, identify the caller and whatever they're doing that causes the memory issues, and ask them to make changes. This often takes hours or days to fully address.


# What do we do now

I assume this pool CPU-benchmarked quite well compared to no buffer reuse, and it's a common and very reasonable thing to have, but CPU is not the only resource to be concerned about.
I kinda think being able to control this isn't optional, since the alternative is "the process does not continue running".

From a bit of exploring of the code, I'm really only seeing two high-level options:

  1. Provide some way to configure this shared global pool.
    • The most-minimal option here might just be a constant "max bytes to reuse" size param, which can be set with go build -ldflags "-X go.uber.org/yarpc/internal/bufferpool.MaxReuseBytes=123456", and either don't return things larger than that to the pool at all, or truncate the buffer to fit the max.
    • We could also add a mutex/atomic, a SetPool(...) func, and re-expose it somewhere publicly (since this is in internal).
    • There does not seem to be a simple way to build a "don't pool more than xMB total" since pools don't report when they evict data, but it might be possible now that weak pointers exist.
  2. Allow injecting a pool implementation, so different calls(/clients/servers/?) can have different pools (which could even include a noop impl)
    • For our purposes, per procedure might be needed, or we'd probably set a rather small max (likely 100KB or so). In the "does not rapidly crash" vs "saves a few %-CPU" equation, we will absolutely choose to prevent crashes.

1 might be reasonable as a short-term hotfix, and it'll fairly trivially have no effect on anyone who isn't doing clearly-unstable changes to that value. Safe and quick but also obviously not a "good" solution. (I personally lean towards the -ldflags route because it does not offer a new "public" API that could be assumed to be stable)
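For illustration, a minimal sketch of what option 1 could look like, assuming a hypothetical MaxReuseBytes string variable in internal/bufferpool that's only meant to be set via -ldflags -X (not the existing bufferpool API):

```go
package bufferpool

import (
	"bytes"
	"strconv"
	"sync"
)

// MaxReuseBytes is a string so it can be overridden at link time with
// go build -ldflags "-X go.uber.org/yarpc/internal/bufferpool.MaxReuseBytes=123456".
// Empty (or unparsable) means "no limit", i.e. today's behavior. Hypothetical knob.
var MaxReuseBytes string

var maxReuse, _ = strconv.Atoi(MaxReuseBytes)

var pool = sync.Pool{New: func() interface{} { return &bytes.Buffer{} }}

func Get() *bytes.Buffer { return pool.Get().(*bytes.Buffer) }

func Put(buf *bytes.Buffer) {
	if maxReuse > 0 && buf.Cap() > maxReuse {
		return // over the cap: drop it and let the GC reclaim it
	}
	buf.Reset()
	pool.Put(buf)
}
```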

2 seems like probably a lot of work. The pool is used in a LOT of places, through a LOT of very-different call stacks, and where to draw the line on "you can customize this instance with yarpc-option-X passed to call-Y" vs "there is no user-convenient way to get an argument into this call stack" will probably take some time to nail down.
Or maybe I'm just not familiar enough with the code and this will be reasonably simple (though clearly still changing quite a few lines of code).

Are there other options? Do you foresee any blockers in following option 2? We essentially need something, so we might be willing to do much of the implementation work, but it's not worth doing if the results won't be approved in the end.

Groxx avatar Feb 21 '25 21:02 Groxx

Hey,

Here are my thoughts.

Option 1

truncate the buffer to fit the max

I guess this is something we definitely shouldn't do, as the slice's underlying array won't free memory anyway.

Option 1.5

You basically proposed a sub-option: a custom pool implementation (with weak pointers). If I'm not missing something, it won't change a lot: the memory is already allocated, so it doesn't hurt to put it back into the pool (if the chunk itself is of a reasonable size). If Go needs memory, it'll reuse memory either from the pool, or from weak pointers, or from unused (unreferenced) slices.

Option 2

We probably don't need individual pools, but rather a list of pools with different object sizes that may be reused by different sources. For example, if we expect a smaller response, we use a pool with 0-100KB byte slices; for a bigger response we use a 100KB-10MB pool.
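Something like this, I mean (a rough sketch; the bucket sizes are just for illustration):

```go
package sketch

import (
	"bytes"
	"sync"
)

const (
	smallMax = 100 << 10 // 100KB bucket boundary (illustrative)
	largeMax = 10 << 20  // 10MB bucket boundary (illustrative)
)

var (
	smallPool = sync.Pool{New: func() interface{} { return &bytes.Buffer{} }}
	largePool = sync.Pool{New: func() interface{} { return &bytes.Buffer{} }}
)

// Get picks a pool based on an up-front size expectation, if the caller has one.
func Get(sizeHint int) *bytes.Buffer {
	if sizeHint > smallMax {
		return largePool.Get().(*bytes.Buffer)
	}
	return smallPool.Get().(*bytes.Buffer)
}

// Put returns the buffer to the bucket matching its actual capacity,
// and drops over-sized buffers entirely.
func Put(buf *bytes.Buffer) {
	buf.Reset()
	switch c := buf.Cap(); {
	case c <= smallMax:
		smallPool.Put(buf)
	case c <= largeMax:
		largePool.Put(buf)
	default:
		// bigger than any bucket: let the GC take it
	}
}
```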

The question is indeed about passing configuration and picking the right pool. Could you please elaborate on it: how can we determine the expected response size? Is it the outbound name, a request field, the request body content?

My theory

My guess is that the pool.New function is also a problem: it returns a slice with default capacity, which means the slice passes through multiple rounds of resizing to reach an appropriate size (like 10MB). It leaves a lot of garbage behind.

What you need is a min-max bucket: pool.New should return a slice with a good capacity (to fit 10MB, let's say), and yes, we should limit the size of the slice that's returned to the pool too.
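As a rough sketch (the values here are just for illustration):

```go
package sketch

import (
	"bytes"
	"sync"
)

const (
	minCap = 1 << 20  // pre-size new buffers (illustrative)
	maxCap = 10 << 20 // never keep anything bigger than this (illustrative)
)

var pool = sync.Pool{
	New: func() interface{} {
		b := &bytes.Buffer{}
		b.Grow(minCap) // skip the repeated small resizes mentioned above
		return b
	},
}

func Get() *bytes.Buffer { return pool.Get().(*bytes.Buffer) }

func Put(b *bytes.Buffer) {
	if b.Cap() > maxCap {
		return // over the max: drop it instead of pooling it
	}
	b.Reset()
	pool.Put(b)
}
```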

As it may be tricky to use multiple pools (as proposed above), we could try a min-max configuration for the single pool using ldflags, just to see whether it helps. It may cause a problem, though, if you have thousands of smaller requests (as we'll use a lot of memory by creating 10MB chunks). If so, we should start from my question above: how to decide what pool to use.

biosvs avatar Feb 24 '25 14:02 biosvs

1: Ah 🤦 good point. We could make a new buffer but that'd be kinda pointless (either a bunch of zero valued things which are the same as an empty pool but cost more to maintain, or wasted space if unused).

1.5: I mean more along the lines of "it is difficult to build any kind of logic that considers how much memory the pool holds", since we could in principle start not reusing large objects as the pool grows.

You can compute an upper bound (+len(buf) when returning to the pool, -len(buf) when getting from the pool) but there's essentially no way to know if the pool has released memory so that upper bound can be arbitrarily incorrect. Weak pointers would give us the ability to know when that happens, with some extra legwork, but I doubt it's worth actually going down this path.


Could you please elaborate on it: how can we determine the expected response size? Is it the outbound name, a request field, the request body content?

In our case, this is for inbound responses, not outbound. So our options are a bit more limited, because we can't really use any external information to improve our guesses :|

(we do make outbound requests, and alleviating the inbound issue will probably point to them next, but inbound is largely unavoidable)

Loose guesses can be made per endpoint. Some of them have a fairly tight upper bound and some of them are expected to be extremely widely varying. And if we could divide our pool behavior by long-poll-vs-normal it might be our best effort-vs-reward choice. But we really can't know with anything like "good enough" predictability AFAICT.

it returns a slice with default capacity, which means the slice passes through multiple rounds of resizing to reach an appropriate size (like 10MB).

Yeah, I was thinking about that too. Min+max makes sense as the absolute minimum if we have config at all. It's a bit understandable that it doesn't pre-size currently since it's zero-config... but then you run into "without config it probably shouldn't exist at all" since it's fundamentally unsafe.


If so, we should start from my question above: how to decide what pool to use.

TBH I think the answer needs to be "yarpc cannot make that decision, because no one decision works for all scenarios". E.g. we have very limited information in our inbound requests. Even after parsing the request data, there are plenty of cases where even a 4-orders-of-magnitude guess can't be known until we check our database, and I see no reason to expect that other services wouldn't be similar.

So I think there are two goals here, and I'm only seeking one right now:

  1. what is a good default pool strategy
  2. how can we allow users to control the pool in exceptional situations

1 is an evergreen project, and might be worth a lot of experimentation. This would be very interesting but is not at all what I'm aiming for now.

2 is needed right now, and is still needed even if 1 makes a massive improvement on the default behavior because no one strategy works for everyone.

At the extremes, 2 is probably either "you can disable the pool entirely" or "provide a replacement impl with type Get func(call_ctx context.Context) (buf *bytes.Buffer, release func())" (which can trivially wrap a sync.Pool if desired)
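i.e. something shaped roughly like this (the helper names are hypothetical, not an existing yarpc API):

```go
package sketch

import (
	"bytes"
	"context"
	"sync"
)

// Get hands out a buffer for the current call plus a release func to return it.
type Get func(ctx context.Context) (buf *bytes.Buffer, release func())

// FromSyncPool trivially wraps a sync.Pool (whose New must return *bytes.Buffer),
// roughly matching today's behavior.
func FromSyncPool(p *sync.Pool) Get {
	return func(ctx context.Context) (*bytes.Buffer, func()) {
		buf := p.Get().(*bytes.Buffer)
		buf.Reset()
		return buf, func() { p.Put(buf) }
	}
}

// NoPool never reuses anything, for endpoints where uncontrolled retention is
// worse than the extra GC pressure.
func NoPool(ctx context.Context) (*bytes.Buffer, func()) {
	return &bytes.Buffer{}, func() {}
}
```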

Groxx avatar Feb 24 '25 19:02 Groxx

Note this task is the same as https://github.com/golang/go/issues/23199, right?

An interesting option would be something like https://go-review.googlesource.com/c/go/+/471200 - given it's stdlib, we could probably just copy it 1-1?

https://cs.opensource.google/go/go/+/refs/tags/go1.24.0:src/net/http/server.go;l=832 is also interesting, with built-in pool sharding

What you need is a min-max bucket: pool.New should return a slice with a good capacity (to fit 10MB, let's say), and yes, we should limit the size of the slice that's returned to the pool too.

Seems generally reasonable.

rabbbit avatar Feb 24 '25 21:02 rabbbit

@rabbbit maybe... segmented buffers in particular might be reasonable, since it can leave the larger objects un-claimed in the pool until a large response begins (we don't stream anything, so we'd only see this at the end of a request, not for the full lifetime). And since I'd expect large-encoding CPU use to be way higher than the slice maintenance overhead it adds, that'd probably be a good tradeoff.

This might still cause some issues for streaming though, unless we're careful. E.g. it probably needs to return its buffers after each response chunk, or the possibly-much-longer-than-1m connections seem likely to just grow to meet the largest chunk size, and that might be even worse than our polling is currently causing.

So I guess:

  • Yes, similar problems. Any global cache is prone to this, and sync.Pool has quite well known problems with varying size items, this is basically just another instance.
    • There are some differences, as e.g. inbound calls hang for indeterminate (and sometimes "long") amounts of time, while e.g. encoding/json can generally be expected to not do that since MarshalJSON/etc doesn't have a context arg.
  • Yes, similar strategies seem entirely reasonable for yarpc's default behavior. In particular it's basically the same as http/server, and e.g. one of their pools simply discards everything that changed size.
  • No, I think segmented-like probably isn't sufficient.
    • It might be for Cadence, since we essentially hang and then return everything at once at the end, but it's not hard to come up with a hypothetical service that would still have problems. That might even include future-Cadence, for streaming.
    • Careful streaming-friendly memory releasing would probably reduce the risk here quite a lot, but I'm still really not fond of uncontrolled memory :\ And it feels like the sort of thing that's easy to miss when an issue is introduced, and only comes up months later when something crashes. A user-controlled "don't cache this endpoint" impl is trivially safe from this kind of risk.

The encoding/json uncontrolled global buffers are, I think, less bad: if they remain in memory, that implies large objects are actively being encoded, which means there is other large data hanging around anyway. It also caps at 65KB (in buffer.go), which is relatively small.
Anything that doesn't follow that "short-lived retention" seems risky to me, and that's what's happening right now - a small request might reserve a very large buffer (MB) for a long time (1m).

(tbh I'm not fond of encoding/json's pools either, but without more-caller-controllable encoding APIs in Go I don't think they have much of an option - anyone doing the default json.Marshal(...) nested inside an ongoing Marshal will lose caller context, and there's no mechanism to prevent that at the moment)

Groxx avatar Feb 24 '25 22:02 Groxx

Well 🤔

After sleeping on it a bit, would it be essentially solved if we sharded by size (this seems important regardless) with a reasonable cap (64KB like encoding/json?) and didn't actually acquire anything from the pool until writes occurred, and returned them immediately after? It'd still be leaving memory laying around like all pools do, but with only short claims it might be dramatically smaller and with more "idle" periods to allow freeing.

I'm not familiar enough with the response serializing, but if it's generally done in ~1 large write (and some internal response header additions, which could predict the size needed) that'd probably be pretty safe. Streaming many small messages might suffer a bit, but that'd also be rapidly (re)using the pool so it'd probably mostly reuse memory.

Kinda seems like deferring that pool-Get could make this into roughly the same situation as encoding/json. So still a bit dubious but relatively defensible if the stdlib does it.

Groxx avatar Feb 25 '25 18:02 Groxx

and didn't actually acquire anything from the pool until writes occurred, and returned them immediately after? It'd still be leaving memory laying around like all pools do, but with only short claims it might be dramatically smaller and with more "idle" periods to allow freeing.

You mean we don't return anything from the pool until someone has returned data to the pool? I'm probably not fully following you here.

Paswel showed a couple of nice examples of how the pool problem is addressed in the stdlib, and I'm actually leaning toward the http lib approach: it's very simple and clear. The marshaller implementation seems more useful for marshaller internals than for its clients.

Here's my proposal: I can implement the http lib approach as a "pool v2" kind of thing, and we'll switch services to this pool using ldflags. That will allow us to test this approach on some services before we migrate everyone.

We'll give it more thought; we can probably gather some statistics about request/response sizes in advance to make a decision about bucket sizes.

biosvs avatar Mar 04 '25 16:03 biosvs

You mean we don't return anything from the pool until someone has returned data to the pool? I'm probably not fully following you here.

I mean something like this sequence:

  1. request starts, gets a "buffer helper" (no sync.Pool data yet)
  2. request runs... (still no sync.Pool data)
  3. request writes some data to the buffer-helper:
    • 3a. buffer-helper gets a sync.Pool buffer to store the data
    • 3b. needs more, so buffer-helper gets a bigger sync.Pool buffer (either http-lib style "copy to larger buffer, discard smaller" or the segmented buffers are probably fine)
  4. write finishes, so old no-longer-needed sync.Pool buffers are returned (e.g. http-lib style. segmented buffers might not free anything)
    • probably only if they match the original size, else they keep growing
  5. yarpc flushes to network, reading from the buffer-helper, freeing/returning buffers if possible
    • I'm assuming that a streaming handler would do 3-5 repeatedly, so buffers held should drop to ~0 here.
  6. request finishes, buffer-helper is closed and any remaining buffers are returned to the sync.Pool

^ with the important part being that sync.Pool entries aren't acquired until the moment a write begins, and are returned ASAP. that way pool entries aren't held beyond when they're truly needed, which is the main current problem.

"pool entries keep growing larger" is a problem (segmented / httplib-style pools would solve that), but it's massively worsened by holding the pool entries for, like, thousands of times longer than needed. that means we have many times more buffers in memory than we are using, and things spend far, far less time in the pools where they are eligible to GC (they need to remain unused for 2 GC cycles).

Groxx avatar Mar 06 '25 01:03 Groxx

So, basically, "hold memory from the pool for as little time as possible". Got it.

I'll fix another problem (#2366) and will also do some improvements and benchmarks on our internal pool (there's a chance we'll get rid of it completely, as grpc has its own now).

biosvs avatar Mar 06 '25 20:03 biosvs

If the changes prove helpful: I might suggest something similar go into Zap, as we occasionally see fairly high memory use in its pool too, and it follows the same un-bucketed approach: https://github.com/uber-go/zap/blob/master/buffer/pool.go

... though in a pprof I'm looking at now, all I see are 1kb allocations so maybe that's fine? but possibly a TON of them (450k) which seems strange too. I might need to check that with a system I understand better.

Groxx avatar Mar 21 '25 21:03 Groxx

Well. Happened again, and this time the outbound-unary proto bufferpool was involved too:

Our inbounds were doing the same thing:

[pprof screenshot]

But we also had this outbound stack taking up a third of the memory, even after all requests were stopped:

[pprof screenshot]

It's not directly visible in that stack, but:

Call gets a buffer for serialization and defers cleanup around the outbound call in the same way: https://github.com/yarpc/yarpc-go/blob/e63ac5fdfc2b07aa5fa8004cd2820bbf96c776e1/encoding/protobuf/outbound.go#L77-L87

Which calls buildTransportRequest -> marshal, for proto in our call: https://github.com/yarpc/yarpc-go/blob/e63ac5fdfc2b07aa5fa8004cd2820bbf96c776e1/encoding/protobuf/outbound.go#L154

marshalProto gets a github.com/gogo/protobuf/proto.Buffer from the pool: https://github.com/yarpc/yarpc-go/blob/e63ac5fdfc2b07aa5fa8004cd2820bbf96c776e1/encoding/protobuf/marshal.go#L99-L107

It is reset immediately: https://github.com/yarpc/yarpc-go/blob/e63ac5fdfc2b07aa5fa8004cd2820bbf96c776e1/encoding/protobuf/marshal.go#L119-L123 but that Reset does nothing except rewind the slice header, effectively repeating the not-size-limited-or-bucketed pool problem with a couple of additional steps: https://github.com/gogo/protobuf/blob/v1.3.1/proto/lib.go#L362-L366

```go
// Reset resets the Buffer, ready for marshaling a new protocol buffer.
func (p *Buffer) Reset() {
	p.buf = p.buf[0:0] // for reading/writing
	p.index = 0        // for reading
}
```

And, like our inbounds, our outbounds can have huge numbers of long-idle concurrent requests. For our frontend service in particular, this happens nearly 1:1 with the inbound requests because it's largely just proxying to a consistent hash ring destination. So any large request semi-permanently bloats a buffer that is held for far longer than necessary.

So this pool also needs to check cap(buf.Bytes()) on the *proto.Buffer in putBuffer to choose where it goes / whether it should be discarded for being over-sized.
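E.g. something along these lines (the cap and the pool wiring are illustrative assumptions, not the actual putBuffer implementation):

```go
package sketch

import (
	"sync"

	"github.com/gogo/protobuf/proto"
)

const maxPooledCap = 64 << 10 // illustrative cap on what gets pooled

var bufPool = sync.Pool{New: func() interface{} { return &proto.Buffer{} }}

func getBuffer() *proto.Buffer {
	return bufPool.Get().(*proto.Buffer)
}

func putBuffer(b *proto.Buffer) {
	// Reset only rewinds the slice header; the backing array keeps its
	// capacity, so check it before returning the buffer to the pool.
	if cap(b.Bytes()) > maxPooledCap {
		return // drop over-sized buffers instead of keeping them forever
	}
	b.Reset()
	bufPool.Put(b)
}
```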

This all likely applies to the v2 marshal pool too, though I haven't checked that sequence in as much detail: https://github.com/yarpc/yarpc-go/blob/e63ac5fdfc2b07aa5fa8004cd2820bbf96c776e1/encoding/protobuf/v2/marshal.go#L135-L14

Groxx avatar Mar 27 '25 01:03 Groxx

This caused a bunch of instability for us in Prod yesterday and today; we might need to request a bump in priority.

davidporter-id-au avatar Apr 01 '25 17:04 davidporter-id-au