
transport: reuse memory on read path

Open MakMukhi opened this issue 7 years ago • 51 comments

One possible idea is to use a leaky buffer.
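
For reference, a leaky buffer in the Effective Go sense is a bounded free list built on a buffered channel. A minimal sketch (names and sizes here are illustrative, not a proposed API):

package leaky

// freeList is a leaky buffer: a bounded free list of byte slices. When the
// list is full, returned buffers are simply dropped and left for the GC.
var freeList = make(chan []byte, 128)

// getBuf returns a buffer from the free list, or allocates a fresh one.
func getBuf() []byte {
	select {
	case b := <-freeList:
		return b
	default:
		return make([]byte, 16*1024) // e.g. one HTTP/2 frame's worth
	}
}

// putBuf returns a buffer to the free list, dropping it if the list is full.
func putBuf(b []byte) {
	select {
	case freeList <- b:
	default:
	}
}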

MakMukhi avatar Aug 22 '17 20:08 MakMukhi

Specific areas where we would like to do this:

  • Codec output buffer (requires codec API extension)
  • Reading from transport (I believe this is #869)

dfawley avatar Aug 22 '17 21:08 dfawley

Is someone actively working on this? I've been playing around with different buffer pool strategies, but I don't want to dig much deeper if I'm just duplicating work.

I've been trying a bucketed []byte buffer pool where there are tiers of buffer sizes. For example, if you ask for a 700-byte buffer, it might pull a buffer from the 1024-byte pool, slice it down to 700 bytes, and return it.

I'm trying a *sync.Pool implementation (one *sync.Pool per buffer size bucket) and a leaky-buffer implementation (one buffered chan []byte per buffer size bucket). They certainly help, but neither one has plucked all the low-hanging garbage fruit on the first try.
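
To make the bucketed approach concrete, a rough sketch of the sync.Pool-per-size-class variant might look like this (size classes and the Get/Put API are illustrative only):

package bufpool

import "sync"

// Power-of-two size classes from 256 bytes up to 16KB.
var sizes = [...]int{256, 512, 1024, 2048, 4096, 8192, 16384}

var pools = func() [len(sizes)]*sync.Pool {
	var ps [len(sizes)]*sync.Pool
	for i, s := range sizes {
		size := s // capture the per-iteration value for the closure
		ps[i] = &sync.Pool{New: func() interface{} { return make([]byte, size) }}
	}
	return ps
}()

// Get returns a buffer of length n backed by the smallest size class that
// fits; requests larger than the biggest class fall back to a plain make.
func Get(n int) []byte {
	for i, s := range sizes {
		if n <= s {
			return pools[i].Get().([]byte)[:n]
		}
	}
	return make([]byte, n)
}

// Put returns a buffer to its size class; buffers of other sizes are dropped.
func Put(b []byte) {
	for i, s := range sizes {
		if cap(b) == s {
			pools[i].Put(b[:s])
			return
		}
	}
}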

muirdm avatar Dec 02 '17 00:12 muirdm

I would prefer that the codec object itself is kept in a sync.Pool and is assumed to be unsafe for concurrent use. That way the codec itself can optimize its buffer use. The codec that I would like to use with grpc is the one I have implemented in github.com/gogo/protobuf/codec

awalterschulze avatar Dec 02 '17 09:12 awalterschulze

Here is the correct link https://github.com/gogo/protobuf/blob/master/codec/codec.go

awalterschulze avatar Dec 02 '17 09:12 awalterschulze

There is notable garbage produced outside of proto marshaling in the http2 transport, so I was hoping to address that too. I was looking at more generic []byte buffer reuse since it could be used both by the existing codec and by the http2 transport.

How is pooling the codec better than just pooling the []byte slices? The existing codec could pull its []byte from the pool, and also implement gogoproto-style optimizations taking advantage of Size() and MarshalTo() as appropriate.
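
For illustration, a codec Marshal along those lines might look roughly like this, assuming the message implements the gogo-style Size()/MarshalTo() methods; getBuf stands in for whatever pooled-buffer getter ends up being used (it is hypothetical, shown as a plain allocation so the sketch is self-contained):

package codec

import "github.com/golang/protobuf/proto"

// sizer is satisfied by gogo/protobuf-generated messages; it lets the codec
// marshal directly into a caller-supplied buffer.
type sizer interface {
	Size() int
	MarshalTo(data []byte) (int, error)
}

// getBuf is a placeholder for a pooled-buffer getter such as the bucketed
// pool sketched earlier; here it just allocates.
func getBuf(n int) []byte { return make([]byte, n) }

type pooledCodec struct{}

func (pooledCodec) Marshal(v interface{}) ([]byte, error) {
	if m, ok := v.(sizer); ok {
		buf := getBuf(m.Size())
		n, err := m.MarshalTo(buf)
		if err != nil {
			return nil, err
		}
		return buf[:n], nil
	}
	// Fallback for messages without Size/MarshalTo.
	return proto.Marshal(v.(proto.Message))
}

func (pooledCodec) Unmarshal(data []byte, v interface{}) error {
	return proto.Unmarshal(data, v.(proto.Message))
}

func (pooledCodec) String() string { return "proto" }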

muirdm avatar Dec 02 '17 12:12 muirdm

That's interesting. I didn't think the http2 transport could also benefit from buffer pooling. At the previous company I worked at, we had a similar buffer pooling scheme for a different purpose. The API takes a requested buffer size and returns a byte buffer of that size. It does this by looking through its slice of buffers for one that is big enough and returning it resized down to the requested size. When buffers are returned they are stretched back to their original capacity. This buffer pool also had retention policies, which were a bit more complex, but implemented orthogonally to the buffer pool.

awalterschulze avatar Dec 02 '17 13:12 awalterschulze

My thinking for the codec was to add an optional method to it that, if implemented, would be called when the buffer it returns from Marshal is no longer needed by grpc. Then the codec could re-use that memory. Without something like this, it's impossible for the codec to safely reuse old memory buffers.
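
A sketch of that optional-method idea; the interface and function names here are made up for illustration, not a settled API:

package codecsketch

// bufferReturner is an optional interface a Codec may implement. If it does,
// grpc would call ReturnBuffer with the slice previously returned by Marshal
// once the transport is done writing it, so the codec can recycle that memory.
type bufferReturner interface {
	ReturnBuffer(buf []byte)
}

// returnMarshalBuffer would be called on the send path after the marshaled
// message has been written to the wire.
func returnMarshalBuffer(codec interface{}, data []byte) {
	if br, ok := codec.(bufferReturner); ok {
		br.ReturnBuffer(data)
	}
}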

The http2 transport uses byte buffers to hold data it reads from the network; this can and should be re-used with a pool as well.

dfawley avatar Dec 02 '17 22:12 dfawley

My thought was instead of returning the buffer to the codec, return the buffer to the generic buffer pool instead. Then anyone can pull a buffer from the pool, and consumers of the buffers can put them back in the pool when they are done. It is safe to never return a buffer, or even to return a buffer that you did not get from the pool originally. On the other hand it is very unsafe to return a buffer and then still use it, or a slice of it. We would need some approach to make ourselves confident that use-after-release doesn't happen.

muirdm avatar Dec 02 '17 22:12 muirdm

What are other languages doing to optimize this situation? Maybe we can learn something from C++ or even Java.

awalterschulze avatar Dec 03 '17 09:12 awalterschulze

The lowest-hanging fruit, IMHO, is to reuse the memory buffer allocated to copy bytes out of a data frame, as @dfawley suggested. https://github.com/grpc/grpc-go/blob/cd563b81ec334abe62cda34a33b8640ea67ef5ba/transport/http2_client.go#L869

This buffer can be recycled once the application has read it and copied it into its preallocated chunk for unmarshaling. https://github.com/grpc/grpc-go/blob/cd563b81ec334abe62cda34a33b8640ea67ef5ba/transport/transport.go#L143

We can have a few sync.Pools of different sizes, increasing in, say, powers of 4 and capped at http2MaxFrameLen. It would also be great if the API made the base (4 in this example) configurable so we can try different values with our benchmarks.

One problem with sync.Pool is that it drops all its memory during every GC cycle. In follow-up efforts we can work with the Go team to resolve that. Using a leaky buffer comes with the onus of doing our own GC cycles for that memory; I don't think that's a good idea.

Things that C++ and Java do don't necessarily apply to Go. C-core takes a user-configurable parameter to cap the maximum amount of memory used by the binary, and starts releasing memory as pressure builds up. So it's essentially a library-specific GC.

Java, IIRC, allocates memory directly from the OS and keeps reusing it without ever freeing it. This memory is not freed by the GC either, because it's outside the JVM's jurisdiction.

I personally think that in gRPC-Go we shouldn't go about allocating a big chunk and doing our own memory management on top of it. We should at least give sync.Pool a chance and see if it can be optimized to fit our use case.

MakMukhi avatar Dec 04 '17 18:12 MakMukhi

Fair enough. It would be great to compare sync.Pool with classic buffer reuse.

awalterschulze avatar Dec 04 '17 19:12 awalterschulze

sync.Pool is not enough, as this issue points out: https://github.com/golang/go/issues/23199. Pooling buffers of unbounded capacity leads to worse GC behavior or memory overhead. Maybe a chunked-capacity pool/buffer would be better.

Zeymo avatar Dec 21 '17 02:12 Zeymo

@Zeymo Thanks for the suggestions. We are aware that sync.Pool does not return the best-fitting buffer. In the suggestion above I mention using buckets of sync.Pools with different sizes. However, as an exercise it may be interesting to see how a single sync.Pool would perform for the transport layer, since we know no single buffer can be greater than http2MaxFrameLen, which is 16KB for now.

MakMukhi avatar Dec 21 '17 19:12 MakMukhi

@MakMukhi

I've got another idea to reuse the buffer for a stream.

  1. Add a buf field to parser:

type parser struct {
	r      io.Reader
	header [5]byte
	buf    []byte
}

  2. Reallocate buf only when the requested length exceeds its capacity, then reslice:

if length > cap(p.buf) {
	p.buf = make([]byte, int(length))
}
msg = p.buf[:length]

What do you think?
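
For context, a fuller sketch of how recvMsg might use the reused buffer. This simplifies away the max-receive-message-size check and error wrapping that the real parser does, and note that the returned msg aliases p.buf, so the caller must be done with it before the next recvMsg:

package grpc

import (
	"encoding/binary"
	"io"
)

type payloadFormat uint8

type parser struct {
	r      io.Reader
	header [5]byte
	buf    []byte // reused across messages on this stream
}

func (p *parser) recvMsg() (pf payloadFormat, msg []byte, err error) {
	if _, err := io.ReadFull(p.r, p.header[:]); err != nil {
		return 0, nil, err
	}
	pf = payloadFormat(p.header[0])
	length := binary.BigEndian.Uint32(p.header[1:])
	if length == 0 {
		return pf, nil, nil
	}
	// Grow the reusable buffer only when this message is larger than any
	// previously seen on the stream; otherwise reslice what we already have.
	if int(length) > cap(p.buf) {
		p.buf = make([]byte, int(length))
	}
	msg = p.buf[:length]
	if _, err := io.ReadFull(p.r, msg); err != nil {
		return 0, nil, err
	}
	return pf, msg, nil
}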

coocood avatar Dec 23 '17 13:12 coocood

@coocood

For your second code snippet, I think you'd actually want to compare against the buffer's capacity rather than its length. You can slice up to the capacity of a buffer, not just its length. The gob package does something very similar:

https://github.com/golang/go/blob/f3f507b2d86bfb38c7e466a465d5b6463cfd4184/src/encoding/gob/decode.go#L64

That way, if you need buffers of size X, X-1, and then X, you aren't allocating twice.

xenoscopic avatar Dec 24 '17 06:12 xenoscopic

@havoc-io snippet updated, thanks.

coocood avatar Dec 24 '17 06:12 coocood

@coocood This actually seems like a good idea for reusing buffers at the gRPC layer (perhaps a variation of it might work at the transport layer as well). It's simple and doesn't require global shared pools. The only thing that worries me is the case where we have long-lived streams with intermittent reading/writing of large messages (say >=1MB). In that case the stream will carry a large buffer for its lifetime, and if there were thousands of such streams we'd be in serious trouble. Perhaps this problem can be solved by freeing the buffer every once in a while from the stream monitor loop here: https://github.com/grpc/grpc-go/blob/7f2472bbc6ac269af51a9c47ae8ef8c625383e1f/stream.go#L283

MakMukhi avatar Dec 27 '17 16:12 MakMukhi

@MakMukhi How about just adding a CallOption to disable the buffer reuse for that extreme use case? Freeing the buffer from another goroutine adds a lot of complexity.

coocood avatar Dec 28 '17 15:12 coocood

@coocood I like that idea too. You're right, accessing the buffer from two different goroutines would require locks, which is additional performance overhead. A CallOption seems like a good way to go about it. I'm not actively working on memory reuse right now, though it's high on my list of priorities. If you want to take a stab at it, go ahead; otherwise I'll hopefully get to it in Q1 2018. Also, thanks for these great ideas. :)

MakMukhi avatar Dec 28 '17 15:12 MakMukhi

@MakMukhi Great! I'll send a PR in a few days.

coocood avatar Dec 28 '17 15:12 coocood

What about pooling transport.Options and transport.CallHdr?

Zeymo avatar Jan 05 '18 02:01 Zeymo

@MakMukhi any update on the reworked stream flow to reduce buffer use, as mentioned here? We've noticed an issue where bombarding a stream endpoint with small messages causes lots of memory to be allocated in recvBuffer.backlog by HandleStreams(). Since our handler function cannot pull messages off the backlog quickly enough, memory usage balloons until the server is killed.

It'd be nice if the buffer were at least a fixed length so that we could block the client's sends until there was room in the buffer. As it stands now, the server doesn't have any way of throttling the client, and we have to rely on the client to send at a manageable rate.

scotthew1 avatar Mar 07 '18 21:03 scotthew1

Hey @scotthew1,
Sorry about the late response. I have written a design doc for this optimization, but I haven't started its implementation yet because we are working on a major refactoring of our transport. This change will follow that refactor, and I anticipate about four more weeks before it is merged.

MakMukhi avatar Mar 19 '18 21:03 MakMukhi

No problem @MakMukhi, thanks for the response. I look forward to the change!

scotthew1 avatar Mar 20 '18 19:03 scotthew1

Following the aforementioned design, I have created #1987. Once this is merged, we will experiment with reusing the buffer.

MakMukhi avatar Apr 12 '18 00:04 MakMukhi

@MakMukhi It looks like #1987 was reverted by #2049. Are there plans for a different strategy?

xenoscopic avatar May 08 '18 07:05 xenoscopic

@havoc-io Yes, soon after we finished the implementation we identified potential security issues with that approach. Currently we have the following ideas for getting rid of the extraneous memory, and our efforts are focused on identifying which one is most feasible in the short term.

  1. Update the framer library to not reuse a frame's buffer until told to.

  2. Update proto deserialization to be able to work on multiple buffers.

  3. Use our own framing implementation so we can use scatter/gather on the underlying net.Conn.

MakMukhi avatar May 08 '18 21:05 MakMukhi

I've opened an issue on the golang/protobuf repo to provide an API, proto.UnmarshalV, to deserialize from multiple buffers, which is point 2 in the previous post. I'll be working on it further to hopefully provide an implementation for it.
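
For the record, a purely hypothetical sketch of what such an UnmarshalV might look like; the naive body below copies everything into one buffer, which is exactly what a real zero-copy implementation would avoid:

package protoutil

import "github.com/golang/protobuf/proto"

// UnmarshalV deserializes a message whose wire bytes are split across
// several buffers (e.g. one per HTTP/2 data frame). This placeholder
// concatenates and calls proto.Unmarshal; the real goal is to parse
// directly across the buffers without the copy.
func UnmarshalV(bufs [][]byte, m proto.Message) error {
	total := 0
	for _, b := range bufs {
		total += len(b)
	}
	joined := make([]byte, 0, total)
	for _, b := range bufs {
		joined = append(joined, b...)
	}
	return proto.Unmarshal(joined, m)
}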

MakMukhi avatar May 16 '18 23:05 MakMukhi

Just as an input for discussion, this is the allocation flame graph for an API gateway in our system; as you can see, almost half of the allocations are due to gRPC opening connections to the backends:

[allocation flame graph omitted]

GC is using around ~10% of overall CPU time, so if gRPC caused significantly fewer allocations it would likely decrease overall CPU usage by ~4%.

CAFxX avatar Apr 26 '19 04:04 CAFxX

@dfawley Can we reopen my old PR https://github.com/grpc/grpc-go/pull/1774?

coocood avatar Apr 26 '19 05:04 coocood