
inbound decode path: investigate high memory usage

Open: venkat1109 opened this issue 5 years ago • 5 comments

Recently, I was investigating high memory usage in production and captured the attached heap profile on some of the affected hosts. The host wasn't doing much other than handling about 100-200 requests/second, yet it was consuming about 4-5 GB of memory and climbing.

(attached heap profile: compute4130-heap)

Without digging deeper, the root cause appears to be that the decode buffer is held for the duration of the RPC call: https://github.com/yarpc/yarpc-go/blob/ff8d19f1accd1b94e829b698d205a7258f82ddc9/encoding/thrift/inbound.go#L50
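For context, a minimal sketch of the general pattern (illustrative, with assumed names; not the actual yarpc code): the full request body is buffered in memory, and because ThriftRW decodes lazily, the returned wire.Value keeps pointing into that buffer, so it can't be collected until the handler is done.

```go
// Illustrative sketch only (assumed names; not the actual yarpc code).
package sketch

import (
	"bytes"
	"io"
	"io/ioutil"

	"go.uber.org/thriftrw/protocol"
	"go.uber.org/thriftrw/wire"
)

func decodeRequestBody(body io.Reader) (wire.Value, error) {
	// The entire payload is read into memory up front.
	buf, err := ioutil.ReadAll(body)
	if err != nil {
		return wire.Value{}, err
	}
	// Decode is lazy: strings and collections in the returned value
	// reference buf rather than copying it, so buf stays reachable for
	// as long as the handler holds the decoded value, i.e. for the
	// duration of the RPC.
	return protocol.Binary.Decode(bytes.NewReader(buf), wire.TStruct)
}
```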

Based on an initial discussion with Prashant, the fix probably isn't simple because of the lazy decoding of the Thrift payload. Creating this issue for further investigation.

venkat1109 · May 20 '19

Took a look at this with Prashant and Abhinav offline; this would be pretty complicated to fix, unfortunately. Thrift does lazy decoding, as you mentioned, and there's no way to signal from the encoding layer back up to the transport layer that the encoding layer is "finished" with the bytes.

A super hacky solution would be to pass a "finisher" function from the TChannel layer onto the context, and pull that function out in the encoding layer. Unfortunately, given the current handler interface, this is the only way to expose more information to the encoding layer.
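To make that concrete, a hypothetical sketch of what a context-carried finisher could look like; nothing like this exists in yarpc today and all names here are made up:

```go
// Hypothetical sketch of the context-carried "finisher"; the package, key,
// and function names are illustrative and do not exist in yarpc today.
package finisher

import "context"

// Unexported key type so the value can't collide with other context keys.
type finisherKey struct{}

// WithFinisher would be called by the TChannel inbound before dispatch,
// attaching a callback that releases the transport's frame buffers.
func WithFinisher(ctx context.Context, release func()) context.Context {
	return context.WithValue(ctx, finisherKey{}, release)
}

// Finish would be called by the encoding layer once the Thrift payload has
// been fully decoded (or copied out), signalling that the raw bytes can go.
func Finish(ctx context.Context) {
	if release, ok := ctx.Value(finisherKey{}).(func()); ok && release != nil {
		release()
	}
}
```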

This would take quite some effort to fix and verify, which we don't have the cycles for right now (as we focus on gRPC/Protobuf general availability). I'm willing to leave this ticket open, however, in case this issue comes up with more services and we end up having more cycles to make a proper fix.

peats-bond · May 20 '19

Alright, let's keep this open in that case. If this ever becomes high priority, we'll get back to it.

venkat1109 · May 21 '19

@AlexAPB Hi, this issue is happening again on our production cluster, causing our frontend hosts to periodically restart due to OOM. (attached profile: profile007.pdf)

Is there any plan to fix this issue soon?

wxing1292 · Nov 01 '19

Tracking this internally for now; I've asked @wxing1292 to file a ticket.

peats-bond · Nov 01 '19

Discussed off-ticket. This workload is about 100 RPS with 2-minute deadlines. There are intermediate buffers held between TChannel and Thrift: TChannel retains at least one 64 KB frame buffer regardless of payload size, which it releases only when the call is closed. ThriftRW could possibly be induced to release that buffer earlier and keep only its own copies of the data structures, like lists, that it reads lazily.
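For scale, a back-of-envelope sketch under the assumptions in this comment (every in-flight call pins one 64 KiB frame buffer for its full deadline; actual retention varies):

```go
// Back-of-envelope only; the numbers are assumptions taken from this thread
// (100 RPS, 2-minute deadlines, one 64 KiB frame buffer pinned per in-flight
// call), not a measurement.
package main

import "fmt"

func main() {
	const (
		rps          = 100
		deadlineSecs = 120
		frameBytes   = 64 * 1024
	)
	inFlight := rps * deadlineSecs // ~12,000 concurrent long-poll calls
	pinnedBytes := inFlight * frameBytes
	fmt.Printf("~%d in-flight calls, ~%d MiB pinned in frame buffers alone\n",
		inFlight, pinnedBytes/(1024*1024))
	// Larger payloads (multiple frames) and ThriftRW's own buffers push the
	// total well beyond this, consistent with the multi-GB heaps above.
}
```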

However, for now, we are not going to staff a solution to optimize for the case of long polling. Contributions to optimize TChannel and ThriftRW are welcome.

kriskowal · Nov 04 '19