
Binary leaking in h2_connection processes

Open · jonasrichard opened this issue on Oct 31 '17 · 15 comments

I use chatterbox as a dependency of the inaka/apns4erl project. After some hours of running the application I noticed ever-growing process sizes (process_info(Pid, memory)). After some investigation I found this:

process_info(pid(0,1238,0), binary).
{binary,[{139665827264464,60,1},
         {139665831210656,60,1},
         {139664600088128,60,1},
         {139665831140616,60,1},
         {139665831222376,60,1},
         {139664600244240,60,1},
         {139665827188416,60,1},
         {139664600522856,60,1},
         {139664605657712,60,1},
         {139664605954696,60,1},
         {139664605903872,60,1},
         {139664606052304,60,1},
         {139664600433752,60,1},
         {139664605649112,60,1},
         {139664531257216,60,1},
         {139665827110208,60,1},
         {139664605153864,60,1},
         {139665827411976,60,1},
         {139665827397160,60,1},
         {139665827115824,60,1},
         {139664605243344,60,1},
         {139664605133416,60,1},
         {139664605420536,60,1},
         {139664605405256,60,1},
         {139664637840368,60,...},
         {139664638044448,...},
         {...}|...]}
(myhost@my_ip)48> length(element(2, v(47))).
1851

Do you have a clue where this binary leak might be occurring?

jonasrichard · Oct 31 '17

If you call for a manual garbage collection of the process, do these go away? If so, the leak is probably fixable through well-placed hibernation calls more than anything else.
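One quick way to check that from the shell is to compare the refc binary count before and after forcing a collection on the suspect process, along these lines:

%% Compare the refc binary count before and after forcing a GC on the
%% suspect process (pid(0,1238,0) is the process inspected above).
Pid = pid(0, 1238, 0).
length(element(2, process_info(Pid, binary))).
garbage_collect(Pid).
length(element(2, process_info(Pid, binary))).

If the count drops sharply after the forced GC, the process is merely holding references longer than it needs to rather than leaking them outright.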

ferd · Oct 31 '17

I did an erlang:garbage_collect/0 but the binaries didn't go away. I don't know the process structure of chatterbox, but I think there are two processes passing binaries between them, and after some matching one process ends up holding refc binaries. It is hard to pin down when those binaries are retained, because at least 50,000 HTTP/2 requests were sent (far more than the number of refc binaries), so I think it happens only in edge cases. I don't use chatterbox directly but via apns4erl: https://github.com/inaka/apns4erl. Tomorrow I will check whether other processes show something similar. Do you think I would be able to see those binaries in a crash dump? Or would the GC during the crash dump eliminate them?

jonasrichard · Nov 01 '17

If you can see them in the crash dump, they were really being held in memory by the process. If they are gone, that's because the VM ran a GC before writing them to the dump.

ferd · Nov 01 '17

Ok, calling garbage_collect(Pid) decreased the number of refc binaries from 1932 to 84. So we will probably call GC only on those processes whose refc binary count exceeds a threshold. Does that sound right (not too odd)?
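A sweep like that could look roughly like the snippet below; the threshold of 1000 is arbitrary, and walking every process this way is expensive, so it is only a stop-gap:

%% One-off sweep: force a GC on every process that holds more refc binaries
%% than an arbitrary threshold. A stop-gap, not a fix.
Threshold = 1000.
[garbage_collect(P) || P <- processes(),
                       case process_info(P, binary) of
                           {binary, Bins} -> length(Bins) > Threshold;
                           undefined     -> false
                       end].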

jonasrichard · Nov 02 '17

That can work temporarily. I think the ideal fix in these cases, for now, is to modify the risky processes to hibernate from time to time when they get downtime, so they force this on their own.
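In gen_statem terms, forcing that from inside the process could look roughly like the sketch below: an idle timer whose expiry returns the hibernate action. The module name, event names, state layout, and timeout value are made up for illustration; they are not chatterbox's real ones.

-module(idle_hibernate_sketch).
-behaviour(gen_statem).
-export([start_link/0]).
-export([init/1, callback_mode/0, handle_event/4]).

start_link() ->
    gen_statem:start_link(?MODULE, [], []).

callback_mode() -> handle_event_function.

init([]) ->
    %% Arm an idle timer right away; any activity below re-arms it.
    {ok, connected, #{}, [{{timeout, idle}, 30000, hibernate_now}]}.

%% When the idle timer fires, hibernate: this triggers a full-sweep GC
%% and drops references to refc binaries the process no longer needs.
handle_event({timeout, idle}, hibernate_now, _State, _Data) ->
    {keep_state_and_data, [hibernate]};
%% Any other event counts as activity: keep going and re-arm the timer.
handle_event(_Type, _Event, _State, Data) ->
    {keep_state, Data, [{{timeout, idle}, 30000, hibernate_now}]}.

Hibernation compacts the heap and releases the lingering binary references, at the cost of rebuilding the heap when the next message arrives.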

ferd · Nov 02 '17

It is also interesting that process_info(memory) and the heap size reported by a plain process_info give me different results. And like I said, the reductions were huge. So I decided to reconnect every 10 minutes instead of relying on GC.

> process_info(P1, memory).
{memory,201303168}
> process_info(P1).
...
 {total_heap_size,17474138},
 {heap_size,2072833},
 {stack_size,10},
 {reductions,123302750526},
 {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,41}]},
...

jonasrichard · Nov 02 '17

Looking at this again: if it makes sense, we could add a hibernate_after option to the gen_statem:start_link call in h2_connection so it hibernates when idle.

Suggestions for what a good timeout would be?
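For what it's worth, gen_statem:start_link/3,4 accepts a {hibernate_after, Timeout} start option in recent OTP releases, so the change could be as small as something like this (InitArgs and the 10-second value are placeholders, not h2_connection's actual arguments):

%% Sketch only: InitArgs stands in for whatever h2_connection:init/1 expects.
%% hibernate_after makes the connection process hibernate after 10 s of
%% inactivity, which forces a full-sweep GC and drops lingering refc binaries.
{ok, Pid} = gen_statem:start_link(h2_connection, InitArgs, [{hibernate_after, 10000}]).

The right timeout is a trade-off: short enough that mostly idle connections release memory, long enough that busy connections don't constantly pay the wake-up cost.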

tsloughter · Jun 25 '18

Did anyone find the cause of the leaking memory? I have the same problem; the garbage_collect call mentioned above reclaimed only about a third of the memory.

> lists:reverse(lists:usort([{process_info(P, memory), case process_info(P, registered_name) of {_, N} -> N; _ -> P end} || P <- processes()])).
[{{memory,73238512},<0.16417.364>},
 {{memory,73238512},<0.16368.364>},
 {{memory,73238512},<0.16350.364>},
 {{memory,73238512},<0.16343.364>},
 {{memory,70935368},<0.16419.364>},
 {{memory,70935368},<0.16372.364>},
 {{memory,70935368},<0.16325.364>},
 {{memory,69016080},<0.16440.364>},
 {{memory,69016080},<0.16429.364>},
 {{memory,69016080},<0.16422.364>},
 {{memory,69016080},<0.16405.364>},
 {{memory,69016080},<0.16384.364>},
 {{memory,69016080},<0.16338.364>},
 {{memory,69016080},<0.16337.364>},
 {{memory,69016080},<0.16319.364>},
 {{memory,61032264},<0.16391.364>},
 {{memory,4119652},code_server},
 {{memory,1803720},<0.1733.0>},
 {{memory,1803720},<0.1274.0>},
 {{memory,...},...},
 {{...},...},
 {...}|...]
> process_info(list_to_pid("<0.16417.364>")).
[{current_function,{gen_statem,loop_receive,3}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,0},
 {links,[<0.16300.364>,#Port<0.79272>]},
 {dictionary,[{'$initial_call',{h2_connection,init,1}},
              {'$ancestors',[<0.16300.364>,<0.16257.364>,
                             grpcbox_channel_sup,grpcbox_sup,<0.1063.0>]}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.1062.0>},
 {total_heap_size,8866796},
 {heap_size,1439468},
 {stack_size,11},
 {reductions,47123903084},
 {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,6023}]},
 {suspending,[]}]
> garbage_collect(list_to_pid("<0.16384.364>")).
....
> lists:reverse(lists:usort([{process_info(P, memory), case process_info(P, registered_name) of {_, N} -> N; _ -> P end} || P <- processes()])).
[{{memory,85122328},<0.16391.364>},
 {{memory,55082824},<0.16440.364>},
 {{memory,52779680},<0.16372.364>},
 {{memory,50860392},<0.16429.364>},
 {{memory,50860392},<0.16325.364>},
 {{memory,50860392},<0.16319.364>},
 {{memory,48205672},<0.16422.364>},
 {{memory,48205672},<0.16337.364>},
 {{memory,43983240},<0.16343.364>},
 {{memory,42474712},<0.16384.364>},
 {{memory,42383832},<0.16419.364>},
 {{memory,40171568},<0.16350.364>},
 {{memory,40171568},<0.16338.364>},
 {{memory,38252280},<0.16417.364>},
 {{memory,38252280},<0.16405.364>},
 {{memory,38252280},<0.16368.364>},
 {{memory,4119652},code_server},
 {{memory,1803720},<0.1733.0>},
 {{memory,1803720},<0.1274.0>},
 {{memory,...},...},
 {{...},...},
 {...}|...]

I'm using chatterbox via grpcbox.

sebastiw · Aug 06 '20

Yeah, the post before yours does mention a possible remediation: hibernating when idle. Someone likely just needs to go and implement it.

ferd · Aug 06 '20

So we have been tracing and found out that the client responses are only cleaned up from the h2_stream state if the function h2_connection:get_response/2 is called.

For instance, grpcbox_client:unary_handler does not call this function. For chatterbox, I think there should be an API for closing/removing a stream when not using get_response, and that function could then be called by client libraries such as grpcbox.

What do you think?

sebastiw · Aug 11 '20

The grpcbox client doesn't call it because it gets the body from messages sent to the client pid, right? If that is the case, then it sounds more like chatterbox should have one mode where it keeps the response until it is requested and another where it only sends it to some pid.

tsloughter · Aug 11 '20

Yes, there seem to be two flows in h2: one with asynchronous bangs to a notify pid, and one with some synchronous parts. It might be a good idea to split them into two, or at least clarify which is which.

We went with an easy quick fix which I will try to test tomorrow: basically just calling h2_connection:get_response(Pid, Stream) after receiving the data in grpcbox_client:unary_handler/6.
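The fix is roughly shaped like the sketch below; the module, the 'END_STREAM' message pattern, and the argument names are assumptions for illustration, not the actual grpcbox or chatterbox code:

-module(unary_fix_sketch).
-export([await_unary/3]).

%% Rough shape of the quick fix: once the stream has ended, fetch the
%% response via h2_connection:get_response/2 so the buffered data is
%% released from the h2_stream state. The 'END_STREAM' message shape is
%% an assumption about what the notify pid receives.
await_unary(ConnPid, StreamId, Timeout) ->
    receive
        {'END_STREAM', StreamId} ->
            h2_connection:get_response(ConnPid, StreamId)
    after Timeout ->
            {error, timeout}
    end.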

sebastiw · Aug 11 '20

Yeah, pretty sure the notify-a-pid flow was added as a quick hack at the time. It could definitely use some cleaning up.

tsloughter · Aug 11 '20

Looking at this again.

As far as I can tell from the code, the only time frames are added to the queue that becomes the received response body is when the callback is undefined. So I don't get how the response body is built up if that is the case.

tsloughter · Sep 12 '22

I'm adding a hibernate_after option to h2_connection, and I also merged a fix for terminating gRPC streams that may help, not sure: https://github.com/tsloughter/chatterbox/pull/15

tsloughter · Sep 12 '22