
Binary leaking in h2_connection processes

Open · jonasrichard opened this issue on Oct 31 '17 · 15 comments

I use chatterbox as a dependency of the inaka/apns4erl project. After some hours of running the application I noticed ever-growing process sizes (process_info(Pid, memory)). After some investigation I found this:

process_info(pid(0,1238,0), binary).
{binary,[{139665827264464,60,1},
         {139665831210656,60,1},
         {139664600088128,60,1},
         {139665831140616,60,1},
         {139665831222376,60,1},
         {139664600244240,60,1},
         {139665827188416,60,1},
         {139664600522856,60,1},
         {139664605657712,60,1},
         {139664605954696,60,1},
         {139664605903872,60,1},
         {139664606052304,60,1},
         {139664600433752,60,1},
         {139664605649112,60,1},
         {139664531257216,60,1},
         {139665827110208,60,1},
         {139664605153864,60,1},
         {139665827411976,60,1},
         {139665827397160,60,1},
         {139665827115824,60,1},
         {139664605243344,60,1},
         {139664605133416,60,1},
         {139664605420536,60,1},
         {139664605405256,60,1},
         {139664637840368,60,...},
         {139664638044448,...},
         {...}|...]}
(myhost@my_ip)48> length(element(2, v(47))).
1851

Do you have a clue where this binary leak might be occurring?

jonasrichard · Oct 31 '17

If you call for a manual garbage collection of the process, do these go away? If so, the leak is probably fixable through well-placed hibernation calls more than anything else.
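One quick way to check that from the shell is to compare the refc binary count before and after forcing a collection on the suspect process, along these lines:

%% Compare the refc binary count before and after forcing a GC on the
%% suspect process (pid(0,1238,0) is the process inspected above).
Pid = pid(0, 1238, 0).
length(element(2, process_info(Pid, binary))).
garbage_collect(Pid).
length(element(2, process_info(Pid, binary))).

If the count drops sharply after the forced GC, the process is merely holding references longer than it needs to rather than leaking them outright.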

ferd · Oct 31 '17

I did an erlang:garbage_collect/0 but the binaries didn't go away. I don't know the process structure of chatterbox, but I think there are two processes passing binaries between them, and after some matching one process ends up holding refc binaries. It is hard to pin down when those binaries are retained, because at least 50,000 HTTP/2 requests were sent (far more than the number of refc binaries), so I think it happens only in edge cases. I don't use chatterbox directly but via apns4erl: https://github.com/inaka/apns4erl. Tomorrow I will check whether other processes show something similar. Do you think I would be able to see those binaries in a crash dump? Or would the GC during the crash dump eliminate them?

jonasrichard · Nov 01 '17

If you can see them in the crash dump, they were really being held in memory by the process. If they are gone, that's because the VM ran a GC before writing them to the dump.

ferd · Nov 01 '17

Ok, calling garbage_collect(Pid) decreased the number of refc binaries from 1932 to 84. So we will probably call GC only on those processes whose refc binary count exceeds a threshold. Does that sound right (not too odd)?
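A sweep like that could look roughly like the snippet below; the threshold of 1000 is arbitrary, and walking every process this way is expensive, so it is only a stop-gap:

%% One-off sweep: force a GC on every process that holds more refc binaries
%% than an arbitrary threshold. A stop-gap, not a fix.
Threshold = 1000.
[garbage_collect(P) || P <- processes(),
                       case process_info(P, binary) of
                           {binary, Bins} -> length(Bins) > Threshold;
                           undefined     -> false
                       end].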

jonasrichard · Nov 02 '17

That can work temporarily. I think the ideal fix in these cases, for now, is to modify the risky processes to hibernate from time to time when they get downtime, so they force this on their own.
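In gen_statem terms, forcing that from inside the process could look roughly like the sketch below: an idle timer whose expiry returns the hibernate action. The module name, event names, state layout, and timeout value are made up for illustration; they are not chatterbox's real ones.

-module(idle_hibernate_sketch).
-behaviour(gen_statem).
-export([start_link/0]).
-export([init/1, callback_mode/0, handle_event/4]).

start_link() ->
    gen_statem:start_link(?MODULE, [], []).

callback_mode() -> handle_event_function.

init([]) ->
    %% Arm an idle timer right away; any activity below re-arms it.
    {ok, connected, #{}, [{{timeout, idle}, 30000, hibernate_now}]}.

%% When the idle timer fires, hibernate: this triggers a full-sweep GC
%% and drops references to refc binaries the process no longer needs.
handle_event({timeout, idle}, hibernate_now, _State, _Data) ->
    {keep_state_and_data, [hibernate]};
%% Any other event counts as activity: keep going and re-arm the timer.
handle_event(_Type, _Event, _State, Data) ->
    {keep_state, Data, [{{timeout, idle}, 30000, hibernate_now}]}.

Hibernation compacts the heap and releases the lingering binary references, at the cost of rebuilding the heap when the next message arrives.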

ferd · Nov 02 '17

It is also interesting that process_info(memory) and the heap size reported by a plain process_info give me different results. And like I said, the reductions were huge. So I decided to reconnect every 10 minutes instead of relying on GC.

> process_info(P1, memory).
{memory,201303168}
> process_info(P1).
...
 {total_heap_size,17474138},
 {heap_size,2072833},
 {stack_size,10},
 {reductions,123302750526},
 {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,41}]},
...

jonasrichard · Nov 02 '17

Looking at this again: if it makes sense, we could add a hibernate_after option to the gen_statem:start_link call in h2_connection so it hibernates when idle.

Suggestions for what a good timeout would be?
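For what it's worth, gen_statem:start_link/3,4 accepts a {hibernate_after, Timeout} start option in recent OTP releases, so the change could be as small as something like this (InitArgs and the 10-second value are placeholders, not h2_connection's actual arguments):

%% Sketch only: InitArgs stands in for whatever h2_connection:init/1 expects.
%% hibernate_after makes the connection process hibernate after 10 s of
%% inactivity, which forces a full-sweep GC and drops lingering refc binaries.
{ok, Pid} = gen_statem:start_link(h2_connection, InitArgs, [{hibernate_after, 10000}]).

The right timeout is a trade-off: short enough that mostly idle connections release memory, long enough that busy connections don't constantly pay the wake-up cost.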

tsloughter · Jun 25 '18

Did anyone find the cause of the leaking memory? I have the same problem; the garbage_collect call mentioned above reclaimed only about a third of the memory.

> lists:reverse(lists:usort([{process_info(P, memory), case process_info(P, registered_name) of {_, N} -> N; _ -> P end} || P <- processes()])).
[{{memory,73238512},<0.16417.364>},
 {{memory,73238512},<0.16368.364>},
 {{memory,73238512},<0.16350.364>},
 {{memory,73238512},<0.16343.364>},
 {{memory,70935368},<0.16419.364>},
 {{memory,70935368},<0.16372.364>},
 {{memory,70935368},<0.16325.364>},
 {{memory,69016080},<0.16440.364>},
 {{memory,69016080},<0.16429.364>},
 {{memory,69016080},<0.16422.364>},
 {{memory,69016080},<0.16405.364>},
 {{memory,69016080},<0.16384.364>},
 {{memory,69016080},<0.16338.364>},
 {{memory,69016080},<0.16337.364>},
 {{memory,69016080},<0.16319.364>},
 {{memory,61032264},<0.16391.364>},
 {{memory,4119652},code_server},
 {{memory,1803720},<0.1733.0>},
 {{memory,1803720},<0.1274.0>},
 {{memory,...},...},
 {{...},...},
 {...}|...]
> process_info(list_to_pid("<0.16417.364>")).
[{current_function,{gen_statem,loop_receive,3}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,0},
 {links,[<0.16300.364>,#Port<0.79272>]},
 {dictionary,[{'$initial_call',{h2_connection,init,1}},
              {'$ancestors',[<0.16300.364>,<0.16257.364>,
                             grpcbox_channel_sup,grpcbox_sup,<0.1063.0>]}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.1062.0>},
 {total_heap_size,8866796},
 {heap_size,1439468},
 {stack_size,11},
 {reductions,47123903084},
 {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,6023}]},
 {suspending,[]}]
> garbage_collect(list_to_pid("<0.16384.364>")).
....
> lists:reverse(lists:usort([{process_info(P, memory), case process_info(P, registered_name) of {_, N} -> N; _ -> P end} || P <- processes()])).
[{{memory,85122328},<0.16391.364>},
 {{memory,55082824},<0.16440.364>},
 {{memory,52779680},<0.16372.364>},
 {{memory,50860392},<0.16429.364>},
 {{memory,50860392},<0.16325.364>},
 {{memory,50860392},<0.16319.364>},
 {{memory,48205672},<0.16422.364>},
 {{memory,48205672},<0.16337.364>},
 {{memory,43983240},<0.16343.364>},
 {{memory,42474712},<0.16384.364>},
 {{memory,42383832},<0.16419.364>},
 {{memory,40171568},<0.16350.364>},
 {{memory,40171568},<0.16338.364>},
 {{memory,38252280},<0.16417.364>},
 {{memory,38252280},<0.16405.364>},
 {{memory,38252280},<0.16368.364>},
 {{memory,4119652},code_server},
 {{memory,1803720},<0.1733.0>},
 {{memory,1803720},<0.1274.0>},
 {{memory,...},...},
 {{...},...},
 {...}|...]

I'm using chatterbox via grpcbox.

sebastiw · Aug 06 '20

Yeah, the post before yours does mention a possible remediation: hibernating when idle. Someone likely just needs to go and implement it.

ferd · Aug 06 '20

So we have been tracing and found out that the client responses are only cleaned up from the h2_stream state if the function h2_connection:get_response/2 is called.

For instance, grpcbox_client:unary_handler does not call this function. For chatterbox, I think there should be an API for closing/removing a stream when not using get_response, and that function could then be called by client libraries such as grpcbox.

What do you think?

sebastiw · Aug 11 '20

The grpcbox client doesn't call it because it gets the body from messages sent to the client pid, right? If that is the case, then it sounds more like chatterbox should have one mode where it keeps the response until it is requested and another where it only sends it to some pid.

tsloughter · Aug 11 '20

Yes, there seem to be two flows in h2: one with asynchronous bangs to a notify pid, and one with some synchronous parts. It might be a good idea to split them into two, or at least clarify which is which.

We went with an easy quick fix which I will try to test tomorrow: basically just calling h2_connection:get_response(Pid, Stream) after receiving the data in grpcbox_client:unary_handler/6.
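The fix is roughly shaped like the sketch below; the module, the 'END_STREAM' message pattern, and the argument names are assumptions for illustration, not the actual grpcbox or chatterbox code:

-module(unary_fix_sketch).
-export([await_unary/3]).

%% Rough shape of the quick fix: once the stream has ended, fetch the
%% response via h2_connection:get_response/2 so the buffered data is
%% released from the h2_stream state. The 'END_STREAM' message shape is
%% an assumption about what the notify pid receives.
await_unary(ConnPid, StreamId, Timeout) ->
    receive
        {'END_STREAM', StreamId} ->
            h2_connection:get_response(ConnPid, StreamId)
    after Timeout ->
            {error, timeout}
    end.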

sebastiw · Aug 11 '20

Yeah, pretty sure the notify-a-pid flow was added as a quick hack at the time. It could definitely use some cleaning up.

tsloughter · Aug 11 '20

Looking at this again.

As far as I can tell from the code, the only time frames are added to the queue that becomes the received response body is when the callback is undefined. So I don't get how the response body is built up if that is the case.

tsloughter · Sep 12 '22

I'm adding a hibernate_after option to h2_connection, and I also merged a fix for terminating gRPC streams that may help, not sure: https://github.com/tsloughter/chatterbox/pull/15

tsloughter · Sep 12 '22