Hackney pool overload
Hello,
It is possible to overload the hackney pool and render it unresponsive, or even crash the Erlang node. The case is like this:
Some process sends simple HTTP requests to a server that can handle no more than N RPS, at a rate of N * k (k >= 2).
At some point the hackney pool queue size begins to grow indefinitely (with my numbers the receiving server handles ~300 RPS and the client sends 600-700 RPS).
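For illustration, a rough load-generation sketch that reproduces the scenario (the endpoint URL, pool name and numbers are placeholders, not part of the real setup):

-module(pool_overload).
-export([run/2]).

%% Spawn Workers processes that each fire Requests GET requests through a
%% shared hackney pool, faster than the target server can serve them.
run(Workers, Requests) ->
    {ok, _} = application:ensure_all_started(hackney),
    ok = hackney_pool:start_pool(overload_test,
                                 [{max_connections, 100}, {timeout, 3000}]),
    [spawn(fun() -> loop(Requests) end) || _ <- lists:seq(1, Workers)],
    ok.

loop(0) ->
    ok;
loop(N) ->
    %% Every request checks a connection out of the shared pool; once the
    %% server falls behind, checkouts start queueing in the pool process.
    case hackney:request(get, <<"http://localhost:8080/ping">>, [], <<>>,
                         [{pool, overload_test}, {connect_timeout, 3000}]) of
        {ok, _Status, _Headers, Ref} ->
            _ = hackney:body(Ref);
        {error, _Reason} ->
            ok
    end,
    loop(N - 1).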
Process info for the pool shows this:
[{current_function,{queue,filter_r,2}},
{initial_call,{proc_lib,init_p,5}},
{status,runnable},
{message_queue_len,305621},
{messages,[{'$gen_cast',{checkout_cancel,{"tapi.sgrd.ix.wgcrowd.io",
80,hackney_tcp},
#Ref<0.0.1835010.103289>}},
{'$gen_cast',{checkout_cancel,{"tapi.sgrd.ix.wgcrowd.io",80,
hackney_tcp},
#Ref<0.0.1572872.124197>}},
{'$gen_cast',{checkout_cancel,{"tapi.sgrd.ix.wgcrowd.io",80,
hackney_tcp},
#Ref<0.0.1835009.101047>}},
{'DOWN',#Ref<0.0.1572866.107847>,request,<0.15587.32>,
normal},
{'DOWN',#Ref<0.0.1572868.125051>,request,<0.15587.32>,
normal},
{'DOWN',#Ref<0.0.1572869.178066>,request,<0.15587.32>,
normal},
{'DOWN',#Ref<0.0.1572865.99206>,request,<0.15586.32>,normal},
{'DOWN',#Ref<0.0.1572865.158565>,request,<0.15586.32>,
normal},
{'DOWN',#Ref<0.0.1572867.214897>,request,<0.15586.32>,
normal},
{'DOWN',#Ref<0.0.1572866.107844>,request,<0.15584.32>,
normal},
{'DOWN',#Ref<0.0.1572869.91611>,request,<0.15584.32>,normal},
{'DOWN',#Ref<0.0.1572872.6542>,request,<0.15584.32>,normal},
{'$gen_cast',{checkout_cancel,{"tapi.sgrd.ix.wgcrowd.io",80,
hackney_tcp},
#Ref<0.0.1835011.64499>}},
{'$gen_cast',{checkout_cancel,{"tapi.sgrd.ix.wgcrowd.io",80,
hackney_tcp},
#Ref<0.0.1572872.124201>}},
{'DOWN',#Ref<0.0.1572867.71002>,request,<0.15585.32>,normal},
{'DOWN',#Ref<0.0.1572871.14102>,request,<0.15585.32>,normal},
{'DOWN',#Ref<0.0.1572868.211148>,request,<0.15585.32>,
normal},
{'$gen_cast',{checkout_cancel,{...},...}},
{'$gen_cast',{checkout_cancel,...}},
{'$gen_call',{...},...},
{'$gen_cast',...},
{...}|...]},
{links,[<0.982.0>]},
{dictionary,[{'$ancestors',[hackney_sup,<0.981.0>]},
{'$initial_call',{hackney_pool,init,1}}]},
{trap_exit,false},
{error_handler,error_handler},
{priority,normal},
{group_leader,<0.980.0>},
{total_heap_size,25188109},
{heap_size,2984878},
{stack_size,8939},
{reductions,7087623884},
{garbage_collection,[{min_bin_vheap_size,46422},
{min_heap_size,233},
{fullsweep_after,65535},
{minor_gcs,199}]},
{suspending,[]}]
Pool configuration looks like this:
max_conn => 100,
timeout => 3000
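For reference, a pool with equivalent limits can be created through hackney_pool directly, roughly like this (a sketch; my_pool and the URL are placeholder names):

ok = hackney_pool:start_pool(my_pool, [{max_connections, 100},
                                       {timeout, 3000}]),
%% the pool is then selected per request via the pool option
hackney:request(get, <<"http://example.com/">>, [], <<>>, [{pool, my_pool}]).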
Since the message queue size is huge, the pool becomes completely unresponsive.
My proposal: maybe it's a good idea to set up some back pressure, or introduce a defensive mechanism that would avoid locking up the pool.
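One possible shape for such a guard on the caller side (just a sketch, not hackney API apart from hackney_pool:get_stats/1; the function name and threshold are made up):

%% Refuse to issue a request when the pool's checkout queue is already deep,
%% instead of letting the pool process's mailbox grow without bound.
request_with_backpressure(Pool, URL, MaxQueued) ->
    Stats = hackney_pool:get_stats(Pool),
    case proplists:get_value(queue_count, Stats, 0) of
        Queued when Queued > MaxQueued ->
            {error, pool_overloaded};
        _ ->
            hackney:request(get, URL, [], <<>>, [{pool, Pool}])
    end.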
Also, it seems a bit unclear that hackney_pool answers {error, connect_timeout} when the gen_server:call to the pool times out. That adds to the confusion in the case described above: the connection was never even attempted, because the gen_server call itself timed out, yet the caller sees a connect timeout.
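To illustrate the ambiguity (a sketch; my_pool and handle_timeout/0 are hypothetical), the caller cannot tell the two situations apart:

case hackney:request(get, <<"http://example.com/">>, [], <<>>,
                     [{pool, my_pool}]) of
    {error, connect_timeout} ->
        %% could mean "the TCP connect timed out" or "we waited too long
        %% for a pool checkout"; both come back as the same error
        handle_timeout();
    {ok, Status, _Headers, Ref} ->
        {ok, Body} = hackney:body(Ref),
        {Status, Body}
end.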
I noticed similar issues, and when testing I alleviated a lot of them with this fix: https://github.com/benoitc/hackney/pull/540. It's not a complete fix, but it definitely helps the symptoms here. @benoitc
I'm not sure if it's the same problem, but my pools are getting stuck as well. I'm using hackney 1.15.0. Running get_stats on the pool returns:
iex([email protected])14> :hackney_pool.get_stats(:"client-logger")
[
name: :"client-logger",
max: 400,
in_use_count: 0,
free_count: 0,
queue_count: 13961
]
in_use_count is 0 but the queue count is huge.
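A possible stopgap while this is open, sketched here (the function name and threshold are invented; only get_stats/1, stop_pool/1 and start_pool/2 are actual hackney_pool calls):

%% If nothing is checked out but the checkout queue keeps growing, assume
%% the pool is stuck and recreate it with the same options.
maybe_restart_pool(Pool, PoolOpts, MaxQueued) ->
    Stats = hackney_pool:get_stats(Pool),
    InUse = proplists:get_value(in_use_count, Stats, 0),
    Queued = proplists:get_value(queue_count, Stats, 0),
    case {InUse, Queued} of
        {0, Q} when Q > MaxQueued ->
            ok = hackney_pool:stop_pool(Pool),
            hackney_pool:start_pool(Pool, PoolOpts);
        _ ->
            ok
    end.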