hackney
badarg received when using async
On a server with a very high HTTP call volume, using async to "fire and forget" continuously produces this message in the error logger:
2015-12-09 23:56:17 =ERROR REPORT====
Error in process <0.3450.45> on node '[...]' with exit value:
{badarg,[{erlang,port_info,[#Port<0.148406>,connected],[]},{inet,tcp_controlling_process,2,[{file,"inet.erl"},{line,1522}]},{hackney_stream,stream_loop,4,[{file,"[...]hackney/src/hackney_stream.erl"},{line,85}]},{hackney_stream,init,4,[{file,"[...]hackney/src/hackney_stream.erl"},{line,30}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
2015-12-09 23:56:17 =SUPERVISOR REPORT====
Supervisor: {local,hackney_sup}
Context: child_terminated
Reason: shutdown
Offender: [{pid,<0.1418.45>},{id,[...]},{mfargs,{hackney_pool,start_link,[[...],[{name,[...]},{timeout,120000},{max_connections,8192}]]}},{restart_type,permanent},{shutdown,10000},{child_type,worker}]
{badarg,[{erlang,port_info,[#Port<0.58183>,connected],[]},{inet,tcp_controlling_process,2,[{file,"inet.erl"},{line,1522}]},{hackney_stream,stream_loop,4,[{file,"[...]/hackney/src/hackney_stream.erl"},{line,85}]},{hackney_stream,init,4,[{file,"[..]/hackney/src/hackney_stream.erl"},{line,30}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
The call that causes this issue is:
hackney:request(post, URL, Headers, Body, [async, {connect_timeout, 10000}, {pool, PoolName}]),
_ = receive _ -> ok end,
Where the custom pool is defined with the following options:
[{timeout, 120000}, {max_connections, 8192}]
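(For reference, a pool with these options is typically started like this; the name my_pool is a placeholder:)

PoolName = my_pool,
ok = hackney_pool:start_pool(PoolName, [{timeout, 120000},
                                        {max_connections, 8192}]),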
I'm gonna assume hackney is trying to give the socket back to the parent, but the socket has already been closed by the remote server, hence the badarg. This situation should be managed in the same manner as the synchronous call.
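If that is indeed the cause, here is a hedged sketch of what such handling could look like (give_back/2 is a hypothetical helper, not hackney's API): wrap the socket hand-off and turn the crash into an error return.

give_back(Socket, Owner) ->
    try gen_tcp:controlling_process(Socket, Owner) of
        ok -> ok;
        Error -> Error
    catch
        %% On some OTP versions the port BIFs raise badarg once the
        %% remote end has closed the socket, as in the report above.
        error:badarg -> {error, closed}
    end.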
@priestjim
This situation should be managed in the same manner as the synchronous call.
How do you mean? Return an error?
After further digging, I saw that this error should be getting caught here:
try
    stream_loop(Parent, Owner, Ref, Client#client{parser=Parser,
                                                  response_state=on_status})
catch Class:Reason ->
but it's not. I have no idea why!
This may not be related to the above at all, but does stream_loop call proc_lib:hibernate/3 at any point? That would explain the problem, since if the process hibernates, that catch is effectively gone from the call stack forever. I don't know exactly what would happen with the backtrace from get_stacktrace() though, so it could be unrelated. I'm fairly positive that code will never work as intended once the process hibernates, however.
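To illustrate why (a minimal standalone sketch, not hackney code): the try in init/0 below protects nothing once the process hibernates, because proc_lib:hibernate/3 never returns; the process resumes in awake/0 with an empty call stack, so the exception raised there terminates the process uncaught.

-module(hibernate_catch).
-export([start/0, init/0, awake/0]).

start() ->
    proc_lib:spawn(?MODULE, init, []).

init() ->
    try
        %% The stack frame holding this try is discarded here.
        proc_lib:hibernate(?MODULE, awake, [])
    catch
        Class:Reason ->
            %% Never reached once the process has hibernated.
            io:format("caught ~p:~p~n", [Class, Reason])
    end.

awake() ->
    receive crash -> erlang:error(badarg) end.

Running Pid = hibernate_catch:start(), Pid ! crash. kills the process with an uncaught badarg and the catch never fires, which matches the crash report above.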
I think it does, at multiple points: the hibernate calls at lines 99, 140 and 241 can all be reached while in the stream_loop function.
@benoitc is there any critical reason to hibernate at those points? @jlouis you saved the day again :)
But are they actually called in your case? That could explain the problem you are seeing. Add a helper function to the module which calls hibernate, then use dbg to trace whether that function gets called in production, I'd say. That should at least make it feasible to check for this case.
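As a concrete sketch of that suggestion (the helper name trace_hibernate/3 is an assumption, not hackney's API), funnel every hibernation in hackney_stream through one traceable function:

trace_hibernate(M, F, A) ->
    proc_lib:hibernate(M, F, A).

Then trace it from a shell on the production node:

1> dbg:tracer().
2> dbg:p(all, [call]).
3> dbg:tpl(hackney_stream, trace_hibernate, 3, []).

Any printed trace message confirms the stream process does hibernate.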
I am almost sure it does. The problem is that this happens only at extremely high request volumes; trying it locally never triggers the issue. I'll give it a shot though.
@priestjim all this code should be simpler. I have some code on my machine for it. I will try to finish the new_pool branch and do a release with it by the weekend. The general idea is to simplify the loop and the flow between the requester and the stream.
@jlouis I guess hackney_trace, when enabled, should be able to catch it. It may be worth a try.
Please do, we're dying over here; we hit ~350K calls/sec at one point :)
@priestjim it should land on Tuesday, I made good progress on it :)
@benoitc any progress on the new pool arch? Need a hand?
@priestjim #276 is the ticket to track it. It should land sometime next week. In the meantime I may have a quick fix for the hard limit.
Any updates on this issue? I am hitting exactly the same issue with version 1.15.1.
Hmm, it should have been closed by #276. There are also two milestones coming: a release of 1.16 next week, and 2.0, which will only support Erlang 21 and up. The 2.0 is a complete rework of the connection handling.
Are you saying that this bug is already closed? Or will it be closed after 2.0?