hackney
badarg received when using async
On a server with a very high HTTP call volume, using async to "fire and forget" continuously produces this message in the error logger:
2015-12-09 23:56:17 =ERROR REPORT====
Error in process <0.3450.45> on node '[...]' with exit value:
{badarg,[{erlang,port_info,[#Port<0.148406>,connected],[]},{inet,tcp_controlling_process,2,[{file,"inet.erl"},{line,1522}]},{hackney_stream,stream_loop,4,[{file,"[...]hackney/src/hackney_stream.erl"},{line,85}]},{hackney_stream,init,4,[{file,"[...]hackney/src/hackney_stream.erl"},{line,30}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
2015-12-09 23:56:17 =SUPERVISOR REPORT====
Supervisor: {local,hackney_sup}
Context: child_terminated
Reason: shutdown
Offender: [{pid,<0.1418.45>},{id,[...]},{mfargs,{hackney_pool,start_link,[[...],[{name,[...]},{timeout,120000},{max_connections,8192}]]}},{restart_type,permanent},{shutdown,10000},{child_type,worker}]
{badarg,[{erlang,port_info,[#Port<0.58183>,connected],[]},{inet,tcp_controlling_process,2,[{file,"inet.erl"},{line,1522}]},{hackney_stream,stream_loop,4,[{file,"[...]/hackney/src/hackney_stream.erl"},{line,85}]},{hackney_stream,init,4,[{file,"[..]/hackney/src/hackney_stream.erl"},{line,30}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
The call that causes this issue is:
hackney:request(post, URL, Headers, Body, [async, {connect_timeout, 10000}, {pool, PoolName}]),
_ = receive _ -> ok end,
Where the custom pool is defined with the following options:
[{timeout, 120000}, {max_connections, 8192}]
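(For reference, a pool with these options is typically started like this; the name my_pool is a placeholder:)

PoolName = my_pool,
ok = hackney_pool:start_pool(PoolName, [{timeout, 120000},
                                        {max_connections, 8192}]),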
I'm gonna assume hackney is trying to give the socket back to the parent, but the socket has already been closed by the remote server, hence the badarg. This situation should be managed in the same manner as the synchronous call.
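If that is indeed the cause, here is a hedged sketch of what such handling could look like (give_back/2 is a hypothetical helper, not hackney's API): wrap the socket hand-off and turn the crash into an error return.

give_back(Socket, Owner) ->
    try gen_tcp:controlling_process(Socket, Owner) of
        ok -> ok;
        Error -> Error
    catch
        %% On some OTP versions the port BIFs raise badarg once the
        %% remote end has closed the socket, as in the report above.
        error:badarg -> {error, closed}
    end.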
@priestjim
This situation should be managed in the same manner as the synchronous call.
How do you mean? Return an error?
After further digging, I saw that this error should be getting caught here:
try
    stream_loop(Parent, Owner, Ref, Client#client{parser=Parser,
                                                  response_state=on_status})
catch Class:Reason ->
but it's not. I have no idea why!
This may not be related to the above at all, but does stream_loop call proc_lib:hibernate/3 at any point? That would explain the problem, since if the process hibernates, that catch is effectively gone from the call stack forever. I don't know exactly what would happen with the backtrace from get_stacktrace() though, so it could be unrelated. I'm fairly positive that code will never work as intended once the process hibernates, however.
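To illustrate why (a minimal standalone sketch, not hackney code): the try in init/0 below protects nothing once the process hibernates, because proc_lib:hibernate/3 never returns; the process resumes in awake/0 with an empty call stack, so the exception raised there terminates the process uncaught.

-module(hibernate_catch).
-export([start/0, init/0, awake/0]).

start() ->
    proc_lib:spawn(?MODULE, init, []).

init() ->
    try
        %% The stack frame holding this try is discarded here.
        proc_lib:hibernate(?MODULE, awake, [])
    catch
        Class:Reason ->
            %% Never reached once the process has hibernated.
            io:format("caught ~p:~p~n", [Class, Reason])
    end.

awake() ->
    receive crash -> erlang:error(badarg) end.

Running Pid = hibernate_catch:start(), Pid ! crash. kills the process with an uncaught badarg and the catch never fires, which matches the crash report above.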
I think it does, at multiple points: the hibernate calls at lines 99, 140 and 241 can all be reached while in the stream_loop function.
@benoitc is there any critical reason to hibernate at those points? @jlouis you saved the day again :)
But are they actually called in your case? That could explain the problem you are seeing. Add a helper function to the module which calls hibernate, then use dbg to trace whether that function gets called in production, I'd say. That should at least make it feasible to check for this case.
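As a concrete sketch of that suggestion (the helper name trace_hibernate/3 is an assumption, not hackney's API), funnel every hibernation in hackney_stream through one traceable function:

trace_hibernate(M, F, A) ->
    proc_lib:hibernate(M, F, A).

Then trace it from a shell on the production node:

1> dbg:tracer().
2> dbg:p(all, [call]).
3> dbg:tpl(hackney_stream, trace_hibernate, 3, []).

Any printed trace message confirms the stream process does hibernate.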
I am almost sure it does. The problem is that this happens only at extremely high request volumes; trying it locally never triggers the issue. I'll give it a shot though.
@priestjim all this code should be simpler. I have some code on my machine for it. I will try to finish the new_pool branch and do a release with it by the weekend. The general idea is to simplify the loop and the flow between the requester and the stream.
@jlouis I guess hackney_trace, when enabled, should be able to catch it. It may be worth a try.
Please do, we're dying over here; we hit ~350K calls/sec at one point :)
@priestjim it should land on Tuesday, I made good progress on it :)
@benoitc any progress on the new pool arch? Need a hand?
@priestjim #276 is the ticket to track it. It should land sometime next week. In the meantime I may have a quick fix for the hard limit.
Any updates on this issue? I am hitting exactly the same issue with version 1.15.1.
Hmm, it should have been closed by #276. There are also two milestones coming: a release of 1.16 next week, and 2.0, which will only support Erlang 21 and up. The 2.0 is a complete rework of the connection handling.
Are you saying that this bug is already closed? Or will it be closed after 2.0?