:connect_timeout errors on initial batch of requests
I've been following/commenting on some previous issues regarding :connect_timeout errors reported here, and the recent 0.11/0.12 releases have reduced these dramatically. However, I'm still experiencing an interesting issue using Hackney 1.12.1 and HTTPoison 0.13.0.
I make an initial batch of ~80 requests to several hosts simultaneously. Of those, about half result in %HTTPoison.Error{id: nil, reason: :connect_timeout}. The same batch repeats every 2 minutes, and the second time around only ~10-15 fail. After that, subsequent batches produce no :connect_timeout errors. With HTTPoison, I'm passing [timeout: 1000, recv_timeout: 5000] as config.
This has been consistently happening for some time now: errors in the first 2 batches, then all is good. I'm wondering why this is happening, and whether there is some kind of workaround that could be applied? Thanks!
Update
Just wanted to add that all requests are fired within a Task.start/1 function.
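To make the setup a bit more concrete, here's a rough sketch of what the request pattern looks like (the host list, path, body and headers below are placeholders, not the actual application code):

require Logger

hosts = Enum.map(1..30, fn i -> "https://instance#{i}.example.com" end)

Enum.each(hosts, fn host ->
  Task.start(fn ->
    url = host <> "/services/data/v39.0/composite/batch"
    body = ~s({"batchRequests": []})
    headers = [{"authorization", "Bearer [FILTERED]"}, {"content-type", "application/json"}]

    # timeout is the connect timeout (ms), recv_timeout the response timeout (ms)
    case HTTPoison.post(url, body, headers, timeout: 1000, recv_timeout: 5000) do
      {:ok, %HTTPoison.Response{status_code: status}} ->
        Logger.info("#{host} responded with #{status}")

      {:error, %HTTPoison.Error{reason: reason}} ->
        # roughly half of the first batch ends up here with reason == :connect_timeout
        Logger.warn("#{host} failed: #{inspect(reason)}")
    end
  end)
end)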
Other versions: Elixir 1.6, OTP 20.0.
Hello,
Same here (or as https://github.com/benoitc/hackney/issues/462). After a week on a production server, it hangs on {error,connect_timeout}. The only way to fix it is restarting the node.
Plain Erlang 20.1 (erts-9.2) Hackney 1.12.1 (e6d84f40a4d2d650659ac54dba99c0bcbbb86af4).
I'll add the state of the hackney pool here when it happens next time.
@svileng can you eventually provide a trace of your connections? It seems weird to return connect_timeout the first time: if the pool is empty or no host is already connected, it should normally connect directly to it. Also, how many distinct hosts do you have?
@desoulter ^^
@benoitc Thank you for looking into this!
I would guess the pool is empty or at least nearly empty: as soon as my application starts, a GenServer schedules the first batch of requests after 2 minutes, and they are made on the same interval after that. It might be possible that a few individual requests are made before the batch kicks in. The errors consistently happen the first 2 times when the application starts (the second time around always produces fewer errors).
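One way I could try to confirm that guess (assuming hackney_pool:get_stats/1 is the right thing to call here; that's an assumption on my side, not something I've wired in yet) would be to dump the default pool's counters right before the batch fires:

# dump the default hackney pool's stats (e.g. in_use_count / free_count / queue_count)
:hackney_pool.get_stats(:default)
|> IO.inspect(label: "hackney pool stats before batch")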
The number of distinct hosts is exactly 30 (not quite "several" as I said above).
How can I get a trace, could you please point me to the right configuration option?
@svileng you can start to trace with hackney_trace:enable/2:
hackney_trace:enable(max, io).
Put that line before starting all your requests and it should log to output. io can be replaced by a filename as a string.
Hope it helps.
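If you are calling hackney through HTTPoison, the same calls from Elixir look roughly like this (the file name is just an example):

# log everything hackney does to stdout...
:hackney_trace.enable(:max, :io)

# ...or to a file instead (note the filename must be a charlist)
:hackney_trace.enable(:max, 'hackney_trace.log')

# and switch tracing off again once the requests are done
:hackney_trace.disable()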
@benoitc I repeated the issue with {error,connect_timeout} just after making about 20 requests.
A little extra information: my application has a background InfluxDB reporter for exometer which also uses hackney as its HTTP client.
So, here is the pool state:
https://gist.github.com/desoulter/4c8fca893c965f2fb57c600936a7c133
And a trace:
https://gist.github.com/desoulter/841bc24d63ea3e225be8be53d65c2197
@benoitc Apologies for the delay. I enabled the trace locally on my dev machine, where only 1 request is being made, and there is quite a lot of stuff printed out, including some that I cannot share if I run this in production (for example: Authorization headers, IP addresses, bodies, the actual URLs and others...).
What trace/debug information do you need specifically from the requests, and how can I make hackney_trace print out only what we need to debug this issue? Worst case, I can try to remove sensitive information from the trace manually before pasting it here, but that would be a pain...
Thanks!
Update: just ran it briefly in production for one batch (the first one) of requests, before reverting the change. I got 9288 lines. Let me know what's useful so I can grab it out of the file. I noticed quite a lot of no socket in the pool messages, often in batches like the one below:
[hackney trace 80 <0.736.0> 2018:05:16 15:49:43 496] no socket in the pool
Content: [{module,hackney_connect},{line,200},{pool,default}]
[hackney trace 80 <0.712.0> 2018:05:16 15:49:43 496] no socket in the pool
Content: [{module,hackney_connect},{line,200},{pool,default}]
[hackney trace 80 <0.733.0> 2018:05:16 15:49:43 496] no socket in the pool
Content: [{module,hackney_connect},{line,200},{pool,default}]
[hackney trace 80 <0.720.0> 2018:05:16 15:49:43 496] no socket in the pool
Content: [{module,hackney_connect},{line,200},{pool,default}]
[hackney trace 80 <0.751.0> 2018:05:16 15:49:43 496] no socket in the pool
Content: [{module,hackney_connect},{line,200},{pool,default}]
[hackney trace 80 <0.758.0> 2018:05:16 15:49:43 496] no socket in the pool
Content: [{module,hackney_connect},{line,200},{pool,default}]
[hackney trace 80 <0.714.0> 2018:05:16 15:49:43 496] no socket in the pool
Content: [{module,hackney_connect},{line,200},{pool,default}]
[hackney trace 80 <0.746.0> 2018:05:16 15:49:43 496] no socket in the pool
Content: [{module,hackney_connect},{line,200},{pool,default}]
[hackney trace 80 <0.730.0> 2018:05:16 15:49:43 496] no socket in the pool
Content: [{module,hackney_connect},{line,200},{pool,default}]
[hackney trace 80 <0.740.0> 2018:05:16 15:49:43 496] no socket in the pool
Content: [{module,hackney_connect},{line,200},{pool,default}]
[hackney trace 80 <0.735.0> 2018:05:16 15:49:43 496] no socket in the pool
Content: [{module,hackney_connect},{line,200},{pool,default}]
[hackney trace 80 <0.725.0> 2018:05:16 15:49:43 496] no socket in the pool
Content: [{module,hackney_connect},{line,200},{pool,default}]
[hackney trace 80 <0.721.0> 2018:05:16 15:49:43 496] no socket in the pool
Content: [{module,hackney_connect},{line,200},{pool,default}]
But I'm not sure if that is standard debug info or an indicator of some sort of problem?
@svileng @desoulter thanks, looking at it. I will revisit that issue sometime tomorrow and provide more feedback.
@svileng this looks standard. If you see anything weird related to the timeout just let me know; anything related to the pool, actually.
@benoitc I see. I'll clean the log file I have and will post it today or tomorrow; you know best what to look for, I'll just clear up the sensitive info.
@benoitc Apologies for the delay; please find the log file here: hackney-log-cleaned.txt
I have removed the following information and replaced it with [FILTERED], in case you're wondering what's going on:
- Host info for requests;
- Auth header values and some response headers;
- Request JSON body.
As I mentioned before, there are 30 distinct hosts, which are various Salesforce instances in different regions; the API calls, however, are all the same (you can notice /services/data/v39.0/composite/batch all over the place).
Let me know if you have any questions or if there is anything I can do to help investigate the issue! Thanks!
@benoitc Have you had a chance to look at the logs from me and @desoulter above? Let us know if there's anything we could do to help investigate the issue. Thanks!
@svileng sorry, I missed the initial notifications. I will have a look today. Sorry for the delay.
This happens on 2 load-balanced web servers once in a while, and when it does, all HTTP client requests to different hosts return :connect_timeout on both web servers. I enabled the hackney trace and got the following:
[hackney trace 80 <9427.17789.8> 2019:05:06 07:23:32 4836] request
Content: [{module,hackney},
          {line,293},
          {method,post},
          {url,{hackney_url,hackney_tcp,http,<<"[HOST]">>,
                <<"[PATH]">>,<<"[PATH]">>,<<>>,
                <<>>,"[HOST]",[PORT],<<>>,<<>>}},
          {headers,[]},
          {body,{form,[{<<"[KEY]">>,<<"[VAL]">>},
                       {<<"[KEY]">>,<<"[VAL]">>},
                       {<<"[KEY]">>,<<"[VAL]">>}]}},
          {options,[]}]
[hackney trace 60 <9427.17789.8> 2019:05:06 07:23:32 4836] request without proxy
Content: [{module,hackney},{line,665}]
[hackney trace 60 <9427.17789.8> 2019:05:06 07:23:32 4836] connect
Content: [{module,hackney_connect},
          {line,33},
          {transport,hackney_tcp},
          {host,"[HOST]"},
          {port,[PORT]},
          {dynamic,true}]
{:error, [reason: :connect_timeout]}
Restarting the process resolves this.
@larryweya sounds like this is a slightly different issue from mine (but maybe related? @benoitc will know), since my :connect_timeout errors actually disappear after a few retries.
For the record, I'm still experiencing this in hackney 1.15.1.