
Hackney checkout_failure that was not happening before 1.16.0

Open • SoniCoder opened this issue • 15 comments

2021-06-14 13:47:38.633 [warning] <0.1002.0>@butler_client_hackney:request:121 HTTP Request unsuccessful: Reason checkout_failure, Retrying after 4000 Request: {"http://192.168.5.107:8080/api-gateway/auth-service/platform-auth/oauth/token",[{"Authorization","Basic ABCD"},{"Cache-Control","no-cache"}],"application/x-www-form-urlencoded","grant_type=client_credentials&client_id=butlerx&client_secret=butlerx"} Method: post

This error goes away when I use: {use_default_pool, false}
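
For reference, the option can be applied globally through the hackney application environment; a minimal sketch, assuming a plain sys.config is used (the exact placement depends on how butler_client_hackney passes its configuration):

%% sys.config: disable hackney's default connection pool
[
 {hackney, [
   {use_default_pool, false}
 ]}
].

Passing {pool, false} in the per-request options list should have the same effect for a single call.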

SoniCoder avatar Jun 14 '21 13:06 SoniCoder

Did you try the latest master?

benoitc avatar Jun 16 '21 20:06 benoitc

Yes, tried.


SoniCoder avatar Jun 17 '21 02:06 SoniCoder

Can you send a trace of the request? Can you share a snippet? Also, how many connections are in the pool?

benoitc avatar Jun 17 '21 08:06 benoitc

We started seeing this too with a connection pool size set to 400. We've been running f87678194 since it was merged into master.

Unfortunately I don't have much useful information to share yet, but if we come up with something useful I will post it here. So far the only interesting observation is that the message queue of the hackney_pool process started growing at a steady linear rate, so something was stopping it from responding to incoming messages. We saw memory usage on the node increase as well, but that might just be the growing message queue.

[Graph: hackney_pool message queue length (y axis) over time (x axis)]
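
For anyone wanting to check this on their own node, the runaway mailbox is easy to spot from a remote shell; a minimal sketch, assuming the recon library is available:

%% list the 5 processes with the largest message queues
1> recon:proc_count(message_queue_len, 5).
%% each entry is {Pid, QueueLen, Info}; a hackney_pool (or, as reported later
%% in this thread, hackney_connections) process near the top of the list is
%% the symptom described here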

The only other vague data point we have is that this might correlate with very slow hosts responding to us. It's not consistent though: it only happens in select cases, and not on all nodes.

If we learn more we will let you know!

Thanks for all your hard work on this library, @benoitc. I know being a sole maintainer is a very hard job; hopefully I will have something helpful to share after we try to reproduce the issue.

mattbaker avatar Jul 07 '21 17:07 mattbaker

I was facing this issue with the default hackney settings

SoniCoder avatar Jul 07 '21 17:07 SoniCoder

Yeah, unfortunately we need the connection pool.

mattbaker avatar Jul 07 '21 20:07 mattbaker

@mattbaker Can you share a trace and a snippet of the code doing the request?

benoitc avatar Jul 07 '21 21:07 benoitc

Yeah, I'll work on finding an example; there's some abstraction that makes a simple copy-paste example hard.

mattbaker avatar Jul 07 '21 21:07 mattbaker

@SoniCoder @mattbaker If you've not already solved this, you may want to check whether you have somehow pinned an older version of unicode_util_compat that does not contain the string_compat module. Unfortunately xref on your top-level project won't find that.

I was running into exactly this issue when upgrading hackney on a project, and after some debugging I discovered a manual include of an older version, which was causing exactly this checkout_failure.

I temporarily added an extra trace point to the try/catch in hackney_pool:checkout/4, and the stack trace showed the missing module.

[hackney trace 60 <0.2606.0> 2021:09:21 15:44:07 4274] checkout failure 
   Content: [{module,hackney_pool},
             {line,71},
             {error,undef,
                 [{string_compat,to_lower,["google.com"],[]},
                  {hackney_connection,new,1,
                      [{file,
                           "....hackney_connection.erl"},
                       {line,32}]},
                  {hackney_pool,do_checkout,5,
                      [{file,
                           "....hackney_pool.erl"},
                       {line,97}]},
                  {hackney_pool,'-checkout/4-fun-0-',6,
                      [{file,
                           "....hackney_pool.erl"},
                       {line,69}]}]}]
[hackney trace 80 <0.2551.0> 2021:09:21 15:44:07 4274] connect error ~p 
   Content: [{module,hackney_connect},{line,233},{error,checkout_failure}]
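
If you want to confirm the same dependency problem on your own node, a quick check from a remote shell (string_compat is the module shown as undef in the trace above):

1> code:which(string_compat).
%% non_existing means the unicode_util_compat version on the code path is too
%% old for this hackney release; a path to a .beam file means it is present
2> application:get_key(unicode_util_compat, vsn).
%% returns {ok, Vsn} when the application is loaded, undefined otherwise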

leonardb avatar Sep 21 '21 15:09 leonardb

I don't :/ Thanks though! We ended up switching to Finch

mattbaker avatar Sep 21 '21 16:09 mattbaker

We also saw checkout_failure when traffic goes through a proxy.

mjaow avatar Nov 11 '21 11:11 mjaow

Just bumping this up. I started encountering this after an Elixir upgrade. Currently running on 1.10.4, with httpoison 1.7.0 and hackney 1.17.4. I found some vague suggestions online and tried downgrading/upgrading versions, but could not get it to work. P.S. To clarify: I'm using httpoison, but I would assume the error bubbles up from hackney.

wizardone avatar Jan 18 '22 20:01 wizardone

Seeing this now after upgrading to OTP 24.3. Maybe related to https://github.com/erlang/otp/issues/5783?

mruoss avatar Mar 13 '22 19:03 mruoss

I have been looking into this for the last couple weeks for a project at work. We are using 1.18.1, which is currently the latest release on hex.

The issue seems to be related to the hackney_connections gen server. I have measured the mailbox size (erlang:process_info(Pid, message_queue_len)) being over 500k messages. Killing the process and having it restart seems to temporarily "fix" the issue until it comes back again.

I can start a new ticket if needed, but can you @benoitc describe what exactly this process is meant for? I can't seem to find any reference to the id elsewhere in the code other than creating and setting it in the #connection record. Is this something planned for later? As it stands, this is happening before the pool gets involved, so this genserver is acting as a bottleneck for every request that is attempting to go through hackney.

We are compiling against OTP 25.1.2.

I would be more than happy to run any tests in order to help find and fix this issue. The problem is that I do not know how to trigger the issue.

ankhers avatar Oct 26 '22 15:10 ankhers

I believe we fixed our issue.

We had a function that looked something like this.

foo(A, B) ->
  case parse_string(A, B) of
    {ok, Val} -> Val;
    {error, {_, reason}} ->
      hackney:request(...),
      parse_string(?DEFAULT_VALUE, B)
  end.

Essentially, when the function returned the error variant, we were not matching on the actual error we cared about for the retry. So when we received a different error, we ended up calling this function endlessly, and because of that we were sending an unbounded stream of requests to hackney. Once multiple processes were calling this function, that filled up the hackney_connections mailbox.
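
In other words, the fix was to match only on the specific error that warrants the fallback and let everything else surface instead of looping. Roughly, in terms of the hypothetical sketch above (the error atom is a placeholder):

foo(A, B) ->
  case parse_string(A, B) of
    {ok, Val} -> Val;
    {error, {_, the_error_we_retry_on}} ->
      %% only this specific error falls back to the default value
      hackney:request(...),
      parse_string(?DEFAULT_VALUE, B);
    {error, _} = Other ->
      %% anything else is returned to the caller instead of triggering
      %% another round of requests
      Other
  end.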


For anyone that may be having this issue, the following is how I was able to find out where the issue was coming from.

To start, I needed to crawl through hackney's code to figure out what was actually producing the checkout_failure. In my case, I found it was because the hackney_connections gen server was timing out. After that, I took a look at that process's mailbox using the following code in a remote shell on the node that was having issues.

1> Pid = erlang:whereis(hackney_connections).
2> erlang:process_info(Pid, message_queue_len).

Here I found out the process had over 500k messages in the mailbox. So I took a look into the mailbox to see what those messages were.

3> {messages, Messages} = erlang:process_info(Pid, messages).

That dumped out the mailbox contents at that particular point in time. I then used something like the following (this may not be exact; I don't have the code on my machine):

4> Domains = lists:foldl(
     fun({'$gen_call', _, {lookup, Domain, _, _}}, Acc) ->
             Count = maps:get(Domain, Acc, 0),
             maps:put(Domain, Count + 1, Acc);
        (_Other, Acc) ->
             Acc
     end, #{}, Messages).

This gave me a map keyed by domain, where each value is the number of requests sitting in the mailbox for that domain.

At this point I had an idea of where the requests were going.

So I wrote a little gen server that constantly checked the hackney_connections message queue length, and when it went over 1000 messages it triggered recon_trace:calls({hackney, request, return_trace}, 100). That showed me what was being sent to hackney:request, and it was fairly trivial to find where in our code those calls came from, which is where we noticed the recursive function mentioned above.
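
The watchdog itself does not need to be fancy; a minimal sketch of such a monitor (the module name, threshold, and the action taken are all arbitrary; trigger whatever tracing tool you prefer at the marked point):

-module(hackney_queue_watch).
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(THRESHOLD, 1000). %% backlog size that triggers action
-define(INTERVAL, 1000).  %% poll once a second

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    erlang:send_after(?INTERVAL, self(), poll),
    {ok, #{}}.

handle_call(_Msg, _From, State) -> {reply, ok, State}.

handle_cast(_Msg, State) -> {noreply, State}.

handle_info(poll, State) ->
    case erlang:whereis(hackney_connections) of
        undefined ->
            ok;
        Pid ->
            case erlang:process_info(Pid, message_queue_len) of
                {message_queue_len, Len} when Len > ?THRESHOLD ->
                    %% backlog is growing: this is where a bounded trace of
                    %% hackney:request (e.g. via recon_trace, as described
                    %% above) can be started to see who is flooding the
                    %% connection lookup process
                    logger:warning("hackney_connections backlog: ~p", [Len]);
                _ ->
                    ok
            end
    end,
    erlang:send_after(?INTERVAL, self(), poll),
    {noreply, State};
handle_info(_Msg, State) ->
    {noreply, State}.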

I hope this can help someone else find issues in the future.

ankhers avatar Oct 28 '22 13:10 ankhers