hackney unhandled `{ssl_closed, _}` message when using `async`

I see the following message in my Elixir logs periodically:

Unexpected message: {:ssl_closed, {:sslsocket, {:gen_tcp, #Port<0.9056>, :tls_connection, :undefined}, #PID<0.10027.10>}}

That looks like a close message from the SSL socket, which I'd expect hackney to manage the same as a close of the TCP connection (which are handled properly). I'm using the async option to stream the response data as it comes in.

Jan 15 '18 14:01 paulswartz

I can reproduce this - I'm also using the async option with streaming. It seems that my chunk gets mixed in with it, as the last item in the tuple.

Unexpected message: {:ssl, {:sslsocket, {:gen_tcp, #Port<0.8780>, :tls_connection, :undefined}, #PID<0.380.0>}, <<250, 148, 222, 99, 116, 39, 82, 167, 182, 77, 245, 137, 189, 33, 60, 168, 182, 240, 30, 55, 235, 210, 208, 103, 217, 48, 245, 8, 29, 92, 27, 76, 134, 173, 101, 105, 228, 74, 31, 239, 5, 67, 94, 174, 203, 215, 11, 155, ...>>}

Update/Edit: When I run the streaming process in an Elixir Task, the message doesn't occur. Which makes me think there's a mailbox issue?

Only occurs in our deployment environment, (cloudfoundry, ubuntu trusty). Locally on OSX 10.12.5, there's no issue.

Using hackney 1.11

code snippet:

   # options for request - [async: :once, stream_to: self()]

    defp handle_async(%{status: status} = conn, ref) do
        :ok = :hackney.stream_next(ref)

        receive do
            {:hackney_response, ^ref, {:status, code, _reason}} ->
                conn
                |> Plug.Conn.put_status(code)
                |> handle_async(ref)

            {:hackney_response, ^ref, {:headers, headers, }} ->
                downcased_headers =
                    headers
                    |> Stream.map(fn {k,v} -> {String.downcase(k), v} end)
                    |> Enum.filter(fn {k, _v} -> k !== "cache-control" end)

                conn
                |> Plug.Conn.merge_resp_headers(downcased_headers)
                |> Plug.Conn.delete_resp_header("content-length") #RFC spec dictates no content-length on chunked requests
                |> Plug.Conn.put_resp_header("connection", "close")
                |> Plug.Conn.send_chunked(status)
                |> handle_async(ref)

            {:hackney_response, ^ref, :done} ->
                conn

            {:hackney_response, ^ref, chunk} ->
                case Plug.Conn.chunk(conn, chunk) do
                    {:ok, new_conn} ->
                        handle_async(new_conn, ref)

                    {:error, :closed} ->
                        Logger.info("Client canceled request.")
                        :hackney.close(ref)
                        conn

                    {:error, reason } ->
                        Logger.error("Unable to complete request: #{inspect(reason)}")
                        :hackney.close(ref)
                        conn
                end

            message ->
                Logger.warn("Hackney.handle_async got an unexpected message: #{inspect(message)}")
                handle_async(conn, ref)
       end
    end

Jan 23 '18 17:01 matthewoden

hrm it's probably happening on close. @paulswartz is your code similar to the one posted by @matthewoden ?

Jan 23 '18 20:01 benoitc

I'm using hackney via httpoison, but here's the relevant code: https://github.com/edgurgel/httpoison/blob/master/lib/httpoison/base.ex#L369-L387

Jan 30 '18 14:01 paulswartz

If it matters, I've done a quick port on my code over to :ibrowse, and the issue hasn't come up again. The receive block is basically the same (though ibrowse combines status and headers into a single message).

I'd prefer to use hackney, so just let me know if you need any other usage info.

Jan 30 '18 15:01 matthewoden

well the issue is that a message is still on the process heap after cloising the socket. I have a fix for it. Will take care of it later today. Thanks for the code!

Jan 30 '18 15:01 benoitc

can you try #482?

Feb 27 '18 22:02 benoitc

can anyone provides me a link on which I can reproduce the issue?

Mar 01 '18 15:03 benoitc

I can't provide our source, but I'll see if I can get #482 out in a test environment today or tomorrow. (Edit: I've never been able to reproduce it locally, even after dockerizing the thing.)

Mar 01 '18 15:03 matthewoden

This https://github.com/stephenmoloney/fastimage/pull/24 reproduces it everytime - but on travis only.

Like @matthewoden, I also can't reproduce it locally.... probably something to do with loopbacks? My tcp knowledge is not good enough...

This is in stdout on travis container startup...

127.0.0.1	localhost
::1	 ip6-localhost ip6-loopback
fe00::0	ip6-localnet
ff00::0	ip6-mcastprefix
ff02::1	ip6-allnodes
ff02::2	ip6-allrouters

Mar 01 '18 20:03 stephenmoloney

@stephenmoloney do you mean the link? I need to reporduce it apart of the another lib ...

Mar 01 '18 20:03 benoitc

@benoitc sorry, I don't full understand... When the tests run, that error is printed...

https://travis-ci.org/stephenmoloney/fastimage/jobs/346210674

I still didn't get time to try the tracing.

Mar 01 '18 22:03 stephenmoloney

@benoitc Take your example of async receiving in the documentation, convert to require stream_next, and put a sleep of a second or more between requesting stream_next. This is how I have done it via elixir with only hackney. There is something possibly a race condition of receiving the last of the web request and the receiving a message to stream_next. The stream has fully closed, but the buffer has not been flushed out and the done message sent.

Mar 06 '18 20:03 critch

@critch thanks. Quite possible indeed. I think that rather than pausing i should rather receive actively at some point. I will have a closer looks, thanks for the reproducible way.

Mar 06 '18 21:03 benoitc

fixed in latest master.

Apr 03 '18 12:04 benoitc

I still see this happening with hackney 1.12.1. It definitely seems to be a race condition, as it happens sometimes.

One thing I don't understand from d4c4f3930d633a675a4224eb76149646c1e6cd74 is that the changes all seem to be around closing the socket, but I see this error happen long before the socket has closed - just after the response stream starts.

I'm going to keep looking into it.

Jun 11 '18 20:06 rozap

I've got two traces, one where it works and one where it crashes.

This is part of the working trace that differs from the broken one.

13:20:52:826799 (<0.931.0>) << {#Ref<0.3355659064.3830185990.130707>,stream_next}
13:20:52:826812 (<0.931.0>) << {#Ref<0.3355659064.3830185990.130707>,stream_next}
13:20:52:826826 (<0.931.0>) <0.929.0> ! {'$gen_call',
                                            {<0.931.0>,#Ref<0.3355659064.3830185990.130819>},
                                            {set_opts,[{active,once}]}}
13:20:52:826862 (<0.931.0>) << {#Ref<0.3355659064.3830185990.130819>,ok}
13:20:52:826869 (<0.931.0>) <0.929.0> ! {'$gen_call',
                                            {<0.931.0>,#Ref<0.3355659064.3830185990.130820>},
                                            {set_opts,[{active,once}]}}
13:20:52:826901 (<0.931.0>) << {#Ref<0.3355659064.3830185990.130820>,ok}
13:20:52:827039 (<0.931.0>) << {ssl,{sslsocket,{gen_tcp,#Port<0.27730>,tls_connection,undefined},
                                               <0.929.0>},

This is the broken one

13:06:37:122289 (<0.1064.0>) << {#Ref<0.280439725.1680343042.127415>,stream_next}
13:06:37:122300 (<0.1064.0>) << {#Ref<0.280439725.1680343042.127415>,stream_next}
13:06:37:122332 (<0.1064.0>) <0.1062.0> ! {'$gen_call',
                                              {<0.1064.0>,#Ref<0.280439725.1680343041.134667>},
                                              {set_opts,[{active,once}]}}
13:06:37:122368 (<0.1064.0>) << {#Ref<0.280439725.1680343041.134667>,ok}
13:06:37:122547 (<0.1064.0>) <0.1062.0> ! {'$gen_call',
                                              {<0.1064.0>,#Ref<0.280439725.1680343041.134668>},
                                              {set_opts,[{active,once}]}}
13:06:37:122562 (<0.1064.0>) << {ssl,{sslsocket,{gen_tcp,#Port<0.28349>,tls_connection,undefined},
                                                <0.1062.0>},
                                     <<"datadatadata">>}
13:06:37:122691 (<0.1064.0>) << {ssl,{sslsocket,{gen_tcp,#Port<0.28349>,tls_connection,undefined},
                                                <0.1062.0>},
                                     <<"datadatadata">>}
13:06:37:122696 (<0.1064.0>) << {#Ref<0.280439725.1680343041.134668>,ok}
13:06:37:122743 (<0.1064.0>) <0.635.0> ! {hackney_response,#Ref<0.280439725.1680343042.127415>,
                                             <<"datadatadata">>}
13:06:37:122753 (<0.1064.0>) error_logger ! {notify,
                                             {error,<0.324.0>,
                                              {<0.1064.0>,"Unexpected message: ~w~n",
                                               [{ssl,
                                                 {sslsocket,
                                                  {gen_tcp,#Port<0.28349>,tls_connection,
                                                   undefined},
                                                  <0.1062.0>},
                                                 <<"datadatadata"}]}}}

If you look at them side by side, we see the working one has

13:20:52:826869 (<0.931.0>) <0.929.0> ! {'$gen_call',
                                            {<0.931.0>,#Ref<0.3355659064.3830185990.130820>},
                                            {set_opts,[{active,once}]}}
13:20:52:826901 (<0.931.0>) << {#Ref<0.3355659064.3830185990.130820>,ok}

But the broken one has the ok response from setting the transport opts come after the first chunks of data. Its first message after setting transport_opts is actual data from the socket.

So I think the Transport:setopts(TSock, [{active, once}]), line in hackney_stream.erl or Transport:setopts(Socket, [{active, once}, {packet, 0}]), in hackney_pool.erl is not setting the socket up in time for the first chunk to get there? That was the only different between the traces that I found.

Jun 11 '18 21:06 rozap

Just going to leave a comment here if anyone else encounters this behavior: the reason for this seemed to be calling stream_next twice in a row without waiting for an ack after the socket was set up. Removing the duplicate call I had seems to have fixed it.

Jun 12 '18 04:06 rozap

@rozap you mean 2 consecutive call without receiving the message? can you show a simple snippet for it?

just fyi right after the next release i will revisit the stream code to make it simpler to handle, but if we can fix that issue firsy/

Jun 12 '18 08:06 benoitc

I mean in my app code, I was accidentally calling stream_next twice without receiving a message, yes. A snippet probably won't be super useful, because there is some complexity about my app mixed in there :)

Jun 12 '18 16:06 rozap

Any updates on this as I'm still seeing this behavior in version 1.14.3?

Dec 18 '18 19:12 billgloff

@billgloff how does it happen? Can you open a ticket with more details if you can?

Dec 19 '18 13:12 benoitc

I'm also seeing this again; it appears to have started in earnest after updating to Erlang 21.2 even without using async. I don't have more information at the moment but I'm investigating on my end.

Hackney: 1.14.3 HTTPoison: 1.4 and 1.5

Things I have unsuccessfully tried:

forcing "tls1.2" as the SSL version
going back to the {active, 1} configuration that was the default pre-21.2

Dec 21 '18 13:12 paulswartz

We've also started seeing this error in our logs after an update to Erlang 21.2

{:ssl_closed, {:sslsocket, {:gen_tcp, #Port<0.1490365>, :tls_connection, :undefined}, [#PID<0.21817.64>, #PID<0.21816.64>]}}

Could this be related to the change in the ssl app behavior from active once to "active N" for TLS connections?

Jan 04 '19 23:01 dbhobbs

It does seem related to that. There's a PR which may address it: https://bugs.erlang.org/browse/ERL-371

Jan 04 '19 23:01 paulswartz

@dbhobbs we are experiencing the same issue, did you find a work around? I can also confirm this is specific to 21.2, 21.1 does not have this issues.

@paulswartz that change, unfortunately, did not resolve the error. I have tested 21.2.5, 21.2.4, 21.2.3, and 21.2.2, all of which encounter this error.

I'm curious if anyone else has a fix or workaround?

Feb 05 '19 02:02 binarylogic

@binarylogic We haven't found a fix/workaround. Our application fetches very frequently (every few seconds) so having some prematurely closed connections is manageable.

Feb 06 '19 14:02 paulswartz

@binarylogic we have not found a workaround. Fortunately for us the error occurs in our pubsub connection, so we just receive the message again after it fails. I've been waiting for a chance to update to 21.2.5 to see if that mitigates the issue at all.

Feb 07 '19 04:02 dbhobbs

@binarylogic do you have a simple way to reproduce it? maybe a simple snippet showing how the async request is handled is enough

Feb 08 '19 08:02 benoitc

Hi @benoitc! I'm also having this issue. I think it's time to reopen it :wink: Anyway, here's the snippet that reproduces the problem (disclaimer - I don't know Erlang, so it's in Elixir):

iex(1)> url = "https://www.openssl.org/source/openssl-1.1.1a.tar.gz"
iex(2)> {:ok, resp} = :hackney.get(url, [], "", async: :once)
{:ok, #Reference<0.3880280478.3613130753.31696>}
iex(3)> :hackney.stream_next(resp)
:ok
iex(4)> :hackney.stream_next(resp)
:ok
iex(5)> :hackney.stream_next(resp)
:ok
iex(6)> flush
{:hackney_response, #Reference<0.3880280478.3613130753.31696>,
 {:status, 200, "OK"}}
{:hackney_response, #Reference<0.3880280478.3613130753.31696>,
 {:headers,
  [
    {"Server", "Apache/2.4.18 (Ubuntu)"},
    {"Strict-Transport-Security",
     "max-age=31536000; includeSubDomains; preload"},
    {"Last-Modified", "Tue, 20 Nov 2018 14:07:08 GMT"},
    {"ETag", "\"7f6b53-57b192634f900\""},
    {"Accept-Ranges", "bytes"},
    {"Content-Length", "8350547"},
    {"Content-Type", "application/x-gzip"},
    {"Cache-Control", "max-age=165527"},
    {"Expires", "Thu, 14 Feb 2019 09:33:40 GMT"},
    {"Date", "Tue, 12 Feb 2019 11:34:53 GMT"},
    {"Connection", "keep-alive"}
  ]}}
{:hackney_response, #Reference<0.3880280478.3613130753.31696>,
 <<31, 139, 8, 0, 174, 13, 244, 91, 2, 3, 236, 60, 127, 119, 218, 198, 178, 249,
   183, 124, 138, 45, 166, 137, 192, 24, 16, 254, 153, 56, 164, 37, 14, 73, 124,
   74, 192, 199, 198, 77, 251, 156, 148, 35, 164, 5, 84, ...>>}
{:hackney_response, #Reference<0.3880280478.3613130753.31696>,
 <<147, 53, 108, 2, 227, 253, 48, 30, 63, 212, 126, 235, 87, 254, 121, 152, 74,
   51, 93, 253, 58, 46, 13, 49, 217, 130, 184, 152, 68, 224, 225, 72, 208, 120,
   188, 4, 180, 85, 245, 47, 61, 115, 63, 121, 168, 218, 190, ...>>}
:ok

# Now wait for up to 10 minutes

12:43:14.870 [error] Unexpected message: {:ssl_closed, {:sslsocket, {:gen_tcp,
 #Port<0.1468>,tls_connection, :undefined}, [#PID<0.1063.0>, #PID<0.1062.0>]}}
iex(7)> :hackney.stream_next(resp)                       
{:error, :req_not_found}

This is Elixir 1.8 and OTP 21.2.3. I've managed to trace this message, it is printed from here: https://github.com/benoitc/hackney/blob/2f55c0d37f9d5942b976272a1df5adecca4edc61/src/hackney_stream.erl#L152

There was a fix in OTP 21.2.3 (https://bugs.erlang.org/browse/ERL-371) in ssl application that prevents sending the :ssl_closed message before all the data was sent, that's why there are stream_next calls in the snippet above - the next message was supposed to be {:hackney_response, _, :done} and since 21.2.3 only then is the :ssl_closed message received. Before that version, the response could be closed before getting all the data.

Feb 12 '19 12:02 bblaszkow06

One more detail: there's no need to wait 10mins for :ssl_closed if there's no {"Connection", "keep-alive"} header.

url = "https://download.fedoraproject.org/pub/fedora/linux/releases/29/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-29-1.2.iso"
"https://download.fedoraproject.org/pub/fedora/linux/releases/29/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-29-1.2.iso"
iex(34)> {:ok, resp} = :hackney.get(url, [], "", async: :once)
{:ok, #Reference<0.1509818611.3649830913.19500>}
iex(35)> :hackney.stream_next(resp)
:ok
iex(36)> flush
{:hackney_response, #Reference<0.1509818611.3649830913.19749>,
 {:status, 302, "Found"}}
{:hackney_response, #Reference<0.1509818611.3649830913.19749>,
 {:headers,
  [
    {"Date", "Tue, 12 Feb 2019 15:14:10 GMT"},
    {"Server", "Apache/2.4.37 (Fedora) mod_wsgi/4.6.4 Python/2.7"},
    {"X-Frame-Options", "SAMEORIGIN"},
    {"X-Xss-Protection", "1; mode=block"},
    {"X-Content-Type-Options", "nosniff"},
    {"Referrer-Policy", "same-origin"},
    {"Content-Length", "0"},
    {"Location",
     "http://ftp.man.poznan.pl/pub/linux/fedora/releases/29/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-29-1.2.iso"},
    {"Content-Type", "text/html; charset=UTF-8"},
    {"AppTime", "D=8655"},
    {"AppServer", "proxy04.fedoraproject.org"}
  ]}}
:ok

# 10-15 seconds later

16:08:12.738 [error] Unexpected message: {:ssl_closed, {:sslsocket, {:gen_tcp, #Port<0.24>, 
:tls_connection, :undefined}, [#PID<0.319.0>, #PID<0.318.0>]}}

Feb 12 '19 15:02 bblaszkow06

hackney hackney copied to clipboard

unhandled `{ssl_closed, _}` message when using `async`

hackney
hackney copied to clipboard