hackney
hackney copied to clipboard
unhandled `{ssl_closed, _}` message when using `async`
I see the following message in my Elixir logs periodically:
Unexpected message: {:ssl_closed, {:sslsocket, {:gen_tcp, #Port<0.9056>, :tls_connection, :undefined}, #PID<0.10027.10>}}
That looks like a close message from the SSL socket, which I'd expect hackney to manage the same as a close of the TCP connection (which are handled properly). I'm using the async
option to stream the response data as it comes in.
I can reproduce this - I'm also using the async option with streaming. It seems that my chunk gets mixed in with it, as the last item in the tuple.
Unexpected message: {:ssl, {:sslsocket, {:gen_tcp, #Port<0.8780>, :tls_connection, :undefined}, #PID<0.380.0>}, <<250, 148, 222, 99, 116, 39, 82, 167, 182, 77, 245, 137, 189, 33, 60, 168, 182, 240, 30, 55, 235, 210, 208, 103, 217, 48, 245, 8, 29, 92, 27, 76, 134, 173, 101, 105, 228, 74, 31, 239, 5, 67, 94, 174, 203, 215, 11, 155, ...>>}
Update/Edit: When I run the streaming process in an Elixir Task, the message doesn't occur. Which makes me think there's a mailbox issue?
Only occurs in our deployment environment, (cloudfoundry, ubuntu trusty). Locally on OSX 10.12.5, there's no issue.
Using hackney 1.11
code snippet:
# options for request - [async: :once, stream_to: self()]
defp handle_async(%{status: status} = conn, ref) do
:ok = :hackney.stream_next(ref)
receive do
{:hackney_response, ^ref, {:status, code, _reason}} ->
conn
|> Plug.Conn.put_status(code)
|> handle_async(ref)
{:hackney_response, ^ref, {:headers, headers, }} ->
downcased_headers =
headers
|> Stream.map(fn {k,v} -> {String.downcase(k), v} end)
|> Enum.filter(fn {k, _v} -> k !== "cache-control" end)
conn
|> Plug.Conn.merge_resp_headers(downcased_headers)
|> Plug.Conn.delete_resp_header("content-length") #RFC spec dictates no content-length on chunked requests
|> Plug.Conn.put_resp_header("connection", "close")
|> Plug.Conn.send_chunked(status)
|> handle_async(ref)
{:hackney_response, ^ref, :done} ->
conn
{:hackney_response, ^ref, chunk} ->
case Plug.Conn.chunk(conn, chunk) do
{:ok, new_conn} ->
handle_async(new_conn, ref)
{:error, :closed} ->
Logger.info("Client canceled request.")
:hackney.close(ref)
conn
{:error, reason } ->
Logger.error("Unable to complete request: #{inspect(reason)}")
:hackney.close(ref)
conn
end
message ->
Logger.warn("Hackney.handle_async got an unexpected message: #{inspect(message)}")
handle_async(conn, ref)
end
end
hrm it's probably happening on close. @paulswartz is your code similar to the one posted by @matthewoden ?
I'm using hackney via httpoison, but here's the relevant code: https://github.com/edgurgel/httpoison/blob/master/lib/httpoison/base.ex#L369-L387
If it matters, I've done a quick port on my code over to :ibrowse
, and the issue hasn't come up again. The receive block is basically the same (though ibrowse combines status and headers into a single message).
I'd prefer to use hackney, so just let me know if you need any other usage info.
well the issue is that a message is still on the process heap after cloising the socket. I have a fix for it. Will take care of it later today. Thanks for the code!
can you try #482?
can anyone provides me a link on which I can reproduce the issue?
I can't provide our source, but I'll see if I can get #482 out in a test environment today or tomorrow. (Edit: I've never been able to reproduce it locally, even after dockerizing the thing.)
This https://github.com/stephenmoloney/fastimage/pull/24
reproduces it everytime - but on travis only.
Like @matthewoden, I also can't reproduce it locally.... probably something to do with loopbacks? My tcp knowledge is not good enough...
This is in stdout on travis container startup...
127.0.0.1 localhost
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
@stephenmoloney do you mean the link? I need to reporduce it apart of the another lib ...
@benoitc sorry, I don't full understand... When the tests run, that error is printed...
https://travis-ci.org/stephenmoloney/fastimage/jobs/346210674
I still didn't get time to try the tracing.
@benoitc Take your example of async receiving in the documentation, convert to require stream_next, and put a sleep of a second or more between requesting stream_next. This is how I have done it via elixir with only hackney. There is something possibly a race condition of receiving the last of the web request and the receiving a message to stream_next. The stream has fully closed, but the buffer has not been flushed out and the done message sent.
@critch thanks. Quite possible indeed. I think that rather than pausing i should rather receive actively at some point. I will have a closer looks, thanks for the reproducible way.
fixed in latest master.
I still see this happening with hackney 1.12.1
. It definitely seems to be a race condition, as it happens sometimes.
One thing I don't understand from d4c4f3930d633a675a4224eb76149646c1e6cd74
is that the changes all seem to be around closing the socket, but I see this error happen long before the socket has closed - just after the response stream starts.
I'm going to keep looking into it.
I've got two traces, one where it works and one where it crashes.
This is part of the working trace that differs from the broken one.
13:20:52:826799 (<0.931.0>) << {#Ref<0.3355659064.3830185990.130707>,stream_next}
13:20:52:826812 (<0.931.0>) << {#Ref<0.3355659064.3830185990.130707>,stream_next}
13:20:52:826826 (<0.931.0>) <0.929.0> ! {'$gen_call',
{<0.931.0>,#Ref<0.3355659064.3830185990.130819>},
{set_opts,[{active,once}]}}
13:20:52:826862 (<0.931.0>) << {#Ref<0.3355659064.3830185990.130819>,ok}
13:20:52:826869 (<0.931.0>) <0.929.0> ! {'$gen_call',
{<0.931.0>,#Ref<0.3355659064.3830185990.130820>},
{set_opts,[{active,once}]}}
13:20:52:826901 (<0.931.0>) << {#Ref<0.3355659064.3830185990.130820>,ok}
13:20:52:827039 (<0.931.0>) << {ssl,{sslsocket,{gen_tcp,#Port<0.27730>,tls_connection,undefined},
<0.929.0>},
This is the broken one
13:06:37:122289 (<0.1064.0>) << {#Ref<0.280439725.1680343042.127415>,stream_next}
13:06:37:122300 (<0.1064.0>) << {#Ref<0.280439725.1680343042.127415>,stream_next}
13:06:37:122332 (<0.1064.0>) <0.1062.0> ! {'$gen_call',
{<0.1064.0>,#Ref<0.280439725.1680343041.134667>},
{set_opts,[{active,once}]}}
13:06:37:122368 (<0.1064.0>) << {#Ref<0.280439725.1680343041.134667>,ok}
13:06:37:122547 (<0.1064.0>) <0.1062.0> ! {'$gen_call',
{<0.1064.0>,#Ref<0.280439725.1680343041.134668>},
{set_opts,[{active,once}]}}
13:06:37:122562 (<0.1064.0>) << {ssl,{sslsocket,{gen_tcp,#Port<0.28349>,tls_connection,undefined},
<0.1062.0>},
<<"datadatadata">>}
13:06:37:122691 (<0.1064.0>) << {ssl,{sslsocket,{gen_tcp,#Port<0.28349>,tls_connection,undefined},
<0.1062.0>},
<<"datadatadata">>}
13:06:37:122696 (<0.1064.0>) << {#Ref<0.280439725.1680343041.134668>,ok}
13:06:37:122743 (<0.1064.0>) <0.635.0> ! {hackney_response,#Ref<0.280439725.1680343042.127415>,
<<"datadatadata">>}
13:06:37:122753 (<0.1064.0>) error_logger ! {notify,
{error,<0.324.0>,
{<0.1064.0>,"Unexpected message: ~w~n",
[{ssl,
{sslsocket,
{gen_tcp,#Port<0.28349>,tls_connection,
undefined},
<0.1062.0>},
<<"datadatadata"}]}}}
If you look at them side by side, we see the working one has
13:20:52:826869 (<0.931.0>) <0.929.0> ! {'$gen_call',
{<0.931.0>,#Ref<0.3355659064.3830185990.130820>},
{set_opts,[{active,once}]}}
13:20:52:826901 (<0.931.0>) << {#Ref<0.3355659064.3830185990.130820>,ok}
But the broken one has the ok
response from setting the transport opts come after the first chunks of data. Its first message after setting transport_opts is actual data from the socket.
So I think the Transport:setopts(TSock, [{active, once}]),
line in hackney_stream.erl
or Transport:setopts(Socket, [{active, once}, {packet, 0}]),
in hackney_pool.erl
is not setting the socket up in time for the first chunk to get there? That was the only different between the traces that I found.
Just going to leave a comment here if anyone else encounters this behavior: the reason for this seemed to be calling stream_next
twice in a row without waiting for an ack after the socket was set up. Removing the duplicate call I had seems to have fixed it.
@rozap you mean 2 consecutive call without receiving the message? can you show a simple snippet for it?
just fyi right after the next release i will revisit the stream code to make it simpler to handle, but if we can fix that issue firsy/
I mean in my app code, I was accidentally calling stream_next twice without receiving a message, yes. A snippet probably won't be super useful, because there is some complexity about my app mixed in there :)
Any updates on this as I'm still seeing this behavior in version 1.14.3?
@billgloff how does it happen? Can you open a ticket with more details if you can?
I'm also seeing this again; it appears to have started in earnest after updating to Erlang 21.2 even without using async
. I don't have more information at the moment but I'm investigating on my end.
Hackney: 1.14.3 HTTPoison: 1.4 and 1.5
Things I have unsuccessfully tried:
- forcing
"tls1.2"
as the SSL version - going back to the
{active, 1}
configuration that was the default pre-21.2
We've also started seeing this error in our logs after an update to Erlang 21.2
{:ssl_closed, {:sslsocket, {:gen_tcp, #Port<0.1490365>, :tls_connection, :undefined}, [#PID<0.21817.64>, #PID<0.21816.64>]}}
Could this be related to the change in the ssl
app behavior from active once to "active N" for TLS connections?
It does seem related to that. There's a PR which may address it: https://bugs.erlang.org/browse/ERL-371
@dbhobbs we are experiencing the same issue, did you find a work around? I can also confirm this is specific to 21.2
, 21.1
does not have this issues.
@paulswartz that change, unfortunately, did not resolve the error. I have tested 21.2.5
, 21.2.4
, 21.2.3
, and 21.2.2
, all of which encounter this error.
I'm curious if anyone else has a fix or workaround?
@binarylogic We haven't found a fix/workaround. Our application fetches very frequently (every few seconds) so having some prematurely closed connections is manageable.
@binarylogic we have not found a workaround. Fortunately for us the error occurs in our pubsub connection, so we just receive the message again after it fails. I've been waiting for a chance to update to 21.2.5 to see if that mitigates the issue at all.
@binarylogic do you have a simple way to reproduce it? maybe a simple snippet showing how the async request is handled is enough
Hi @benoitc! I'm also having this issue. I think it's time to reopen it :wink: Anyway, here's the snippet that reproduces the problem (disclaimer - I don't know Erlang, so it's in Elixir):
iex(1)> url = "https://www.openssl.org/source/openssl-1.1.1a.tar.gz"
iex(2)> {:ok, resp} = :hackney.get(url, [], "", async: :once)
{:ok, #Reference<0.3880280478.3613130753.31696>}
iex(3)> :hackney.stream_next(resp)
:ok
iex(4)> :hackney.stream_next(resp)
:ok
iex(5)> :hackney.stream_next(resp)
:ok
iex(6)> flush
{:hackney_response, #Reference<0.3880280478.3613130753.31696>,
{:status, 200, "OK"}}
{:hackney_response, #Reference<0.3880280478.3613130753.31696>,
{:headers,
[
{"Server", "Apache/2.4.18 (Ubuntu)"},
{"Strict-Transport-Security",
"max-age=31536000; includeSubDomains; preload"},
{"Last-Modified", "Tue, 20 Nov 2018 14:07:08 GMT"},
{"ETag", "\"7f6b53-57b192634f900\""},
{"Accept-Ranges", "bytes"},
{"Content-Length", "8350547"},
{"Content-Type", "application/x-gzip"},
{"Cache-Control", "max-age=165527"},
{"Expires", "Thu, 14 Feb 2019 09:33:40 GMT"},
{"Date", "Tue, 12 Feb 2019 11:34:53 GMT"},
{"Connection", "keep-alive"}
]}}
{:hackney_response, #Reference<0.3880280478.3613130753.31696>,
<<31, 139, 8, 0, 174, 13, 244, 91, 2, 3, 236, 60, 127, 119, 218, 198, 178, 249,
183, 124, 138, 45, 166, 137, 192, 24, 16, 254, 153, 56, 164, 37, 14, 73, 124,
74, 192, 199, 198, 77, 251, 156, 148, 35, 164, 5, 84, ...>>}
{:hackney_response, #Reference<0.3880280478.3613130753.31696>,
<<147, 53, 108, 2, 227, 253, 48, 30, 63, 212, 126, 235, 87, 254, 121, 152, 74,
51, 93, 253, 58, 46, 13, 49, 217, 130, 184, 152, 68, 224, 225, 72, 208, 120,
188, 4, 180, 85, 245, 47, 61, 115, 63, 121, 168, 218, 190, ...>>}
:ok
# Now wait for up to 10 minutes
12:43:14.870 [error] Unexpected message: {:ssl_closed, {:sslsocket, {:gen_tcp,
#Port<0.1468>,tls_connection, :undefined}, [#PID<0.1063.0>, #PID<0.1062.0>]}}
iex(7)> :hackney.stream_next(resp)
{:error, :req_not_found}
This is Elixir 1.8 and OTP 21.2.3. I've managed to trace this message, it is printed from here: https://github.com/benoitc/hackney/blob/2f55c0d37f9d5942b976272a1df5adecca4edc61/src/hackney_stream.erl#L152
There was a fix in OTP 21.2.3 (https://bugs.erlang.org/browse/ERL-371) in ssl
application that prevents sending the :ssl_closed
message before all the data was sent, that's why there are stream_next
calls in the snippet above - the next message was supposed to be {:hackney_response, _, :done}
and since 21.2.3 only then is the :ssl_closed
message received. Before that version, the response could be closed before getting all the data.
One more detail: there's no need to wait 10mins for :ssl_closed
if there's no {"Connection", "keep-alive"}
header.
url = "https://download.fedoraproject.org/pub/fedora/linux/releases/29/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-29-1.2.iso"
"https://download.fedoraproject.org/pub/fedora/linux/releases/29/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-29-1.2.iso"
iex(34)> {:ok, resp} = :hackney.get(url, [], "", async: :once)
{:ok, #Reference<0.1509818611.3649830913.19500>}
iex(35)> :hackney.stream_next(resp)
:ok
iex(36)> flush
{:hackney_response, #Reference<0.1509818611.3649830913.19749>,
{:status, 302, "Found"}}
{:hackney_response, #Reference<0.1509818611.3649830913.19749>,
{:headers,
[
{"Date", "Tue, 12 Feb 2019 15:14:10 GMT"},
{"Server", "Apache/2.4.37 (Fedora) mod_wsgi/4.6.4 Python/2.7"},
{"X-Frame-Options", "SAMEORIGIN"},
{"X-Xss-Protection", "1; mode=block"},
{"X-Content-Type-Options", "nosniff"},
{"Referrer-Policy", "same-origin"},
{"Content-Length", "0"},
{"Location",
"http://ftp.man.poznan.pl/pub/linux/fedora/releases/29/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-29-1.2.iso"},
{"Content-Type", "text/html; charset=UTF-8"},
{"AppTime", "D=8655"},
{"AppServer", "proxy04.fedoraproject.org"}
]}}
:ok
# 10-15 seconds later
16:08:12.738 [error] Unexpected message: {:ssl_closed, {:sslsocket, {:gen_tcp, #Port<0.24>,
:tls_connection, :undefined}, [#PID<0.319.0>, #PID<0.318.0>]}}