Reused socket closing connection
Hello!
I am having some problems with sockets being reused from my pool. For some reason, {error, closed} is returned when hackney calls hackney_tcp:recv(Socket, Length, Timeout).
To illustrate the situation, I've created some test cases against my remote server. I perform some requests using hackney, wait 5 seconds, and start again, so it runs in a loop. In this low-load test, I make 3 requests per loop.
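Roughly, the loop looks like this (just a sketch of the test, not the real code; the URL and pool name are placeholders):

%% Sketch of the test described above: 3 requests per iteration,
%% then a 5 second pause before the next iteration.
loop(URL) ->
    [hackney:request(get, URL, [], <<>>, [{pool, default}]) || _ <- lists:seq(1, 3)],
    timer:sleep(5000),
    loop(URL).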
But I've noticed that in some cases the request was returning {error, closed}. I took a look at the functions my request goes through and added some prints to the code.
I decided to print when a socket was connected and also when it was reused. To get a better idea of which socket was disconnecting, I also printed the socket info inside hackney_tcp:recv. So the code looks like this:
recv(Socket, Length, Timeout) ->
    SocketInfo = {sockname(Socket), peername(Socket)},
    io:format("RCV ~p~n", [SocketInfo]),
    case gen_tcp:recv(Socket, Length, Timeout) of
        {error, closed} ->
            io:format("ERROR ~p~n", [SocketInfo]),
            {error, closed};
        R -> R
    end.
After executing the test, I got the following logs:
Note that, since I make 3 requests per iteration, hackney creates (CONN) or reuses (REUSE) a connection and then calls recv (RCV). I log each operation together with the socket info {sockname(Socket), peername(Socket)}.
First Iteration:
CONN {{ok,{{10,36,20,25},64003}},{ok,{{10,39,252,202},8082}}}
RCV {{ok,{{10,36,20,25},64003}},{ok,{{10,39,252,202},8082}}}
CONN {{ok,{{10,36,20,25},14409}},{ok,{{10,39,252,202},8082}}}
RCV {{ok,{{10,36,20,25},14409}},{ok,{{10,39,252,202},8082}}}
CONN {{ok,{{10,36,20,25},22965}},{ok,{{10,39,252,202},8082}}}
RCV {{ok,{{10,36,20,25},22965}},{ok,{{10,39,252,202},8082}}}
Second Iteration:
CONN {{ok,{{10,36,20,25},39219}},{ok,{{10,39,252,202},8082}}}
RCV {{ok,{{10,36,20,25},39219}},{ok,{{10,39,252,202},8082}}}
REUSE {{ok,{{10,36,20,25},22965}},{ok,{{10,39,252,202},8082}}}
RCV {{ok,{{10,36,20,25},22965}},{ok,{{10,39,252,202},8082}}}
REUSE {{ok,{{10,36,20,25},22965}},{ok,{{10,39,252,202},8082}}}
RCV {{ok,{{10,36,20,25},22965}},{ok,{{10,39,252,202},8082}}}
Third Iteration:
REUSE {{ok,{{10,36,20,25},64003}},{ok,{{10,39,252,202},8082}}}
RCV {{ok,{{10,36,20,25},64003}},{ok,{{10,39,252,202},8082}}}
REUSE {{ok,{{10,36,20,25},14409}},{ok,{{10,39,252,202},8082}}}
RCV {{ok,{{10,36,20,25},14409}},{ok,{{10,39,252,202},8082}}}
REUSE {{ok,{{10,36,20,25},22965}},{ok,{{10,39,252,202},8082}}}
RCV {{ok,{{10,36,20,25},22965}},{ok,{{10,39,252,202},8082}}}
Last Iteration (when I got the error):
Inside the error case of the recv function, if I print {sockname(Socket), peername(Socket)} it gives me {ebadf, econnrefused}.
REUSE {{ok,{{10,36,20,25},64003}},{ok,{{10,39,252,202},8082}}}
RCV {{ok,{{10,36,20,25},64003}},{ok,{{10,39,252,202},8082}}}
ERROR {{ok,{{10,36,20,25},64003}},{ok,{{10,39,252,202},8082}}}
I'd like to know if there's any solution for this kind of problem, or if this is a known issue. I have already tested another server using the same approach, but this is the first time I've got this kind of error. I've also tested without reusing sockets and it works fine, so I think it's closely related to the reuse of old sockets.
I'd be very thankful for any help! Sorry if it's confusing; I tried to explain it in detail. If anything is unclear, just ask!
As a follow-up to our discussion on Slack, I just wanted to add more info.
The current pool checks whether the socket is still active before returning it to the client. It may happen, however, that the socket gets closed after that check; it's not always possible to prevent an EBADF error. Does it happen on the first read or later? Maybe you could provide a coredump?
In any case, check whether the remote is closing the socket. If you can provide me a coredump I will have a look over the weekend.
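To make the race clearer, a liveness check along these lines (a hypothetical sketch, not the actual pool code) cannot be fully reliable:

%% Hypothetical probe: the socket can still be closed by the peer
%% between this check and the next gen_tcp:recv/3.
is_alive(Socket) ->
    case inet:peername(Socket) of
        {ok, _Peer} -> true;
        {error, _Reason} -> false
    end.

Even when such a check returns true, the remote may close the connection right after it, which is why {error, closed} or ebadf can still show up on the read.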
I discovered what was probably the main cause of the error. I should have provided feedback earlier, sorry. What happened is that I am using Cowboy on the remote side as the listener for my service. Cowboy has an option called request_timeout and, according to the docs, its definition is: time in ms with no requests before Cowboy closes the connection.
The default value of this option is 5000 ms, exactly the same delay that I was using between the loops of my test. So in some situations the socket was being reused for a new request right around this timeout and was closed by the remote.
I printed the time each socket was connected for the first time and the time the connection was closed with this error, calculated the interval between them, and got the following results: 9999 ms, 19999 ms, 55317 ms.
The first thing I tried was to increase request_timeout on the remote to a high value, and the error stopped, which made my assumption look much more plausible.
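For reference, raising the timeout on the Cowboy side looks roughly like this (a sketch assuming Cowboy 2.x, where request_timeout is part of the protocol options map; the listener name, port and routes are placeholders):

%% Keep idle keep-alive connections open for 60 s instead of the 5 s default.
Dispatch = cowboy_router:compile([{'_', [{"/[...]", my_handler, []}]}]),
{ok, _} = cowboy:start_clear(my_http_listener,
                             [{port, 8082}],
                             #{env => #{dispatch => Dispatch},
                               request_timeout => 60000}).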
I don't think this alone can eliminate the error, since it depends a lot on timing... The only thing I thought could help in this situation is having an option on the pool_handler to define the max TTL of an inactive socket. With that I could set a value a bit lower than the server's timeout and avoid this kind of situation. But I don't know if it would be worth it...
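If I read the pool options correctly, part of this can maybe already be approximated with the pool's timeout option, which seems to control how long an idle connection is kept alive in the pool; a hedged sketch, with the pool name and values made up:

%% Keep idle sockets for less time than the server's request_timeout,
%% so the pool is less likely to hand back a connection the remote
%% is about to close.
PoolName = my_pool,
PoolOpts = [{timeout, 4000}, {max_connections, 100}],
ok = hackney_pool:start_pool(PoolName, PoolOpts),
hackney:request(get, <<"http://10.39.252.202:8082/">>, [], <<>>,
                [{pool, PoolName}]).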
This is my hypothesis of what happened. I may be wrong, but my tests gave me a good overall understanding of the situation.
I think I agree with such a solution. I am also thinking a socket shouldn't stay open too long in the pool, so as not to exhaust the server's resources or make it suspect bad behaviour. I propose the following changes:
- [ ] shorten the default TTL for sockets in the pool to 2 or 5 seconds (imo 2 seconds is a good timing)
- [ ] make this TTL configurable
- [ ] asynchronous requests should have an option to close on TTL so the client can better handle the case where Cowboy closes the connection.
@gutiago btw sorry for the delay in answering, I was sidetracked too much...
I think it's a great solution! I don't really know the cost of creating a new socket in the pool, but letting a socket sit in the pool for a long time wouldn't be nice either. Making it configurable is even better; in my case I think 5 s or a bit higher would work best... Anyway, thanks for the reply!
Tasks are still valid, let's keep this open.
Hi @benoitc , was this ever implemented? We're experiencing the same problem in which connections are randomly closed because they are being reused close to their timeout.
Perhaps connections that are about to expire shouldn't be used?
Or is it expected that the calling code should watch for :close errors and retry?
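For what it's worth, a simple workaround on the caller side is a small retry wrapper around the request (a sketch, not a hackney API; the function name and retry count are made up):

%% Retry when a pooled connection turns out to be already closed by the remote.
request_with_retry(Method, URL, Headers, Body, Opts, Retries) ->
    case hackney:request(Method, URL, Headers, Body, Opts) of
        {error, closed} when Retries > 0 ->
            request_with_retry(Method, URL, Headers, Body, Opts, Retries - 1);
        Other ->
            Other
    end.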