Connection closed when request is attempted
This might be a re-hash of these and other similar issues: https://github.com/sneako/finch/issues/62 https://github.com/elixir-tesla/tesla/issues/467
However, I do think something strange is happening, along with the other folks who commented on those issues, some of them very recent and closely mimicking my experience. The issue is that certain requests fail with an {:error, "socket closed"} error without actually being executed at all, and not due to any real connectivity issue. As a side note, the {:error, "socket closed"} result is just the {:error, %Mint.TransportError{reason: :closed}} returned by Finch, wrapped by the Tesla adapter. But that's beside the point; it just suggests the adapter is an unlikely culprit.
These issues were much more common, roughly 10 times more, before I lowered pool_max_idle_time from the default :infinity to 60_000. Yet the error still happens. I set up telemetry to try and gather more data on the issue. After capturing a few such cases, this is the pattern I usually see (telemetry events forwarded to logs):
16:16:59.565 [info] Finch telemetry: reused_connection.
16:16:59.566 [info] Finch telemetry: http request sent, idle_time: 115862688122, method: DELETE, path: <redacted>
16:16:59.566 [info] Finch telemetry: receive: start, idle_time: 115862688122, method: DELETE path: <redacted>
16:17:00.749 [info] Finch telemetry: pool idle timeout, host: <redacted>, :pool_max_idle_time_exceeded.
16:17:00.750 [info] Finch telemetry: receive: stop, idle_time: 115862688122, method: DELETE path: <redacted>
16:17:00.750 <socket closed error here in my code>
...
<exponential backoff>
<retry succeeds with fresh connection with a sane idle_time>
Please notice two things. First, the idle_time is always a very large number when this happens; taken at face value it's the equivalent of 3.7 years, whereas successful requests have an expected idle_time of less than 60 seconds. Second, I noticed (I believe always) that there is a Finch telemetry: pool idle timeout event when this error happens, immediately, about a millisecond, before the connection closes. I didn't have telemetry back when pool_max_idle_time was at its default (:infinity), but the number of such errors was greater then. It's hard for me to revert it to :infinity and test now, because this code is deployed to an environment where it's actively used and that would potentially increase the error rate downstream. We also need some traffic to make it happen, but if it would help debugging, I could try to make it work.
If the pool shuts down/restarts because of the idle time, I don't think it should have done that. There was more than a second between the request being sent and the pool closing itself; it should have registered that it was still in use and stayed alive with connections open. On the flip side, I don't have an explanation for the idle time either, as it was a huge number to begin with.
We see this a few times each day towards different hosts, maybe 2-6 events daily, on a mildly utilised system. We expect more traffic soon. Also, we used Hackney before migrating to Finch and didn't see similar issues; looking back, I don't think Hackney did any retries by itself. We migrated to Finch because of the more advanced connection pool, but we had to build a lot of defensive logic around this issue that we didn't need before. Which is kind of fine, connections can be broken or unstable and that should be taken care of, but it still bothers me because these are not real connectivity issues in my opinion :-). I would rather improve on this and keep Finch than migrate back to Hackney.
Please advise what additional information I can send. I'll continue looking on my side, but I'm not familiar with Finch's source code yet.
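For reference, the telemetry forwarding behind the log excerpt above looks roughly like this; a minimal sketch, where the handler id, module name, and exact event selection are my reconstruction (see Finch.Telemetry for the authoritative list):

```elixir
defmodule MyApp.FinchTelemetryLogger do
  @moduledoc false
  require Logger

  # Events assumed from the log lines above; Finch.Telemetry documents the
  # full set. Durations such as idle_time arrive in native time units.
  @events [
    [:finch, :reused_connection],
    [:finch, :send, :start],
    [:finch, :send, :stop],
    [:finch, :recv, :start],
    [:finch, :recv, :stop],
    [:finch, :pool_max_idle_time_exceeded]
  ]

  def attach do
    :telemetry.attach_many("finch-debug-logger", @events, &__MODULE__.handle_event/4, nil)
  end

  def handle_event(event, measurements, metadata, _config) do
    Logger.info(
      "Finch telemetry: #{inspect(event)}, " <>
        "measurements: #{inspect(measurements)}, metadata keys: #{inspect(Map.keys(metadata))}"
    )
  end
end
```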
Edit: I've just seen this issue: https://github.com/sneako/finch/issues/269 and it really resembles my case, and we're indeed using OTP 26.1.2. But the OP there had issues with every connection to the affected host, whereas mine are haphazard.
The idle times you're seeing there are based on System.monotonic_time which is in native units. If those are converted to seconds you end up with 115, a much more reasonable value, but still outside of the max idle time that's been set.
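For example, converting the idle_time from the log excerpt above (assuming the native unit on that system is nanoseconds, which it typically is):

```elixir
# 115_862_688_122 native units comes out to roughly 115 seconds.
System.convert_time_unit(115_862_688_122, :native, :second)
#=> 115
```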
What you're seeing here makes sense based on the logic of the http1 pool. A connection is checked out, finch determines that it's passed the idle time that has been specified, executes the telemetry event, and closes the connection. I suspect the reason you were seeing more of these errors in the past is that connections were being closed by the server and finch believed it still had an open connection when in fact it didn't. The max idle time helps avoid this because it causes finch to close connections more often. But somewhere in the chain of events, information is being lost or not handled properly. This same thing has plagued people in the past and as far as I know no one has been able to reliably reproduce it or get a packet capture or similar to see exactly what's going on.
Thank you for your response and for explaining monotonic time. Well, if I'm understanding things correctly, 115 seconds is more than the pool_max_idle_time, but that shouldn't make any difference. It is, however, less than the conn_max_idle_time, which is :infinity by default. If the pool has been relatively active, say requests were going out through it for several minutes, but that particular connection was not being checked out, its idle time may have grown quite long by the time the connection was finally asked for, and it was consequently closed. It seems that conn_max_idle_time is the one that should be set to a lower duration, not necessarily pool_max_idle_time. Does that make any sense?
However, the documentation states:
:conn_max_idle_time (timeout/0) - The maximum number of milliseconds an HTTP1 connection is allowed to be idle before being closed during a checkout attempt. The default value is :infinity.
~~What does that mean? Will Finch also return an error if the idle time is too long? I'll certainly look into the code and try it out, I'm just trying to piece together the desired behaviour.~~ Got it, it's a NimblePool feature. If everything is correct, NimblePool will remove the idle connection and fetch a new one.
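For anyone following along, both options are plain pool options; a minimal sketch, where the pool name and values are illustrative rather than the configuration from this report:

```elixir
# pool_max_idle_time terminates an idle HTTP/1 pool; conn_max_idle_time closes
# an individual idle connection during a checkout attempt (via NimblePool, as
# noted above).
Finch.start_link(
  name: MyFinch,
  pools: %{
    default: [
      size: 50,
      pool_max_idle_time: 60_000,
      conn_max_idle_time: 60_000
    ]
  }
)
```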
Another question: can Finch actually detect that the connection is closed without attempting an HTTP request, moving to receive and returning an error? Can Finch just check out another connection in that case, or does it monitor connections?
Edit: Also, it's still not clear to me why the pool was closed in this case. It obviously wasn't idle, since the connection was checked out more than a second before the pool was closed.
Another question: can Finch actually detect that the connection is closed without attempting an HTTP request, moving to receive and returning an error? Can Finch just check out another connection in that case, or does it monitor connections?
In theory, yes. When the connection is checked into the pool, the pool monitors incoming messages and tcp events and updates the sockets accordingly. What seems to be happening is something like this:
- finch establishes connection A with server
- connection A sits idle
- The server closes or half-closes its socket to connection A
- Connection A is checked out of the pool
- A request is sent through connection A, at which point it's determined that the socket has been closed and an error is returned.
It's never been entirely clear how this can happen. There's a long issue, #62, where lots of folks tried to work out the root cause, but it was never really found.
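To illustrate why the failure only surfaces at the last step in the list above, here is a hedged sketch (not Finch's actual code; the module and function names are mine): Mint.HTTP.open?/1 only reflects the state Mint last observed locally, so a connection the server has already closed can still look open until something reads from the socket.

```elixir
defmodule StaleConnectionSketch do
  # Hypothetical checkout-time check, for illustration only. If the server
  # closed its end while the connection sat idle and nothing has read from the
  # socket since, Mint.HTTP.open?/1 can still return true, and the :closed
  # error only appears once a request is sent and a response is awaited.
  def usable?(conn) do
    Mint.HTTP.open?(conn)
  end
end
```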
Just to add a bit more: @salex89 if there's any way you can get a packet capture of this happening, that would go a long way towards determining the root of the issue.
I talked to our ops team and we should be able to do that, at least on our staging environment, which should be enough since we happen to see these events there. Do you have any suggestion on which capture tool to use? Maybe tshark, tcpdump, or something else?
@salex89 awesome! Either tshark or tcpdump is what I would use, so I don't think you'll go wrong with either. Thanks.
I had a similar issue. I'm not sure if this is exactly the same thing, but here is what happened:
- Our application starts.
- We made multiple requests to an external API using Req; everything is fine.
- We stopped making requests for quite a while.
- When we tried to request the external API again, the first request failed with %Mint.TransportError{reason: :closed}, but the subsequent ones were successful.
What happened was:
- pool_max_idle_time was :infinity.
- We are on AWS and the NAT configuration is set to drop any idle connection after 350s.
- I guess the first request fails due to the socket being closed. The second one creates a new pool and it works again.
- After switching to a pool_max_idle_time just under the 350s, all requests are completed correctly.
Looks like it's possible to detect this behavior since AWS says that a RST packet is sent before closing the connection.
Our setup is also behind an AWS NAT, but as I said initially, pool_max_idle_time is not completely fixing the issue for us. I promised I'd set up a packet capture; I just haven't had the time to do it yet, something came up. But I'll look into it soon.
After switching to a pool_max_idle_time just under the 350s, all requests are completed correctly.
This fixed the issue with AWS NAT gateway for us, too.
We later switched to conn_max_idle_time (used to be named max_idle_time) instead.
That seems to be the better fix, as it terminates individual connections instead of terminating the entire pool.
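A minimal sketch of that variant, assuming the 350s AWS NAT idle timeout mentioned above (the pool name and the exact margin are illustrative):

```elixir
# Close individual idle connections before the NAT drops them, instead of
# tearing down the whole pool.
Finch.start_link(
  name: MyFinch,
  pools: %{
    default: [conn_max_idle_time: 340_000]
  }
)
```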
I'm seeing this issue too. For me there are 2 questions here:
- There is no way to distinguish between a request that was never sent because of a closed connection and a request that was terminated in-flight. Because of this we can't implement safe retries (outside of Finch) when we receive this error; see the sketch after this list.
- If the request hasn't been sent at all, it should be automatically retried by Finch on a new connection. I understand that if the connection was closed mid-request it's not something that can be handled automatically, and it makes sense to return the error.
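As mentioned in the first point, here is a hedged sketch of the kind of defensive retry that currently has to live outside Finch. SafeRetry and its :idempotent option are illustrative, not a proposed API, and the retry is only safe for requests the caller knows can be repeated:

```elixir
defmodule SafeRetry do
  # Retry once on a closed-socket error, but only when the caller has marked
  # the request as idempotent, because the error alone cannot tell us whether
  # the request was never sent or was terminated in-flight.
  def request(req, finch_name, opts \\ []) do
    idempotent? = Keyword.get(opts, :idempotent, false)

    case Finch.request(req, finch_name) do
      {:error, %Mint.TransportError{reason: :closed}} = error ->
        if idempotent?, do: Finch.request(req, finch_name), else: error

      other ->
        other
    end
  end
end

# Example: SafeRetry.request(Finch.build(:get, "https://example.com"), MyFinch, idempotent: true)
```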