
Mint.TransportError - socket closed

Open sergiotapia opened this issue 1 year ago • 5 comments

** (Mint.TransportError) socket closed
    (req 0.4.8) lib/req.ex:1029: Req.request!/

It's quite hard to find a repro for this bug, but it's frequent enough that it shows up regularly in my Oban jobs and even when running code in iex -S mix phx.server. To fix it I just need to re-run it, or let the Oban job retry.


Any ideas why this would happen so frequently? Again, I appreciate it's hard to debug without more repro steps but happy to figure out how to find that repro case!

URLs where this happens:

OpenAI's API, HowLongToBeat's API, Target's RedSky API.

sergiotapia avatar Feb 24 '24 21:02 sergiotapia

I can quite easily replicate this issue. Just reload the module that makes a Req request in an iex session. For me, after the code reload, I get socket closed.

jozuas avatar Mar 08 '24 16:03 jozuas

For anyone running into this issue, please see https://github.com/sneako/finch/pull/273#issuecomment-2144879631; you can test it by setting the Finch dependency to:

{:finch, github: "keathley/finch", branch: "handle-idle-connections", override: true}

wojtekmach avatar Jun 03 '24 13:06 wojtekmach

This is still happening very frequently. I'm calling Req.post from Oban jobs and it happens quite a lot.

Req.post(@base_url,
  headers: headers,
  json: body
)

** (Req.TransportError) socket closed
    (req 0.5.6) lib/req.ex:1092: Req.request!/2

sergiotapia avatar Oct 10 '24 13:10 sergiotapia

I have the same issue; I see a lot of logs with this error message:

** (Mint.TransportError) socket closed

Req.retry/3: Got exception. Will retry in 0ms, 1 attempt left

Does that mean that the request failed and it is going to be retried, or that it failed completely?

mrkaspa avatar Oct 23 '24 17:10 mrkaspa

It means the request will be retried.
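
That log line comes from Req's built-in retry step; the knobs below are Req's documented retry options (a sketch against a placeholder URL):

```elixir
# Retries are on by default for idempotent requests (:safe_transient).
Req.get("https://api.example.com/data",
  retry: :transient,                  # retry transient errors on any method
  max_retries: 3,                     # "1 attempt left" counts down from here
  retry_delay: fn n -> n * 1000 end,  # delay before attempt n, in ms
  retry_log_level: :warning           # level of the "Will retry in ..." log
)
```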

wojtekmach avatar Oct 23 '24 18:10 wojtekmach

We're observing an abnormally high incidence of Req.TransportError (socket closed) in our Oban jobs sending requests to APIs.

It is expected that this might happen occasionally, but the incidence seems far higher than what we'd expect, and it doesn't seem correlated to a single service but to all the different APIs we request often. At one point we checked whether there was any configuration issue in our provisioning that could be related.

Are there any guidelines, workarounds, or known issues related to this?

badosu avatar Jul 29 '25 18:07 badosu

Hey Amadeus, long time no see! Nothing comes to mind, but if you are able to consistently reproduce this issue, it'd be really appreciated if you could debug this by removing layers: if you can reproduce the same error using Finch directly, that's a sign the problem is not in Req. And then if you can reproduce the problem using Mint directly, maybe it's a matter of socket configuration. It's worth using different HTTP clients and seeing if you can reproduce this.
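
Stripping away the Req layer might look like this with Finch's public API (a sketch; MyFinch and the URL are placeholders):

```elixir
# Start a Finch pool (normally under your supervision tree)
{:ok, _pid} = Finch.start_link(name: MyFinch)

# Issue the request without Req's steps and retry machinery
:get
|> Finch.build("https://api.example.com/data")
|> Finch.request(MyFinch)
# If this also fails with {:error, %Mint.TransportError{reason: :closed}},
# the problem is below Req (Finch pooling or the socket itself)
```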

One Oban-specific piece of advice: Req retries GET requests by default, and Oban jobs tend to retry too, so you might be retrying more than expected.
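
When Oban already provides retries, Req's own retries can be switched off so one job attempt equals one HTTP attempt (a sketch; the URL is a placeholder):

```elixir
# Let Oban be the single source of retry/backoff policy
Req.post("https://api.example.com/jobs", json: %{id: 123}, retry: false)
```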

wojtekmach avatar Jul 29 '25 18:07 wojtekmach

Hey Amadeus, long time no see!

Hey Wojtek, glad to see you too! I was wondering if you'd remember. I'm very happy to see you so active in this community I joined a couple of years ago; keep it up!

One Oban-specific piece of advice: Req retries GET requests by default, and Oban jobs tend to retry too, so you might be retrying more than expected.

Yes, we're aware. Our critical non-idempotent jobs are not allowed to be retried unless carefully inspected to be safe to do so, which exacerbates this issue for us, since we have to either track every instance of this error or allow retries for this case. Keep in mind that our volume is still very low compared to our goal, yet we're already observing such a high incidence, so this issue is very relevant for us.

For non-critical jobs (still not idempotent; we don't allow retries) we're considering introducing this retry option (keep GET and HEAD as they are, and additionally allow retries on any :econnrefused or :closed transport error) to test whether the issue is mitigated:

retry: fn
  %Req.Request{method: method}, response_or_exception
  when method in [:get, :head] ->
    # Vendored from Req.Steps
    Utilities.Req.transient?(response_or_exception)

  _request, %Req.TransportError{reason: reason}
  when reason in [:closed, :econnrefused] ->
    true

  _request, _response_or_error ->
    false
end
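
For context, a rough sketch of what the vendored Utilities.Req.transient?/1 predicate referenced in the code could look like, based on the transient conditions Req documents for its retry step (not the exact Req.Steps source):

```elixir
defmodule Utilities.Req do
  # A response is transient when its status is one Req considers retryable.
  def transient?(%Req.Response{status: status})
      when status in [408, 429, 500, 502, 503, 504],
      do: true

  # A transport error is transient for these socket-level reasons.
  def transient?(%Req.TransportError{reason: reason})
      when reason in [:timeout, :econnrefused, :closed],
      do: true

  def transient?(_response_or_exception), do: false
end
```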

if you could debug this by removing layers: if you can reproduce the same error using Finch directly, that's a sign the problem is not in Req

Any suggestion on how we can instrument this so we can provide useful data?

badosu avatar Jul 29 '25 18:07 badosu

We're observing an abnormally high incidence of Req.TransportError (socket closed) in our Oban jobs sending requests to APIs.

It is expected that this might happen occasionally, but the incidence seems far higher than what we'd expect, and it doesn't seem correlated to a single service but to all the different APIs we request often. At one point we checked whether there was any configuration issue in our provisioning that could be related.

Are there any guidelines, workarounds, or known issues related to this?

This has been my experience as well; I still haven't figured out why it happens.

sergiotapia avatar Jul 29 '25 18:07 sergiotapia

This is anecdata, but this was a big issue for us, and it seemed to happen only for connections to particular upstream servers (most notably the Anthropic API). We couldn't find a root cause in reasonable time, so we switched to HTTPoison for these problematic connections.

kzemek avatar Oct 01 '25 13:10 kzemek

We couldn't find a root cause in reasonable time, so we switched to HTTPoison for these problematic connections.

Would you mind sharing whether switching to HTTPoison decreased the amount of timeouts (perceived, or hard data if you have it)?

badosu avatar Oct 01 '25 19:10 badosu

We couldn't find a root cause in reasonable time, so we switched to HTTPoison for these problematic connections.

Would you mind sharing whether switching to HTTPoison decreased the amount of timeouts (perceived, or hard data if you have it)?

Yes, we switched around April 2024 and the issue disappeared.

FWIW we're behind AWS NAT, so the workaround in https://github.com/sneako/finch/issues/272#issuecomment-2145675560 may be a solution for us, but IIRC we also saw the same issue with HTTP/2 pools, which don't have pool_max_idle_time
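
For reference, the idle-connection workaround mentioned in the linked thread is Finch's pool_max_idle_time pool option; wiring it up could look roughly like this (a sketch; MyFinch and the 60-second value are placeholders chosen for illustration):

```elixir
# In the application's supervision tree: terminate pools that have been
# idle for 60s, before a NAT gateway silently drops their connections.
children = [
  {Finch,
   name: MyFinch,
   pools: %{
     default: [size: 50, pool_max_idle_time: 60_000]
   }}
]

# Then point Req at this pool per request:
# Req.get("https://api.example.com/data", finch: MyFinch)
```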

kzemek avatar Oct 02 '25 08:10 kzemek