TCP connect timeout errors
We're having issues with connections to our prod cluster. Sometimes a query will work, but more often than not, it will throw:
** (stop) exited in: :gen_server.call(#PID<6382.29775.0>, {:checkout, #Reference<6382.893348796.102498306.96870>, true}, 5000)
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
(stdlib) gen_server.erl:214: :gen_server.call/3
src/poolboy.erl:55: :poolboy.checkout/3
lib/db_connection/poolboy.ex:41: DBConnection.Poolboy.checkout/2
lib/db_connection.ex:920: DBConnection.checkout/2
lib/db_connection.ex:742: DBConnection.run/3
lib/db_connection.ex:1133: DBConnection.run_meter/3
lib/db_connection.ex:636: DBConnection.execute/4
lib/mongo.ex:431: Mongo.kill_cursors/3
This happens when calling directly from the server. We're not clear on what process it is saying is not alive.
Our logs show tcp connect errors that look like:
04:05:52.177 [error] Mongo.Protocol (#PID<0.388.0>) failed to connect: ** (Mongo.Error) tcp connect: unknown POSIX error - :timeout
04:05:52.270 [error] GenServer #PID<0.388.0> terminating
** (stop) exited in: :gen_server.call(#PID<0.388.0>, {:checkout, #Reference<0.893348796.90439682.33221>, true, 5000}, 5000)
** (EXIT) time out
Last message: []
State: Mongo.Protocol
04:05:52.271 [error] GenServer #PID<0.375.0> terminating
** (stop) exited in: :gen_server.call(#PID<0.388.0>, {:checkout, #Reference<0.893348796.90439682.33221>, true, 5000}, 5000)
** (EXIT) time out
Last message: {:EXIT, #PID<0.348.0>, {:timeout, {:gen_server, :call, [#PID<0.388.0>, {:checkout, #Reference<0.893348796.90439682.33221>, true, 5000}, 5000]}}}
State: {:state, #PID<0.376.0>, [#PID<0.386.0>, #PID<0.385.0>, #PID<0.384.0>, #PID<0.383.0>, #PID<0.382.0>, #PID<0.381.0>, #PID<0.380.0>, #PID<0.379.0>, #PID<0.378.0>, #PID<0.377.0>], {[], []}, #Reference<0.893348796.90570754.33113>, 10, 0, 0, :fifo}
These errors repeat indefinitely.
We're starting mongo in our supervision tree like:
...
{Mongo, mongo()}
...
defp mongo do
  config = Application.get_env(:blah, :mongo)

  if Keyword.has_key?(config, :seeds) do
    Keyword.update!(config, :seeds, fn seeds -> String.split(seeds, ",") end)
  else
    config
  end
end
With config:
config :blah,
  mongo: [
    name: :blah_db,
    seeds: System.get_env("MONGODB_HOSTS"),
    database: System.get_env("MONGODB_NAME"),
    username: System.get_env("MONGODB_USERNAME"),
    password: System.get_env("MONGODB_PASSWORD"),
    pool: DBConnection.Poolboy
  ]
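For example (every value below is made up), if MONGODB_HOSTS were set to "m1:27017,m2:27017", the mongo/0 helper above would hand the supervisor a keyword list like:
# Illustrative result only; hosts and credentials are examples.
[
  name: :blah_db,
  seeds: ["m1:27017", "m2:27017"],
  database: "blah",
  username: "user",
  password: "secret",
  pool: DBConnection.Poolboy
]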
We are not having any problems on our staging environment, which uses a standalone server.
We're using the current release, 0.4.3.
Please help!
Of course, we're happy to provide any other info needed, but at this point we're not sure how to find more useful info.
More info: one of our infra folks has pointed out that we have a backup server in our replica set which is set unreadable and is isolated from network ingress; this could explain the tcp connect errors.
We are thinking that this failure could be causing the entire pool to shut down, which would explain the error we get when we try to query from the server.
Does this sound like a likely cause for our problems? If so, is there some way we can exclude a host from the topology?
The backup server that cannot be reached is most likely the cause of the tcp connect errors. However, that should not take down the entire application; it should just keep trying to connect to that server forever (we should probably change that behaviour in some way).
As for the checkout issue, I am really unsure why that is happening. I will try to come up with some tests for you to run in order to figure out this issue.
Would you be able to test my exclude_hosts branch? It allows you to specify an :excluded_hosts key in the Mongo.start_link/1 function. It expects a list, similar to the :seeds key. If this fixes your crash, I can look more into why it is crashing, though I only expect it to prevent the tcp connect errors.
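For clarity, a minimal sketch of what I mean (the hosts below are made up, and :excluded_hosts only exists on that branch):
# Hypothetical usage of the exclude_hosts branch; hosts are examples.
{:ok, conn} =
  Mongo.start_link(
    name: :blah_db,
    database: "blah",
    seeds: ["m1.example.com:27017", "m2.example.com:27017", "backup.example.com:27017"],
    # skip the unreachable backup member entirely
    excluded_hosts: ["backup.example.com:27017"],
    pool: DBConnection.Poolboy
  )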
We most definitely will give your branch a go, thanks @ankhers. We've managed to get a local setup that replicates the problem with a bit of finagling of mongos inside docker composes, so we'll test your branch against that as soon as we're able.
Is there any chance you can share those docker files so that I could do some additional testing?
One of the devs on my team got this sorted (the Docker networking setup, not the issue as a whole). I haven't looked at his notes yet, but I'll ask if he can comment on here with some details for you (AFAIK it was a bit of a pain to replicate; I think he ended up hacking /etc/hosts in the client container to simulate a broken route).
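In the meantime, a hypothetical docker-compose fragment for that kind of hack (service names, images, and the blackhole IP are all made up; extra_hosts simply writes entries into the container's /etc/hosts):
# Sketch only: make one replica member resolve to an unroutable address
# inside the client container, so TCP connects to it time out.
services:
  app:
    image: elixir:1.9
    extra_hosts:
      - "backup.mongo.local:203.0.113.1"  # TEST-NET address, never routable
    depends_on:
      - mongo1
  mongo1:
    image: mongo:4.0
    command: mongod --replSet rs0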
Sorry for taking time to reply, but I have written a gist on how to run the containers, with a sample docker-compose.yml file:
https://gist.github.com/vgunawan/4588cf26ca84ba57e39008d94dae032a
Following the instructions.md file should give you some idea of what to do.
Did you end up testing that branch with your setup?
I'm having the same issue, @ankhers. I've been investigating and debugging the code, and I can add more info:
This happens to me when using a remote public URI: once the replica set is updated, my original host is removed from the Topology state, the servers resolve to internal IPs (Amazon in my case), and the monitors map is then updated to those keys (hosts and arbiters) as well. I could see this in the update_rs_from_primary/1 function inside topology_description.ex.
Regarding this problem, I think the mongo spec explains both the problem and the expected behaviour here: https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#clients-use-the-hostnames-listed-in-the-replica-set-config-not-the-seed-list
Does that sound right?
PS: I tested with the exclude_hosts branch. The error doesn't happen, but it doesn't work for me: if I exclude the internal hosts, my original public host is removed as well (reading the reconcile_servers/1 code) with your fix.
I had the same issue, @ankhers, and I think I resolved it by changing the options for timeout and pool_timeout.
There are two (actually three) timeout values:
- timeout — currently used both as the connect timeout and the receive-TCP-data timeout
- pool_timeout — used when checking a connection out of the pool
- connect_timeout_ms — defined but never used
The problem is that if connecting to MongoDB times out (the timeout value is 5000), checking a connection out of the pool will time out as well (its timeout value is also 5000). The checkout timeout exits the calling process (https://github.com/elixir-ecto/db_connection/blob/master/lib/db_connection/connection.ex#L54). After enough crashes, the application will crash because of the supervision tree's restart strategy, and then the Elixir node may go down because of the application restart strategy.
So we can set different timeouts for connecting to MongoDB and for checking a connection out of the pool, like:
timeout: 5000,
pool_timeout: 8000
Actually, we should be using connect_timeout_ms, but that option is only defined, never used.
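To make that concrete, a minimal sketch of passing the two timeouts to Mongo.start_link/1 (the host and exact values are illustrative; the point is that pool_timeout exceeds timeout, so a slow connect fails before the checkout does):
# Illustrative values only.
{:ok, conn} =
  Mongo.start_link(
    database: "blah",
    seeds: ["mongo.example.com:27017"],
    pool: DBConnection.Poolboy,
    # TCP connect / receive timeout
    timeout: 5_000,
    # pool checkout timeout, with headroom over the connect timeout
    pool_timeout: 8_000
  )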
Just to keep on top of this: we added the connect_timeout_ms option in 0.4.6. Unfortunately I think it should have just been called connect_timeout, but that can be fixed later.
Hey, not 100% sure, but I was having a similar error and must report that in my case it totally had to do with internal (GCE) domain name resolution: replacing local names with IPs (which I never did with components on other platforms) totally solved my "timeout" problem.
P.S. I spoke too soon about "totally solving" the problem. I really have no idea; it "just works" now (except when it doesn't, and fails with a timeout).
In my case I solved the issue by passing this explicit option:
type: :single
Should we document the options better? I was connecting remotely to an AWS MongoDB and it kept resolving the internal IPs, which gave errors; I just wanted to connect to my single server's public DNS.
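For reference, a minimal sketch of that workaround (the host is made up): forcing a :single topology should make the driver talk only to the seed you give it rather than the internal addresses advertised by the replica set.
# Hypothetical: connect directly to one public host via a single topology.
{:ok, conn} =
  Mongo.start_link(
    database: "blah",
    seeds: ["public.mongo.example.com:27017"],
    type: :single
  )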
I am getting the same error after upgrading to v5.0.x; the previous version was working fine.
Would anyone be willing to give me access to a database that is having this issue? I do not think there is much I can do if I am unable to see the issue firsthand.
I'm just going to leave this here: someone asked a question on the Elixir forum about this issue. I'm adding it here in case we are able to track it down from that.