TCP connect timeout errors
We're having issues with connections to our prod cluster. Sometimes a query will work, but more often than not, it will throw:
** (stop) exited in: :gen_server.call(#PID<6382.29775.0>, {:checkout, #Reference<6382.893348796.102498306.96870>, true}, 5000)
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
(stdlib) gen_server.erl:214: :gen_server.call/3
src/poolboy.erl:55: :poolboy.checkout/3
lib/db_connection/poolboy.ex:41: DBConnection.Poolboy.checkout/2
lib/db_connection.ex:920: DBConnection.checkout/2
lib/db_connection.ex:742: DBConnection.run/3
lib/db_connection.ex:1133: DBConnection.run_meter/3
lib/db_connection.ex:636: DBConnection.execute/4
lib/mongo.ex:431: Mongo.kill_cursors/3
This happens when calling directly from the server. We're not clear on what process it is saying is not alive.
Our logs show tcp connect errors that look like:
04:05:52.177 [error] Mongo.Protocol (#PID<0.388.0>) failed to connect: ** (Mongo.Error) tcp connect: unknown POSIX error - :timeout
04:05:52.270 [error] GenServer #PID<0.388.0> terminating
** (stop) exited in: :gen_server.call(#PID<0.388.0>, {:checkout, #Reference<0.893348796.90439682.33221>, true, 5000}, 5000)
** (EXIT) time out
Last message: []
State: Mongo.Protocol
04:05:52.271 [error] GenServer #PID<0.375.0> terminating
** (stop) exited in: :gen_server.call(#PID<0.388.0>, {:checkout, #Reference<0.893348796.90439682.33221>, true, 5000}, 5000)
** (EXIT) time out
Last message: {:EXIT, #PID<0.348.0>, {:timeout, {:gen_server, :call, [#PID<0.388.0>, {:checkout, #Reference<0.893348796.90439682.33221>, true, 5000}, 5000]}}}
State: {:state, #PID<0.376.0>, [#PID<0.386.0>, #PID<0.385.0>, #PID<0.384.0>, #PID<0.383.0>, #PID<0.382.0>, #PID<0.381.0>, #PID<0.380.0>, #PID<0.379.0>, #PID<0.378.0>, #PID<0.377.0>], {[], []}, #Reference<0.893348796.90570754.33113>, 10, 0, 0, :fifo}
These errors repeat indefinitely.
We're starting mongo in our supervision tree like:
...
{Mongo, mongo()}
...
defp mongo do
  config = Application.get_env(:blah, :mongo)

  if Keyword.has_key?(config, :seeds) do
    Keyword.update!(config, :seeds, fn seeds -> String.split(seeds, ",") end)
  else
    config
  end
end
With config:
config :blah,
  mongo: [
    name: :blah_db,
    seeds: System.get_env("MONGODB_HOSTS"),
    database: System.get_env("MONGODB_NAME"),
    username: System.get_env("MONGODB_USERNAME"),
    password: System.get_env("MONGODB_PASSWORD"),
    pool: DBConnection.Poolboy
  ]
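For example (every value below is made up), if MONGODB_HOSTS were set to "m1:27017,m2:27017", the mongo/0 helper above would hand the supervisor a keyword list like:
# Illustrative result only; hosts and credentials are examples.
[
  name: :blah_db,
  seeds: ["m1:27017", "m2:27017"],
  database: "blah",
  username: "user",
  password: "secret",
  pool: DBConnection.Poolboy
]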
We are not having any problems on our staging environment, which uses a standalone server.
We're using the current release, 0.4.3.
Please help!
Of course, we're happy to provide any other info needed, but at this point we're not sure how to find more useful info.
More info: one of our infra folks has pointed out that we have a backup server in our replica set which is set unreadable and is isolated from network ingress; this could explain the tcp connect errors.
We are thinking that this failure could be causing the entire pool to shut down, which would explain the error we get when we try to query from the server.
Does this sound like a likely cause for our problems? If so, is there some way we can exclude a host from the topology?
The backup server that cannot be reached is most likely the cause of the tcp connect errors. However, that should not take down the entire application; it should just keep trying to connect to that server forever (we should probably change that behaviour in some way).
As for the checkout issue, I am really unsure why that is happening. I will try to come up with some tests for you to run in order to figure out this issue.
Would you be able to test my exclude_hosts branch? It allows you to specify an :excluded_hosts key in the Mongo.start_link/1 function. It expects a list, similar to the :seeds key. If this fixes your crash, I can look more into why it is crashing, though I only expect it to prevent the tcp connect errors.
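For clarity, a minimal sketch of what I mean (the hosts below are made up, and :excluded_hosts only exists on that branch):
# Hypothetical usage of the exclude_hosts branch; hosts are examples.
{:ok, conn} =
  Mongo.start_link(
    name: :blah_db,
    database: "blah",
    seeds: ["m1.example.com:27017", "m2.example.com:27017", "backup.example.com:27017"],
    # skip the unreachable backup member entirely
    excluded_hosts: ["backup.example.com:27017"],
    pool: DBConnection.Poolboy
  )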
We most definitely will give your branch a go, thanks @ankhers. We've managed to get a local setup that replicates the problem with a bit of finagling of mongos inside docker composes, so we'll test your branch against that as soon as we're able.
Is there any chance you can share those docker files so that I could do some additional testing?
One of the devs on my team got this sorted (the Docker networking setup, not the issue as a whole). I haven't looked at his notes yet, but I'll ask if he can comment on here with some details for you (AFAIK it was a bit of a pain to replicate; I think he ended up hacking /etc/hosts in the client container to simulate a broken route).
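In the meantime, a hypothetical docker-compose fragment for that kind of hack (service names, images, and the blackhole IP are all made up; extra_hosts simply writes entries into the container's /etc/hosts):
# Sketch only: make one replica member resolve to an unroutable address
# inside the client container, so TCP connects to it time out.
services:
  app:
    image: elixir:1.9
    extra_hosts:
      - "backup.mongo.local:203.0.113.1"  # TEST-NET address, never routable
    depends_on:
      - mongo1
  mongo1:
    image: mongo:4.0
    command: mongod --replSet rs0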
Sorry for taking time to reply, but I have written a gist on how to run the containers, with a sample docker-compose.yml file:
https://gist.github.com/vgunawan/4588cf26ca84ba57e39008d94dae032a
Following the instructions.md file should give you some idea of what to do.
Did you end up testing that branch with your setup?
I'm having the same issue, @ankhers. I've been investigating and debugging the code, and I can add more info:
This happens to me when using a remote public URI: once the replica set is updated, my original host is removed from the Topology state, the servers resolve to internal IPs (Amazon in my case), and the monitors map is then updated to those keys (hosts and arbiters) as well. I could see this in the update_rs_from_primary/1 function inside topology_description.ex.
Regarding this problem, I think the mongo spec explains both the problem and the expected behaviour here: https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#clients-use-the-hostnames-listed-in-the-replica-set-config-not-the-seed-list
Does that sound right?
PS: I tested with the exclude_hosts branch. The error doesn't happen, but it doesn't work for me: if I exclude the internal hosts, my original public host is removed as well (reading the reconcile_servers/1 code) with your fix.
I had the same issue, @ankhers, and I think I resolved it by changing the options for timeout and pool_timeout.
There are two (actually three) timeout values:
- timeout — currently used both as the connect timeout and the receive-TCP-data timeout
- pool_timeout — used when checking a connection out of the pool
- connect_timeout_ms — defined but never used
The problem is that if connecting to MongoDB times out (the timeout value is 5000), checking a connection out of the pool will time out as well (its timeout value is also 5000). The checkout timeout exits the calling process (https://github.com/elixir-ecto/db_connection/blob/master/lib/db_connection/connection.ex#L54). After enough crashes, the application will crash because of the supervision tree's restart strategy, and then the Elixir node may go down because of the application restart strategy.
So we can set different timeouts for connecting to MongoDB and for checking a connection out of the pool, like:
timeout: 5000,
pool_timeout: 8000
Actually, we should be using connect_timeout_ms, but that option is only defined, never used.
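To make that concrete, a minimal sketch of passing the two timeouts to Mongo.start_link/1 (the host and exact values are illustrative; the point is that pool_timeout exceeds timeout, so a slow connect fails before the checkout does):
# Illustrative values only.
{:ok, conn} =
  Mongo.start_link(
    database: "blah",
    seeds: ["mongo.example.com:27017"],
    pool: DBConnection.Poolboy,
    # TCP connect / receive timeout
    timeout: 5_000,
    # pool checkout timeout, with headroom over the connect timeout
    pool_timeout: 8_000
  )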
Just to keep on top of this: we added the connect_timeout_ms option in 0.4.6. Unfortunately I think it should have just been called connect_timeout, but that can be fixed later.
Hey, not 100% sure, but I was having a similar error and must report that in my case it totally had to do with internal (GCE) domain name resolution: replacing local names with IPs (which I never did with components on other platforms) totally solved my "timeout" problem.
P.S. I spoke too soon about "totally solving" the problem. I really have no idea; it "just works" now (except when it doesn't, and fails with a timeout).
In my case I solved the issue by passing this explicit option:
type: :single
Should we document the options better? I was connecting remotely to an AWS MongoDB and it kept resolving the internal IPs, which gave errors; I just wanted to connect to my single server's public DNS.
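For reference, a minimal sketch of that workaround (the host is made up): forcing a :single topology should make the driver talk only to the seed you give it rather than the internal addresses advertised by the replica set.
# Hypothetical: connect directly to one public host via a single topology.
{:ok, conn} =
  Mongo.start_link(
    database: "blah",
    seeds: ["public.mongo.example.com:27017"],
    type: :single
  )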
I am getting the same error after upgrading to v5.0.x; the previous version was working fine.
Would anyone be willing to give me access to a database that is having this issue? I do not think there is much I can do if I am unable to see the issue firsthand.
I'm just going to leave this here: someone asked a question on the Elixir forum about this issue. I'm adding it here in case we are able to track it down from that.