mariaex

DNS caching is causing failures on long-running apps

Open sokser opened this issue 7 years ago • 7 comments

We are using Amazon Aurora to handle our databases. The setup involves a master and a slave, with our Elixir app writing to the master. Periodically, if there is a problem with one of the instances, the master and slave may swap roles. During this period we would expect mariaex to re-resolve the DNS name and update the IP it is trying to hit. However, we notice that even after several days the connection errors continue. We've turned off caching at the Erlang level, and now we're wondering whether this is an issue with mariaex. Can you please let us know how we can force the app to re-resolve the DNS at certain intervals, or when encountering the error seen in the message below? Note: immediately after restarting the app the errors disappear, until the next time the DNS changes.

image

sokser avatar May 03 '17 19:05 sokser

There isn't any DNS caching, but once a connection is opened it stays open, so if the master becomes a slave the connection keeps pointing to the slave. I've opened https://github.com/elixir-ecto/db_connection/issues/99 to add a feature that should help this, or a similar, situation heal after a period of time. It may also help us GC some unused cached queries, particularly insert_all ones in Ecto.

fishcakez avatar Oct 19 '17 06:10 fishcakez

@fishcakez we're running into similar issues. Aurora initiates a failover, and when that happens they actually restart all the nodes -- we get ~16 errors

Elixir.Mariaex.Error lib/ecto/adapters/sql.ex:440
[tcp] `recv` failed with: :closed

followed by almost 2000 ConnectionErrors (across a bunch of nodes)

Elixir.DBConnection.ConnectionError connection not available because of disconnection 
    lib/db_connection.ex:926 DBConnection.checkout/2
    lib/db_connection.ex:742 DBConnection.run/3
    lib/db_connection.ex:636 DBConnection.execute/4
    lib/ecto/adapters/sql.ex:256 Ecto.Adapters.SQL.sql_call/6
    lib/ecto/adapters/sql.ex:436 Ecto.Adapters.SQL.execute_or_reset/7
    lib/ecto/repo/queryable.ex:133 Ecto.Repo.Queryable.execute/5
    lib/ecto/repo/queryable.ex:37 Ecto.Repo.Queryable.all/4

After that it seems to stabilize and reconnect, but any writes we do trigger a read-only issue.

{:error, %Mariaex.Error{action: nil, connection_id: nil, mariadb: %{code: 1792, message: "Cannot execute statement in a READ ONLY transaction."}, message: nil, reason: nil, tag: nil}}

I think elixir-ecto/db_connection#99 would be a good first start, but there should be some way of resolving this inside mariaex? On a forced tcp close, instead of reusing the connection instance and reopening a connection to the same node, it should start a new connection, doing the DNS lookup to get the correct node.

Anyway, even elixir-ecto/db_connection#99 would be a good start, any ETA? We need to address this in some way because lately RDS has been doing failovers almost daily, and it causes a full service outage until someone goes and manually restarts all the nodes...

archseer avatar Dec 15 '17 03:12 archseer

> I think elixir-ecto/db_connection#99 would be a good first start, but there should be some way of resolving this inside mariaex? On a forced tcp close, instead of reusing the connection instance and reopening a connection to the same node, it should start a new connection, doing the DNS lookup to get the correct node.

When the TCP socket closes, we redo the DNS lookup on every attempt to create a fresh connection. I am not sure there is anything we can do in mariaex to help here, because we rely on the OS's DNS lookup.

> Anyway, even elixir-ecto/db_connection#99 would be a good start, any ETA? We need to address this in some way because lately RDS has been doing failovers almost daily, and it causes a full service outage until someone goes and manually restarts all the nodes...

The ETA for the next release is sometime next month, and this feature is planned for inclusion, but I don't think anyone is working on the issue.

fishcakez avatar Dec 15 '17 04:12 fishcakez

We'll experiment with turning off DNS caching at the Erlang level; there's a cache_refresh flag, although its default is 1 hour and our nodes have been running for days, so the cache should have expired by now.

archseer avatar Dec 15 '17 05:12 archseer

The cache_refresh flag should have no effect because Elixir/Erlang uses the OS's DNS resolver by default, unless configured otherwise.
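One way to check what the VM currently resolves a hostname to is to call `:inet.getaddr/2` from a remote shell on the running node and compare the result with what the OS tools report. The sketch below is only a debugging aid; the `DnsCheck` module name is made up for illustration and is not part of mariaex.

```elixir
defmodule DnsCheck do
  # Hypothetical helper: resolves a hostname to an IPv4 address string
  # using the VM's configured lookup method, via :inet.getaddr/2.
  def resolve(hostname) when is_binary(hostname) do
    case :inet.getaddr(String.to_charlist(hostname), :inet) do
      {:ok, ip} -> {:ok, ip |> :inet.ntoa() |> to_string()}
      {:error, reason} -> {:error, reason}
    end
  end
end
```

Calling `DnsCheck.resolve/1` with the cluster endpoint before and after a failover should show whether the VM sees the new address.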

fishcakez avatar Dec 15 '17 05:12 fishcakez

In that case, if the OS's DNS is used, why would rebooting the application (and not the whole node) fix the problem? Maybe the reconnect is faster than the AWS failover.

archseer avatar Dec 15 '17 05:12 archseer

A reconnect is attempted frequently to begin with and then slows down (this is configured with backoff_min, backoff_max, and backoff_type), so I suppose it's possible that the DNS change didn't propagate before the master was restarted as a slave, and Mariaex remains connected to the slave.
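For reference, a minimal sketch of where those backoff options might be set. The app name `:my_app`, the repo module `MyApp.Repo`, and the hostname are placeholders, and the values are only illustrative (I believe they match the DBConnection defaults, but check the docs for your version).

```elixir
# config/prod.exs (hypothetical app and repo names)
# On Elixir 1.9+ `import Config` replaces `use Mix.Config`.
use Mix.Config

config :my_app, MyApp.Repo,
  # Placeholder: put the Aurora cluster endpoint here.
  hostname: "db.example.com",
  # Reconnect backoff settings, passed through to DBConnection.
  # Times are in milliseconds.
  backoff_type: :rand_exp,
  backoff_min: 1_000,
  backoff_max: 30_000
```

A larger `backoff_min` would give the DNS change more time to propagate before the first reconnect attempt, at the cost of a slower recovery from ordinary network blips.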

I think you could try patching Mariaex to disconnect, instead of just returning an error, on the code: 1792 error. This would be the simplest fix, as it is just adding a clause to a case statement or two.
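To illustrate the idea (this is not Mariaex's actual code, and the module and function names are made up): a DBConnection callback can return `{:disconnect, exception, state}` instead of `{:error, exception, state}`, which makes the pool drop the connection and reconnect. A patch could map the read-only error to that tuple roughly like this:

```elixir
defmodule ReadOnlyDisconnect do
  # MySQL/MariaDB error code for
  # "Cannot execute statement in a READ ONLY transaction."
  @read_only_code 1792

  # Hypothetical helper: turns a read-only error result into a disconnect
  # result so the connection is closed and re-established.
  def maybe_disconnect({:error, %{mariadb: %{code: @read_only_code}} = err, state}) do
    {:disconnect, err, state}
  end

  # Every other result passes through unchanged.
  def maybe_disconnect(other), do: other
end
```

The reconnect would then go through a fresh DNS lookup and should land on the new master, assuming the DNS change has propagated by that point.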

fishcakez avatar Dec 15 '17 06:12 fishcakez