tigase-server
Problem with HA on AWS Aurora MySQL for DNS cache
Problem with DNS cache. I was testing the AWS Aurora MySQL failover procedure with Tigase. Reading the documentation, I found https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.MultiAZSingleStandby.html and, scrolling to "Setting the JVM TTL for DNS name lookups",
it explains that it is important to set a TTL for the Java DNS cache, otherwise the failover does not work properly.
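For reference, this is the JVM setting the AWS page refers to; a minimal sketch (the 5-second value is just an example) that applies it programmatically instead of editing the java.security file:

import java.security.Security;

public class DnsTtlExample {
    public static void main(String[] args) {
        // Must be applied before the JVM performs its first DNS lookup;
        // the same effect can be achieved by setting
        // "networkaddress.cache.ttl=5" in the java.security file.
        Security.setProperty("networkaddress.cache.ttl", "5");
        System.out.println("networkaddress.cache.ttl = "
                + Security.getProperty("networkaddress.cache.ttl"));
    }
}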
I was testing it on a Tigase instance, and during my test I found that Tigase does not recover automatically after triggering a failover manually. In the logs there is:
[2023-04-21 16:39:01:728] [WARNING ] [ cluster-nodes ] ClConSQLRepository.removeItem() : Problem removing elements from DB:
java.sql.SQLException: The MySQL server is running with the --read-only option so it cannot execute this statement
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:965)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3978)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3914)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2530)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2683)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2495)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1903)
at com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2124)
at com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2058)
at com.mysql.jdbc.PreparedStatement.executeLargeUpdate(PreparedStatement.java:5158)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2043)
at jdk.internal.reflect.GeneratedMethodAccessor59.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at tigase.db.jdbc.PreparedStatementInvocationHandler.invoke(PreparedStatementInvocationHandler.java:38)
at com.sun.proxy.$Proxy35.executeUpdate(Unknown Source)
at tigase.cluster.repo.ClConSQLRepository.removeItem(ClConSQLRepository.java:137)
at tigase.cluster.repo.ClConConfigRepository.itemLoaded(ClConConfigRepository.java:178)
at tigase.cluster.repo.ClConSQLRepository.reload(ClConSQLRepository.java:221)
at tigase.db.comp.ConfigRepository$1.run(ConfigRepository.java:81)
at java.base/java.util.TimerThread.mainLoop(Timer.java:556)
at java.base/java.util.TimerThread.run(Timer.java:506)
This means that it is still using the old DNS endpoint and not switching to the new one; I left it for over 10 minutes and it kept printing the same message.
Restarting the server resolves the problem, but I need automatic recovery.
On the system, I changed the value of networkaddress.cache.ttl in /etc/java-11-openjdk/security/java.security,
as suggested in the AWS documentation, but I got the same results.
Is there any other way to set this parameter? Do you know why Tigase does not honor the system configuration?
System info:
- Tigase version: 8.1.2-b10915
- JVM flavour and version: openjdk 11.0.18 2023-01-17 OpenJDK Runtime Environment (build 11.0.18+10-post-Ubuntu-0ubuntu120.04.1) OpenJDK 64-Bit Server VM (build 11.0.18+10-post-Ubuntu-0ubuntu120.04.1, mixed mode, sharing)
- Operating system/distribution/version: Ubuntu 20.04.6 LTS
We already set this property: https://github.com/tigase/tigase-utils/blob/f681e263b95013886a983c9b0706c2c391cfd2ba/src/main/java/tigase/util/dns/DNSResolverDefault.java#L170-L170
Your instance should then reconnect correctly after 2 minutes.
Thank you for the feedback. Unfortunately, from my tests it seems that it does not work; I have made several attempts, and even after 10-20 minutes Tigase does not reconnect.
In the comment I see "we are caching ourselves". What exactly does this mean? Is there some other cache level somewhere else?
Anyway, it could be a good idea to expose the value of networkaddress.cache.ttl as a configuration parameter, since it is not possible to override it if it is hard-coded.
In the comment I see "we are caching ourselves". What exactly does this mean? Is there some other cache level somewhere else?
There is a cache for IP and SRV records (https://github.com/tigase/tigase-utils/blob/88bc1e649e281e3eeece1abc7acc32647a93a398/src/main/java/tigase/util/dns/DNSResolverDefault.java#L46-L46)
Anyway, it could be a good idea to expose the value of networkaddress.cache.ttl as a configuration parameter, since it is not possible to override it if it is hard-coded.
I've made a small change so that we will effectively respect the value from $JAVA_HOME/jre/lib/security/java.security
(we will only set it ourselves if it has not already been set beforehand / in the system).
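Presumably the change works along these lines (a rough sketch of the described behavior, not the actual Tigase code; the 120-second value reflects the "2 minutes" mentioned above):

import java.security.Security;

public class DnsTtlDefault {
    // Only apply a default TTL when the administrator has not already
    // configured one via java.security or a system-level override.
    static void applyDefaultDnsTtlIfUnset() {
        if (Security.getProperty("networkaddress.cache.ttl") == null) {
            Security.setProperty("networkaddress.cache.ttl", "120");
        }
    }
}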
In the comment I see "we are caching ourselves". What exactly does this mean? Is there some other cache level somewhere else?
There is a cache for IP and SRV records (https://github.com/tigase/tigase-utils/blob/88bc1e649e281e3eeece1abc7acc32647a93a398/src/main/java/tigase/util/dns/DNSResolverDefault.java#L46-L46)
OK, but this should not affect / cache the dataSource URI, right?
Anyway, it could be a good idea to expose the value of networkaddress.cache.ttl as a configuration parameter, since it is not possible to override it if it is hard-coded.
I've made a small change so that we will effectively respect the value from $JAVA_HOME/jre/lib/security/java.security
(we will only set it ourselves if it has not already been set beforehand / in the system).
OK, thanks. Have you ever successfully run a failover test with AWS Aurora MySQL?
Tomorrow I will try a newer version to see whether 8.2 or 8.3 has the same problem.
I ran a test after upgrading to 8.2.2, but I have the same problem. I also tried adding useCallableMysqlWorkaround = 'true'
just to check whether this could be the cause, but nothing changed: after 120 seconds Tigase still tries to connect to the old MySQL instance, logging the exception "The MySQL server is running with the --read-only option so it cannot execute this statement".
OK, but this should not affect / cache the dataSource URI, right?
Correct, the internal cache of the DNS resolver is only used in the XMPP layer. DNS resolution for data sources is done by the JDBC driver.
OK, thanks. Have you ever successfully run a failover test with AWS Aurora MySQL?
Yes, but it was a couple of years back (when we changed the JVM DNS caching setting) and it was working correctly.
Regarding your current issue - it seems that, for some reason, during the failover Tigase gets disconnected from the database but manages to reconnect to the old instance that was demoted to a slave/read-only instance.
Which RDS endpoint address do you use (the cluster one)? Does it point to the writer replica after performing the failover? Which type of failover do you perform?
I'm triggering the failover from the console. Please note that I'm using an AWS Aurora cluster compatible with MySQL 5.7 (and not AWS RDS Multi-AZ, which has a different failover strategy). In this scenario I just select the writer instance of the cluster and click "failover"; the cluster has just one writer and one replica.
Yes, I'm using the cluster endpoint and not the instance one. If I restart Tigase, everything works fine, so after a restart, without changing the configuration, it simply resolves to the new address.
I do not think that Tigase gets disconnected during the failover. I guess something is cached, and for some reason Tigase is still using the old connections to the host resolved before the failover; in the test config I use 'pool-size' = '12'.
Is it possible that it is trying to use another connection from the pool that was already bound to the old MySQL instance but not used yet?
If I remember correctly, Tigase creates all the connections of the DB pool at startup, so I guess that during the failover all the connections simply remain attached to the old instance, since Aurora switches the DNS and, I believe, does not close active connections. Maybe the only solution is to handle that exception explicitly in Tigase and re-create all the connections of the DB pool.
Manual failover is useful for changing the instance type with limited downtime, and it is also used by AWS when upgrading the MySQL version.
Indeed, if there is no disconnection, then the existing database connection pool will keep all of its connections.
If you open a MySQL connection from the shell and perform the failover, does the connection stay open? Can you then operate on it, or (as it seems) are you still connected to the same instance, which has been made read-only (just like in the Tigase case)?
I ran the test during the failover: the mysql client does not report any error, but then, if I run a query, this is the result:
mysql> select * from tig_users order by uid desc limit 0,2 \G
ERROR 2013 (HY000): Lost connection to MySQL server during query
No connection. Trying to reconnect...
Connection id: 11
Current database: tigasedb
I added netstat -tpn | grep 3306 to the test; there is nothing on the machine besides the mysql client.
When it is connected, before the failover:
tcp 0 0 172.31.30.220:1332 172.31.11.137:3306 ESTABLISHED 2280/mysql
After the failover:
tcp 1 0 172.31.30.220:1332 172.31.11.137:3306 CLOSE_WAIT 2280/mysql
So it seems that the failover actually closes the TCP connection. I then ran another test with this script:
date
mysql --ssl-mode=DISABLED -h clusterhostname -u tigase -ppassword -D tigasedb -e "select 1" 2>&1 |grep -v "Using a password on the command line interface"
dig +short clusterhostname
This is the output:
Wed 26 Apr 2023 03:27:07 PM UTC
1
1
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:08 PM UTC
1
1
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:09 PM UTC
1
1
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:10 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:11 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:12 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:13 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:14 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:15 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:16 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:17 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
# -> this is really strange: it seems that for a few seconds it resolves to the secondary and allows connections:
# -> ----- start
Wed 26 Apr 2023 03:27:18 PM UTC
1
1
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:19 PM UTC
1
1
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:20 PM UTC
1
1
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:21 PM UTC
1
1
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:22 PM UTC
1
1
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:23 PM UTC
1
1
# -> ----- end
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:27:24 PM UTC
1
1
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:27:25 PM UTC
1
1
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:27:26 PM UTC
1
1
As noted in the comment above, it seems that shortly after the failover the DNS still points to the old primary writer, allowing connections but in read-only mode, and this lasts for a few seconds. So I ran this test to confirm it:
date
mysql --ssl-mode=DISABLED -h clusterhostname -u tigaseUser -ppassword -D tigasedb -e "select @@hostname; update tig_offline_messages set expired = now() where msg_id = 1; " 2>&1 |grep -v "Using a password on the command line interface"
dig +short clusterhostname
echo
And, as expected, this is the problem:
Wed 26 Apr 2023 03:40:54 PM UTC
@@hostname
ip-10-4-2-74
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:40:55 PM UTC
@@hostname
ip-10-4-2-74
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:40:56 PM UTC
@@hostname
ip-10-4-2-74
ERROR 2013 (HY000) at line 1: Lost connection to MySQL server during query
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:40:57 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:40:58 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:40:59 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:41:00 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:41:01 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:41:02 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:41:03 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:41:04 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:41:05 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
# after the failover it is still resolving to the old IP and accepting connections, but the instance is in read-only mode
Wed 26 Apr 2023 03:41:06 PM UTC
@@hostname
ip-10-4-2-74
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only option so it cannot execute this statement
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:41:07 PM UTC
@@hostname
ip-10-4-2-74
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only option so it cannot execute this statement
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:41:08 PM UTC
@@hostname
ip-10-4-2-74
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only option so it cannot execute this statement
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:41:09 PM UTC
@@hostname
ip-10-4-2-74
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only option so it cannot execute this statement
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:41:10 PM UTC
@@hostname
ip-10-4-2-74
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only option so it cannot execute this statement
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
# after a few seconds it switches to the correct IP
Wed 26 Apr 2023 03:41:11 PM UTC
@@hostname
ip-10-4-1-206
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:41:12 PM UTC
@@hostname
ip-10-4-1-206
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:41:13 PM UTC
@@hostname
ip-10-4-1-206
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
With an UPDATE query I can trigger the "The MySQL server is running with the --read-only option" error; after about 5 seconds it comes back to the correct host. So this is what is happening: Tigase immediately reconnects to the old writer instance, now in read-only mode, and stays stuck there forever. I guess you must handle that exception explicitly.
I found the same problem described here: https://proxysql.com/blog/failover-comparison-in-aurora-mysql-2-10-0-using-proxysql-vs-auroras-cluster-endpoint/
So, they ran into the same issue, but reading your previous comment:
As noted in the comment above, it seems that shortly after the failover the DNS still points to the old primary writer, allowing connections but in read-only mode, and this lasts for a few seconds. So I ran this test to confirm it:
I wonder if this isn't an issue with RDS itself. I mean - if they perform the failover, they should coordinate the disconnection, the DNS update, and making the instance available again. Currently they disconnect, make the (now reader) instance available under the cluster endpoint, and then update the DNS. They should swap the last two steps.
Have you considered raising the issue with AWS tech support?
Yes, it seems to be a bug on the AWS side. I will try to open a case, although I know it will take a long time and there is no guarantee that they will fix it. Do you think you are going to handle it and implement a workaround on your side?
I filed a ticket for it (ref: server-1354), but it's not on our immediate agenda and doesn't have an ETA.
Thank you, I will post an update here if I get some answers from AWS.
Anyway, I think this should also be quite an important topic on your side. If I send a message through Tigase to a user who is offline while the DB is in that state, the message gets lost without returning any error to the sender; it would be better if Tigase just shut down completely.
Hmm, most database exceptions (and this is just an SQLException in the end) should result in an error being returned. Could you share the stack trace/error log from the case resulting in message loss?
This is the exception:
[2023-04-28 17:00:19:922] [WARNING ] [ in_4-amp ] JDBCMsgRepository.storeMessage() : Problem adding new entry to DB:
java.sql.SQLException: The MySQL server is running with the --read-only option so it cannot execute this statement
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:965)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3978)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3914)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2530)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2683)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2495)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1903)
at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2011)
at com.mysql.jdbc.CallableStatement.executeQuery(CallableStatement.java:872)
at jdk.internal.reflect.GeneratedMethodAccessor52.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at tigase.db.jdbc.PreparedStatementInvocationHandler.invoke(PreparedStatementInvocationHandler.java:38)
at com.sun.proxy.$Proxy35.executeQuery(Unknown Source)
at tigase.server.amp.db.JDBCMsgRepository.storeMessage(JDBCMsgRepository.java:414)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at tigase.stats.StatisticsInvocationHandler.invoke(StatisticsInvocationHandler.java:75)
at com.sun.proxy.$Proxy38.storeMessage(Unknown Source)
at tigase.server.amp.db.MsgRepository$MsgRepositoryMDBean.storeMessage(MsgRepository.java:326)
at tigase.server.amp.action.Store.execute(Store.java:116)
at tigase.server.amp.AmpComponent.processPacket(AmpComponent.java:198)
at tigase.cluster.AmpComponentClustered.processPacket(AmpComponentClustered.java:72)
at tigase.server.AbstractMessageReceiver$QueueListener.run(AbstractMessageReceiver.java:1405)
I do not receive any error on the XMPP submission. I guess that Tigase is asynchronous: it accepts the message that needs to be delivered and sends an ack to the sender, then it finds that the recipient is offline and tries to store the message in the MsgRepository. If the recipient is online, delivery works even when MySQL is not working; but if the recipient is offline, this exception is triggered without any error for the sender.
Tigase in general operates in an asynchronous manner (it is based on message passing, so in a way it is event-based).
In this case an error should be returned; we will add it in the future (ref: server-1355).
I'm having a big argument with AWS support about this bug; basically they say:
I have conducted some testing in my lab and have observed the same outcome as yourself. I also noticed a 5-second delay in updating the DNS, which I would like to clarify is not a bug, but rather it is by design. Aurora DNS zones have a short time-to-live (TTL) of 5 seconds, and it is crucial that your network and client configurations do not further increase this. This is how Aurora is designed, and the delay you are experiencing is indeed expected.
and they linked me this article: https://repost.aws/knowledge-center/aurora-mysql-db-cluser-read-only-error
Anyway, now I have another doubt: why, at some point, does Tigase not start resolving the correct DNS entry? Even if right after the outage it connects to the wrong instance and starts issuing the errors described at the beginning:
[2023-04-21 16:39:01:728] [WARNING ] [ cluster-nodes ] ClConSQLRepository.removeItem() : Problem removing elements from DB:
java.sql.SQLException: The MySQL server is running with the --read-only option so it cannot execute this statement
After the networkaddress.cache.ttl expires, it should start resolving the correct DNS entry for the database and stop logging errors, but I left it for more than 10 minutes and nothing changed; it kept logging the error.
Is it possible that there are other caches, beyond networkaddress.cache.ttl, for the DNS resolution of the database address in Tigase?
I have also found this article: https://aws.amazon.com/blogs/database/improve-application-availability-on-amazon-aurora/
which explains exactly this problem and is written by AWS developers. As described in the article:
"the client kept trying to connect to the old writer, including a couple of seconds after that writer has already been demoted to a reader."
It is an issue related to AWS Aurora, and the solution is to decrease the DNS TTL, use RDS Proxy, or maybe use the AWS MySQL drivers.
Anyway, as written in the previous comment, I do not understand why Tigase, after the networkaddress.cache.ttl expires, does not start resolving the correct DNS address.
After the networkaddress.cache.ttl expires, it should start resolving the correct DNS entry for the database and stop logging errors, but I left it for more than 10 minutes and nothing changed; it kept logging the error.
Is it possible that there are other caches, beyond networkaddress.cache.ttl, for the DNS resolution of the database address in Tigase?
Unfortunately not. This exception is not a connection exception but merely an execution exception, so Tigase successfully connected (at the TCP level) to the instance (albeit the wrong one) and keeps that connection. So unless the connection is terminated one way or another (by a restart, in this case) it will keep those connections in the pool.
So, adding handling for this situation and recreating the connection pool would resolve it.
But I'd argue that it's still a bug on the AWS side, and they should update the failover procedure so that, the moment it happens, the endpoint already points to the correct instance.
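A possible shape for such handling, as a heavily hedged sketch (a hypothetical helper, not existing Tigase code): detect the MySQL "read-only" execution error (error code 1290, as seen in the shell output above) and discard the affected pooled connection, so that reopening it forces a fresh DNS lookup of the cluster endpoint:

import java.sql.Connection;
import java.sql.SQLException;

public class ReadOnlyFailoverDetector {

    // MySQL error code for "The MySQL server is running with the --read-only
    // option so it cannot execute this statement" (ERROR 1290 in the tests above).
    private static final int ER_OPTION_PREVENTS_STATEMENT = 1290;

    // True when the statement failed because we are talking to a demoted,
    // read-only ex-writer rather than because of a broken connection.
    static boolean isStaleWriter(SQLException ex) {
        return ex.getErrorCode() == ER_OPTION_PREVENTS_STATEMENT;
    }

    // Hypothetical usage: close the stale connection so that the pool has to
    // open a new one, re-resolving the cluster endpoint in the process.
    static void discardIfStaleWriter(Connection conn, SQLException ex) throws SQLException {
        if (isStaleWriter(ex)) {
            conn.close();
        }
    }
}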
EDIT: sorry, missed your other reply.
It is an issue related to AWS Aurora, and the solution is to decrease the DNS TTL, use RDS Proxy, or maybe use the AWS MySQL drivers.
Anyway, as written in the previous comment, I do not understand why Tigase, after the networkaddress.cache.ttl expires, does not start resolving the correct DNS address.
Tigase does not keep re-resolving the old DNS address - when the failover happens and the connection is terminated, Tigase reconnects to the database at once (thus, within the 5-second window), which results in the old DNS entry being used. After that there is no DNS resolution, because there is no further reconnection attempt.
We use the official MySQL JDBC driver, though AWS seems to suggest using the (forked) MariaDB JDBC driver, which supposedly has support for this use case.
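A small diagnostic sketch (hypothetical helper names) illustrating the first point above: the endpoint host name is only resolved when a brand-new JDBC connection is opened, so connections that are already established in the pool never see the DNS update:

import java.net.InetAddress;
import java.sql.Connection;
import java.sql.DriverManager;

public class EndpointResolutionCheck {
    // Logs what the cluster endpoint resolves to right before opening a new
    // connection; once the connection exists, DNS is never consulted again.
    static Connection connectLoggingEndpoint(String host, String jdbcUri,
                                             String user, String password) throws Exception {
        System.out.println(host + " -> " + InetAddress.getByName(host).getHostAddress());
        return DriverManager.getConnection(jdbcUri, user, password);
    }
}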
This exception is not a connection exception but merely an execution exception, so Tigase successfully connected (at the TCP level) to the instance (albeit the wrong one) and keeps that connection.
got it
But I'd argue that it's still a bug on the AWS side, and they should update the failover procedure so that, the moment it happens, the endpoint already points to the correct instance.
I wrote a comment on their blog https://aws.amazon.com/blogs/database/improve-application-availability-on-amazon-aurora/ - upvote, or maybe comment yourself if you like.
The AWS developers have given an official response in the comments of https://aws.amazon.com/blogs/database/improve-application-availability-on-amazon-aurora/
TL;DR -> they are not going to make any changes, so it must be handled on the Tigase side.
Thanks for the update. I'm not really thrilled by the AWS response, and I'm not sure it's entirely accurate... They could simply avoid making the instance available again after the failover before updating the DNS; they even highlight it in their example:
ERROR 2003 (HY000): Can't connect to MySQL server on 'aurora-test.cluster-xxx.us-west-2.rds.amazonaws.com:3306' (111)
ERROR 2003 (HY000): Can't connect to MySQL server on 'aurora-test.cluster-xxx.us-west-2.rds.amazonaws.com:3306' (111)
2022-12-09 18:57:12 reader
2022-12-09 18:57:13 reader
2022-12-09 18:57:14 writer
They could simply wait until the DNS is updated and then bring both instances up...
Or they could even use their RDS Aurora proxy internally; that would solve the issue and increase HA...
At any rate, we will probably add handling for this use case, but there is no ETA.
That is exactly what I commented: wait until the DNS is updated and then bring both instances up. But I guess that by doing it this way, they would introduce a longer downtime for those who use smart drivers; I think that is the reason why they have done it this way. For sure, they could use the RDS Aurora proxy internally, but I guess they prefer that customers pay for it :).
I do not think they will change it in the future; as pointed out in the comments, the only solutions to this problem are:
- Use a smart driver
- Use RDS Proxy
- Explicitly handle the MySQL "read-only" exception
So, without modifying Tigase, the only external solution is to use RDS Proxy.
As I said - we will probably add support for that, but in your case you can use the suggested JDBC driver to mitigate the issue - either by swapping the one in the tarball distribution package or by modifying the Docker image (that will require your own build, though).
OK, thanks for the feedback, I will try to swap the JDBC driver.
I have tried to swap the driver, but I get:
[2023-05-30 07:27:47:000] [SEVERE ] [ForkJoinPool-1-worker-23 ] DataRepositoryImpl.initialize(): null
java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:398)
at tigase.db.jdbc.DataRepositoryImpl.initialize(DataRepositoryImpl.java:286)
at tigase.db.beans.DataSourceBean$DataSourceMDConfigBean.initRepository(DataSourceBean.java:322)
at tigase.db.beans.DataSourceBean$DataSourceMDConfigBean.initRepository(DataSourceBean.java:248)
at tigase.db.beans.MDPoolConfigBean.lambda$setInstances$0(MDPoolConfigBean.java:166)
at java.base/java.util.concurrent.ForkJoinTask$AdaptedCallable.exec(ForkJoinTask.java:1448)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
From AWS:
"Use the driver name: software.aws.rds.jdbc.mysql.Driver. You'll need the driver name when loading the driver explicitly to the driver manager."
How can I override database.driverClassName from the startup command line or config.tdsl?
Unfortunately, it turns out we do have the driver classes hardcoded, so swapping won't work. I filed a ticket to remove that in the next version.
Hi, I'm trying to fix this problem using AWS RDS Proxy, and it seems to work well. The only problem I have is that the proxy forces all idle connections to be closed after 24 hours:
"RDS Proxy closes database connections some time after 24 hours when they are no longer in use. The proxy performs this action regardless of the value of the maximum idle connections setting."
After that I get some errors in tigase-console.log, like com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
or com.mysql.cj.exceptions.StatementIsClosedException: No operations allowed after statement closed,
and some operations fail.
In the Tigase configuration, in the dataSource -> default section, can I use
'conn-valid-query' = 'select 1'
or is it reserved only for authRepository?
Do you have any idea how I can keep the MySQL connections alive with a ping or something similar? I saw the autoReconnectForPools parameter in the driver; reading on Stack Overflow, it seems to be what I need, but it is not well documented by MySQL...
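For reference, the configuration being asked about would look roughly like the config.tdsl sketch below; whether 'conn-valid-query' is honored for the default data source, and whether autoReconnectForPools actually helps, are exactly the open questions here (host name and credentials are placeholders):

dataSource {
    default () {
        uri = 'jdbc:mysql://clusterendpoint.eu-west-1.rds.amazonaws.com:3306/tigasedb?user=tigase&password=password&autoReconnectForPools=true'
        'pool-size' = '12'
        'conn-valid-query' = 'select 1'
    }
}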