aws-mysql-jdbc
aws-mysql-jdbc copied to clipboard
JTA Transactions fail with XAER_RMFAIL
Describe the bug
We have the issue that some of our JTA transactions fail with the Exception below.
The same transactions work nicely with the vanilla MySQL driver (8.0.29+).
This happened with v1.1.0.
Caused by: java.sql.SQLException: XAER_RMFAIL: The command cannot be executed when global transaction is in the PREPARED state
at com.awssql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:129)
at com.awssql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
at com.awssql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.StatementImpl.executeQuery(StatementImpl.java:1201)
at com.awssql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.AuroraTopologyService.queryForTopology(AuroraTopologyService.java:231)
at com.awssql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.AuroraTopologyService.getTopology(AuroraTopologyService.java:200)
at com.awssql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.FailoverConnectionPlugin.updateTopologyAndConnectIfNeeded(FailoverConnectionPlugin.java:714)
at com.awssql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.FailoverConnectionPlugin.execute(FailoverConnectionPlugin.java:247)
at com.awssql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.ConnectionPluginManager.execute(ConnectionPluginManager.java:138)
at com.awssql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.ConnectionProxy$JdbcInterfaceProxy.invoke(ConnectionProxy.java:320)
at com.awssql//com.sun.proxy.$Proxy265.getResultSet(Unknown Source)
at com.awssql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.MysqlXAConnection.dispatchCommand(MysqlXAConnection.java:325)
... 45 more
Expected Behavior
JTA transactions work the same as with the latest vanilla MySQL drivers.
Current Behavior
Transaction rollback
Reproduction Steps
This happens in specific scenarios with no trivial way to recreate this.
Possible Solution
I suppose this may be related to the rather old MySQL base version used for the AWS Driver.
Additional Information/Context
No response
The AWS JDBC Driver for MySQL version used
1.1.0
JDK version used
JDK 11
Operating System and version
AL2 (hosted in EKS)
i can confirm that we see the same exceptions in our project. we've had them for at least a month, but usually less than 5 or so per day. but starting with October 6th (which coincides with increased traffic on our system) we saw a steady incline to up to 100 per day, which starts to be concerning. so far we tried to set the jdbc connection URL parameter pinGlobalTxToPhysicalConnection to true, without success. The exceptions also occur "all over the place" and there seems to be no obvious commonality between them, at least none that we could figure out so far.
I realize this is probably not very helpful information to get to the source of the problem, but nonetheless i wanted to confirm that it's not just a "single customer problem". we're open to providing additional information if needed/helpful. let us know how we can support the effort.
Hello @flauschie, @mariokoehler
Thank you for raising the issue. We understand this issue is causing inconveniences in your workflows. We'll be investigating this issue and will update you when we have a release. We appreciate your patience.
FWIW, our JTA transactions can contain a fairly big payload with megabytes of data and intermediate updates to the same data during the transaction. Simple JTA transactions don't seem to be affected.
Maybe it's related or not, but we also had JTA issues with the RDS proxy (which we had to abandon). If the RDS proxy also uses this driver, it could have the same root cause.
in our case it's much simpler, the only reason we use the XA version of the datasource at all is because we do some of our "less important" queries on the read-only instance of the Aurora RDS cluster (to take some load off of the main instance). So we have two datasources defined in our persistence.xml and effectively have to EntityManagers, one associated with the main instance and one with the readonly instance. And sometimes we use both of them in the same transaction, thats why we need the XA. But we never really have a "typical" XA-scenario, so for example we don't have write-operations that span more than one EntityManager/Database.
FWIW, for complicated reasons, our JTA transactions can span across 4 datasources.
by further trial-and-error we discovered what seems to be a correlation to the connection pool size. by increasing the pool size we were able to reduce the frequency of the exceptions. we also modified the way the connection pool manages connections, so that it doesn't hold a constant size pool of connections but instead only has a steady size of 10 and increases/decreases this as needed. that was done so that we can better monitor the actual usage (%) of connections via the RDS cloudwatch metrics. Where before the metric always showed a flat line of the max pool size, we can now see the actual usage (given it's above 10 connections). By doing this we noticed that the exceptions seem to coincide with peaks of high connection usage. whenever the pool is close to being exhausted (maybe even exhausted, it's hard to tell because of the metrics resolution) the exceptions keep coming. when the pool is only moderately busy we can have several hours without any exceptions.
We always had different min/max settings for our pools. I don't think that the max pool setting was ever hit during our regression testing.
FWIW, we are using Wildfly 26 (Narayana) as the JTA engine.
yesterday we switched back to the "vanilla" driver and we're free of exceptions ever since. so it really seems to be related to the driver itself.
Thank you all for raising this issue. We are currently working on a fix for this issue and will post again once the fix is merged. Thank you for your patience.
The fix has been merged, can you please test out our 1.1.2-Snapshot build here and let us know if the issue persists. Thank you.
Closing this issue since this has been addressed in #297. Please re-open if this is still a problem.
@hsuamz
The issue has not been solved. After switching from the native MySQL driver to the AWS version (1.1.4) we are still getting the same XAER_RMFAIL exception in Wildfly 26 (Narayana JTA):
ARJUNA016045: attempted rollback of < formatId=131077, gtrid_length=44, bqual_length=36, tx_uid=0:ffffac1084d4:-23b1904f:63f2ccd7:17cad, node_name=86c98bcdd6-kwq2f, branch_uid=0:ffffac1084d4:-23b1904f:63f2ccd7:17cb2, subordinatenodename=null, eis_name=java:jboss/datasources/xxxxxxx > (XAResourceWrapperImpl@858080e[xaResource=org.jboss.jca.adapters.jdbc.xa.XAManagedConnection@50d575c7 pad=false overrideRmValue=false productName=MySQL productVersion=8.0.23 jndiName=java:jboss/datasources/xxxx]) failed with exception code XAException.XAER_RMFAIL: software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.MysqlXAException: Undetermined error occurred in the underlying Connection - check your data for consistency
Caused by: java.sql.SQLException: XAER_RMFAIL: The command cannot be executed when global transaction is in the IDLE state
ARJUNA016056: TransactionImple.delistResource - caught exception during delist : XAException.XAER_RMFAIL: software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.MysqlXAException: XAER_RMFAIL: The command cannot be executed when global transaction is in the IDLE state
ARJUNA016138: Failed to enlist XA resource XAResourceWrapperImpl@7ec9e8f9[xaResource=org.jboss.jca.adapters.jdbc.xa.XAManagedConnection@512d69a1 pad=false overrideRmValue=false productName=MySQL productVersion=8.0.23 jndiName=java:jboss/datasources/xxxx]: java.lang.IllegalStateException: XAER_RMFAIL: The command cannot be executed when global transaction is in the PREPARED state
Caused by: java.sql.SQLException: XAER_RMFAIL: The command cannot be executed when global transaction is in the PREPARED state
Driver setting:
<driver name="aws" module="com.mysql">
<driver-class>software.aws.rds.jdbc.mysql.Driver</driver-class>
<xa-datasource-class>software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.MysqlXADataSource</xa-datasource-class>
</driver>
Hi @flauschie, we are looking into this issue.
To help us with our investigation, are you able to provide us with the following?
- a full stacktrace of the error
- driver logs (see this on how to enable driver logs)
Are you seeing the issue with the same frequency as before?
Thank you.
Hi @karenc-bq
Here's a broader stacktrace - the lines with an empty {} are omitted lines (our own code): exception.txt
Sorry for the format, but these exceptions are breaking the logging format of the log service ;-)
As for the driver trace ... this is difficult at the moment. The issue only occurs during integration testing, and I can't run the related test locally. Our QA is currently busy with a release. Maybe in 1-2 weeks.
I can't really remember the error frequency during our initial testing (based on 1.1.0). We don't keep logs that long. Probably similar.
The exception doesn't occur during simple tests - only with some substantial load and possibly competition.
However, the original/Oracle MySQL driver doesn't show these JTA-related errors - and our code can recover from "normal" PersistenceExceptions due to competition.
FWIW, the driver produced 8 errors during a 45min test run with thousands of transactions.
Generally: we have 4 MySQL data sources that could, in theory, take part in the JTA transaction. However, I don't think that this error occurred with more than 2 of them involved.
Hi @flauschie, we are still unable to reproduce the issue, however we recently merged fix #364 that may impact this issue. Is it possible to run your integration tests with our snapshot build? Thank you.
Closing this issue due to inactivity and recent changes that went into the driver that addresses the issue. If the issue is still present, please provide @karenc-bq with the information above. Thank you.
Sorry to say but the issue has still not been fixed, including driver version 1.1.7 on Aurora Serverless v2 MySQL:
ARJUNA016041: prepare on < formatId=131077, gtrid_length=44, bqual_length=36, tx_uid=0:ffffac107cdd:-24ffcc5f:649e3775:1becf, node_name=6d58d8745b-jjrzr, branch_uid=0:ffffac107cdd:-24ffcc5f:649e3775:1bed4, subordinatenodename=null, eis_name=java:jboss/datasources/.... > (XAResourceWrapperImpl@37fce128[xaResource=org.jboss.jca.adapters.jdbc.xa.XAManagedConnection@5c31c310 pad=false overrideRmValue=false productName=MySQL productVersion=8.0.23 jndiName=java:jboss/datasources/...]) failed with exception XAException.XAER_RMFAIL: software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.MysqlXAException: XAER_RMFAIL: The command cannot be executed when global transaction is in the IDLE state
Once again - this doesn't happen with the vanilla MySQL driver 8.xx
FWIW, I can also see this here in the logs: WFLYJCA0005: Deploying non-JDBC-compliant driver class software.aws.rds.jdbc.mysql.Driver
This is on Wildfly 26.1.3
XA Driver config:
<drivers>
<driver name="aws" module="com.mysql">
<driver-class>software.aws.rds.jdbc.mysql.Driver</driver-class>
<xa-datasource-class>software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.MysqlXADataSource</xa-datasource-class>
</driver>
</drivers>
ARJUNA012075: Action Aborting
ARJUNA012073: BasicAction.End() - prepare phase of action-id 0:ffffac107cdd:-24ffcc5f:649e3775:1becf failed.
.....
at com.mysql//com.sun.proxy.$Proxy260.execute(Unknown Source)
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.ConnectionProxy$JdbcInterfaceProxy.invoke(ConnectionProxy.java:345)
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.ConnectionPluginManager.execute(ConnectionPluginManager.java:138)
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.FailoverConnectionPlugin.execute(FailoverConnectionPlugin.java:274)
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.FailoverConnectionPlugin.updateTopologyIfNeeded(FailoverConnectionPlugin.java:735)
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.AuroraTopologyService.getTopology(AuroraTopologyService.java:192)
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.AuroraTopologyService.queryForTopology(AuroraTopologyService.java:219) at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.StatementImpl.executeQuery(StatementImpl.java:1200)
It's the same with 1.1.8
Caused by: java.sql.SQLException: XAER_RMFAIL: The command cannot be executed when global transaction is in the IDLE state
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:129)
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.StatementImpl.executeQuery(StatementImpl.java:1200)
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.AuroraTopologyService.queryForTopology(AuroraTopologyService.java:216)
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.AuroraTopologyService.getTopology(AuroraTopologyService.java:189)
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.FailoverConnectionPlugin.updateTopologyIfNeeded(FailoverConnectionPlugin.java:735)
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.FailoverConnectionPlugin.execute(FailoverConnectionPlugin.java:275)
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.ConnectionPluginManager.execute(ConnectionPluginManager.java:138)
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.ConnectionProxy$JdbcInterfaceProxy.invoke(ConnectionProxy.java:352)
at com.mysql//com.sun.proxy.$Proxy233.execute(Unknown Source)
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.MysqlXAConnection.dispatchCommand(MysqlXAConnection.java:328)
It seems as if this is only happening under high load ( and XA data source).
In most test cases, the transactions work just fine.
Hi @flauschie, thank you for the update. We merged a fix in (#432), could you please try out our snapshot build and let us know if the issue persists? Thank you.
Hi @karenc-bq
Yes, it seems that this latest fix (aws-mysql-jdbc-1.1.9-20230710.234101-4.jar) is working - at least during a 1h test campaign. Thx!
BTW, I don't know whether this is related - a while back, we also had XA issues with the RDS Proxy (with no fix from the AWS support at the time) - preventing us from going ahead with it. Is this change also going to be incorporated into the RDS Proxy? The symptoms were somewhat different, though.
BTW, I don't know whether this is related - a while back, we also had XA issues with the RDS Proxy (with no fix from the AWS support at the time) - preventing us from going ahead with it. Is this change also going to be incorporated into the RDS Proxy? The symptoms were somewhat different, though.
RDS Proxy does not use this driver. I doubt the problem is the same problem, thanks!
Hi @flauschie,
Version 1.1.9 of the AWS JDBC Driver for MySQL has been released with the fix for this issue. We'll be closing this issue now. If you have any questions or concerns please don't hesitate to create a new issue.