akka.net icon indicating copy to clipboard operation
akka.net copied to clipboard

Upgrade from Akka.net 1.3.18 -> 1.4.7 breaks something with regards to Transactions

Open thomaslazar opened this issue 5 years ago • 5 comments

Our current project runs with Akka.net 1.3.8 and assorted packages like persistence, DI, etc.. Our projects are currently .NET Framework 4.8 and run under Windows.

We upgraded all the packages to 1.4.5 (later to 1.4.7) and all hell broke loose. I can't really narrow it down that much but I will try to describe the problems we had.

We are using Akka.Persistence with SQL Server to have some persistence and at-least-once delivery actors going. Everything worked fine here with 1.3.18. After upgrading the libraries we started to get "Transactions aborted" and various other Transaction related messages. We use the BatchingSqlServerJournal and debugging into that didn't tell me much, but it died, when it tried to commit the batch transaction.

Tried going back to the SqlServerjournal class but I couldn't manage to get it to run, but I think that was mostly my fault.

In the end I replaced the whole SQL Server journal config with inmem. Then I started getting problems with transactions on the snapshot-store end. When I replaced that with inmem I started getting problems with an actor that uses .NET Transactions to wrap some code that should complete fully or not at all. I got various transaction related problems there when the .NET Transaction tried to commit.

I am currently on a rather tight schedule so I can't really go after the problems myself. My "solution" ATM is to roll everything back to 1.3.18.

thomaslazar avatar Jun 02 '20 06:06 thomaslazar

Hi @thomaslazar - do you have any error logs you can share? We'd love to look into this.

Aaronontheweb avatar Jun 03 '20 15:06 Aaronontheweb

This is a stack trace for a persistence actor that tried to commit stuff to the journal. This happened inside the ExecuteChunk method of the BatchingSqlJournal class. When it tries to commit the transaction. https://github.com/akkadotnet/akka.net/blob/48ce8e923e7add308623ada77a279d848ce05870/src/contrib/persistence/Akka.Persistence.Sql.Common/Journal/BatchingSqlJournal.cs#L917

Logger: Akka.Persistence.SqlServer.Journal.BatchingSqlServerJournal

System.ObjectDisposedException Cannot access a disposed object. Object name: 'Transaction'. System.Transactions.Oletx.OletxTransaction ConvertToOletxTransaction(System.Transactions.Transaction) at System.Transactions.TransactionInterop.ConvertToOletxTransaction(Transaction transaction) at System.Transactions.TransactionInterop.GetExportCookie(Transaction transaction, Byte[] whereabouts) at System.Data.SqlClient.SqlInternalConnection.GetTransactionCookie(Transaction transaction, Byte[] whereAbouts) at System.Data.SqlClient.SqlInternalConnection.EnlistNonNull(Transaction tx) at System.Data.SqlClient.SqlInternalConnection.Enlist(Transaction tx) at System.Data.ProviderBase.DbConnectionInternal.ActivateConnection(Transaction transaction) at System.Data.ProviderBase.DbConnectionPool.PrepareConnection(DbConnection owningObject, DbConnectionInternal obj, Transaction transaction) at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection) at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection) at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection) at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions) at System.Data.SqlClient.SqlConnection.TryOpenInner(TaskCompletionSource`1 retry) at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry) at System.Data.SqlClient.SqlConnection.OpenAsync(CancellationToken cancellationToken) --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Akka.Persistence.Sql.Common.Journal.BatchingSqlJournal`2.<ExecuteChunk>d__70.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Akka.Util.Internal.AtomicState.<CallThrough>d__7`1.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at Akka.Util.Internal.AtomicState.<CallThrough>d__7`1.MoveNext()

Another exception that happened during journaling.

System.Transactions.TransactionException The operation is not valid for the state of the transaction. System.Transactions.TransactionStatus get_Status(System.Transactions.InternalTransaction) at System.Transactions.TransactionStatePSPEOperation.get_Status(InternalTransaction tx) at System.Transactions.TransactionInformation.get_Status() at System.Data.SqlClient.SqlInternalConnectionTds.CheckEnlistedTransactionBinding() at System.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest) at System.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest) at System.Data.SqlClient.SqlInternalTransaction.Commit() at System.Data.SqlClient.SqlTransaction.Commit() at Akka.Persistence.Sql.Common.Journal.AbstractQueryExecutor.<InsertBatchAsync>d__51.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Akka.Persistence.Sql.Common.Journal.SqlJournal.<>c__DisplayClass30_0.<<WriteMessagesAsync>b__0>d.MoveNext()

And another one with journal as a problem.

System.InvalidOperationException The transaction associated with the current connection has completed but has not been disposed. The transaction must be disposed before the connection can be used to execute SQL statements. Void CheckEnlistedTransactionBinding() at System.Data.SqlClient.SqlInternalConnectionTds.CheckEnlistedTransactionBinding() at System.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest) at System.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest) at System.Data.SqlClient.SqlInternalTransaction.Commit() at System.Data.SqlClient.SqlTransaction.Commit() at Akka.Persistence.Sql.Common.Journal.AbstractQueryExecutor.<InsertBatchAsync>d__51.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Akka.Persistence.Sql.Common.Journal.SqlJournal.<>c__DisplayClass30_0.<<WriteMessagesAsync>b__0>d.MoveNext()

And another one.

System.Data.SqlClient.SqlException New request is not allowed to start because it should come with valid transaction descriptor. Void OnError(System.Data.SqlClient.SqlException, Boolean, System.Action`1[System.Action]) at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction) at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose) at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady) at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj) at System.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest) at System.Data.SqlClient.SqlInternalConnectionTds.PropagateTransactionCookie(Byte[] cookie) at System.Data.SqlClient.SqlInternalConnection.EnlistNull() at System.Data.SqlClient.SqlInternalConnectionTds.Activate(Transaction transaction) at System.Data.ProviderBase.DbConnectionInternal.ActivateConnection(Transaction transaction) at System.Data.ProviderBase.DbConnectionPool.PrepareConnection(DbConnection owningObject, DbConnectionInternal obj, Transaction transaction) at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection) at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection) at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection) at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions) at System.Data.SqlClient.SqlConnection.TryOpenInner(TaskCompletionSource`1 retry) at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry) at System.Data.SqlClient.SqlConnection.OpenAsync(CancellationToken cancellationToken) --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Akka.Persistence.Sql.Common.Journal.SqlJournal.<>c__DisplayClass30_0.<<WriteMessagesAsync>b__0>d.MoveNext()

This is an exception that happened after i replaced the journal and snapshot to inmem for actors that happen to use .NET Transactions inside the Receive method.

Ok. That's not 100% true. The actor called an async method that wrapped some calls inside a .NET Transaction. And the Actor was using a ContinueWith on the result. But still the Transaction theme persists.

System.Transactions.TransactionAbortedException The transaction has aborted. Void EndCommit(System.Transactions.InternalTransaction) at System.Transactions.TransactionStateAborted.EndCommit(InternalTransaction tx) at System.Transactions.CommittableTransaction.Commit() at System.Transactions.TransactionScope.InternalDispose() at System.Transactions.TransactionScope.Dispose() at
System.ObjectDisposedException Cannot access a disposed object. Object name: 'Transaction'. System.Transactions.IsolationLevel get_IsolationLevel() at System.Transactions.Transaction.get_IsolationLevel() at System.Transactions.TransactionScope..ctor(TransactionScopeOption scopeOption, TransactionOptions transactionOptions, TransactionScopeAsyncFlowOption asyncFlowOption) at 

This is all i can pull out of my elasticsearch instance.

Sadly I cannot provide a trace for the snapshot problems because i was in full on debug mode by that time and the exceptions weren't sent off to ES anymore.

I hope this helps.

I'm away next week so I will not be able to answer to further questions.

thomaslazar avatar Jun 05 '20 07:06 thomaslazar

@Aaronontheweb I can confirm this is also happening on a project I'm dealing with.

I can say that when this happened on our system, -sometimes- switching back to the Non-batching journal would work. Other times we had to clear some of the data out of our sharding tables to start back up.

I'll have to see if I can recreate a minimal repro, but for now can say we saw this behavior during some heavy stress tests of a Sharding setup. tl;dr- Spraying 1000 remote messages each to less than 100 AtLeastOnceDelivery actors should trigger the problem. We aren't using Transactions anywhere in our application code (at least that I can recall).

It may be worth noting that during this timeframe we were also seeing things like delayed heartbeat warnings (not unsurprising given we were flooding the system intentionally.) I'm thinking overall load may be a factor in triggering the problem.

One piece of digging I did do made me notice that SqlTransaction does -not- appear to have behavior that would cause a Rollback on Dispose if it was not already called; I'm not sure if that's causing connections to be stuck in a Zombie state. I don't know if I'll have time, but I am thinking of trying to make sure the transactions are explicitly rolled back if there's any exceptions.

to11mtm avatar Jun 11 '20 13:06 to11mtm

@thomaslazar @to11mtm I'll mark this issue as critical and ask one of our team members to take a look at it early next week. We have another big project that we're committed to right now but I'll make sure this gets some attention from us unless someone else submits a PR for it first.

Aaronontheweb avatar Jun 11 '20 13:06 Aaronontheweb

I've managed to make a Repro but it definitely requires a bit of cleanup and it's messy.

What I'm seeing is a pattern where under heavy writes a command within the batch transaction is killed due to lock contention, which I -think- is then causing the whole transaction to be invalidated.

I know that in Linq2Db SQL Server's transient errors can be detected and used alongside a retry policy.

The challenge in this case is that if I'm understanding correctly we have to rerun the entire batch.

That said, I'm not sure if what I'm observing is the same problem as originally in this issue or not...

> [WARNING][6/21/2020 5:54:47 AM][Thread 0018][akka://test/user/$a/$I] Failed to DeleteMessages ToSequenceNr [110] for PersistenceId [e55ecd87-b43e-4170-8fd3-0395ac8a9336] due to: [System.Data.SqlClient.SqlException (0x80131904): Transaction (Process ID 62) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.CompleteAsyncExecuteReader()
   at System.Data.SqlClient.SqlCommand.EndExecuteNonQueryInternal(IAsyncResult asyncResult)
   at System.Data.SqlClient.SqlCommand.EndExecuteNonQuery(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at Akka.Persistence.Sql.Common.Journal.BatchingSqlJournal`2.HandleDeleteMessagesTo(DeleteMessagesTo req, TCommand command) in C:\github\akkadotnet\src\contrib\persistence\Akka.Persistence.Sql.Common\Journal\BatchingSqlJournal.cs:line 965
ClientConnectionId:739ee2e2-3b0e-4979-b0e2-42ca21ab0568
Error Number:1205,State:51,Class:13: Transaction (Process ID 62) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.]
[ERROR][6/21/2020 5:54:47 AM][Thread 0024][akka://test/system/akka.persistence.journal.sql-server] An error occurred during event batch processing (chunkId: 0)
Cause: System.InvalidOperationException: This SqlTransaction has completed; it is no longer usable.
   at System.Data.SqlClient.SqlTransaction.ZombieCheck()
   at System.Data.SqlClient.SqlTransaction.Commit()
   at Akka.Persistence.Sql.Common.Journal.BatchingSqlJournal`2.ExecuteChunk(RequestChunk chunk, IActorContext context) in C:\github\akkadotnet\src\contrib\persistence\Akka.Persistence.Sql.Common\Journal\BatchingSqlJournal.cs:line 917

to11mtm avatar Jun 22 '20 01:06 to11mtm