drift icon indicating copy to clipboard operation
drift copied to clipboard

Transactions randomly lock / freeze

Open Sorunome opened this issue 5 years ago • 11 comments

It seems that transactions randomly lock / freeze.

Unfortunately I do not have a minimal-case code scenario, always when trying to construct one the transaction locks were working as expected. The only case i found is very, very deep with a very specific scenario within our app...

So, I wanted to ask where in the code transaction locks are happening, so that i can debug them more easily (and perhaps submit a PR!)

Sorunome avatar Sep 04 '20 11:09 Sorunome

Thanks for reporting this and for offering to help!

There are independent transaction locks in the VmDatabase and MoorIsolate implementation. So if you're using MoorIsolate, checking whether the deadlock also happens with a VmDatabase directly might be helpful to track this down.

For VmDatabase:

  • we use package:synchronized with the main lock being applied here (isSequential is true for VmDatabase): https://github.com/simolus3/moor/blob/5cbc331ddaf7a3031d98a905c2e05080d9fbb2c8/moor/lib/src/runtime/executor/helpers/engines.dart#L20-L27
  • transactions apply that lock here: https://github.com/simolus3/moor/blob/5cbc331ddaf7a3031d98a905c2e05080d9fbb2c8/moor/lib/src/runtime/executor/helpers/engines.dart#L141-L154 Basically, they lock the whole database until the transaction is done. You could add logs to send and rollback to verify that _sendCalled does complete when desired.

For MoorIsolate (implementation here), we keep a map of <int, QueryExecutor> to identify query executors by an incrementing id. We have one active id at the time and a list of executors waiting in a backlog. Whenever a client wants to do anything, it also passes the executor id. The server will delay the request until it's that executor's turn (in _waitForTurn). You can also enable isolateDebugLog: true on MoorIsolate.connect to print messages sent to the isolate, maybe that helps to see where this is happening.

simolus3 avatar Sep 04 '20 12:09 simolus3

We are indeed using isolates, thank you for explaining in-depth how it works, I'll get back after some more debugging!

Sorunome avatar Sep 04 '20 12:09 Sorunome

Alright, debugging a bit more: The code works fine with VmDatabase, no lockups happen. So, the issue must be within the isolate code.

Now, looking at the output of isolateDebug it seems the following is relevant:

I/flutter ( 8003): [OUT]: request (id = 1551): _StatementMethod.select: SELECT * FROM events WHERE client_id = :client_id AND event_id = :event_id AND room_id = :room_id with [-snip-] (@180)
I/flutter ( 8003): [OUT]: request (id = 1552): EnsureOpen(6, null)
I/flutter ( 8003): [IN]: response (id = 1552): true
I/flutter ( 8003): [OUT]: request (id = 1553): RunTransactionAction(_TransactionControl.begin, null)
I/flutter ( 8003): [IN]: response (id = 1553): 181
I/flutter ( 8003): [OUT]: request (id = 1554): _StatementMethod.deleteOrUpdate: UPDATE rooms SET membership = ?, highlight_count = ?, notification_count = ? WHERE room_id = ? AND client_id = ?; with [-snip-] (@181)
I/flutter ( 8003): [IN]: response (id = 1554): 1
I/flutter ( 8003): [OUT]: request (id = 1555): _StatementMethod.deleteOrUpdate: UPDATE inbound_group_sessions SET indexes = :indexes WHERE client_id = :client_id AND room_id = :room_id AND session_id = :session_id with [-snip-] (@181)
I/flutter ( 8003): [OUT]: request (id = 1556): _StatementMethod.insert: INSERT OR REPLACE INTO events (client_id, event_id, room_id, sort_order, origin_server_ts, sender, type, unsigned, content, prev_content, state_key, status) VALUES (:client_id, :event_id, :room_id, :sort_order, :origin_server_ts, :sender, :type, :unsigned, :content, :prev_content, :state_key, :status) with [-snip-] (@181)
I/flutter ( 8003): [IN]: response (id = 1555): 1
I/flutter ( 8003): [IN]: response (id = 1556): 31864
I/flutter ( 8003): [OUT]: request (id = 1557): _StatementMethod.deleteOrUpdate: DELETE FROM events WHERE client_id = :client_id AND event_id = :event_id AND room_id = :room_id with [-snip-] (@181)
I/flutter ( 8003): [IN]: response (id = 1557): 1

We can see that the top request, the one with the id 1551, never got a reply. It also does not get a reply further down the logs. The logs are very noisy due to the app being a chat app and the account where this appears being rather large, so lots of database stuff is happening.

So, from this, it seems that the issue is in the isolate server, as it never replies.

Any additional pointers would be great!

The actual contents of the variables have been -snip-'d

Sorunome avatar Sep 06 '20 09:09 Sorunome

Alright, it appears that here https://github.com/simolus3/moor/blob/master/moor/lib/src/runtime/isolate/server.dart#L169

The server goes into _waitForTurn with a non-null transaction id, but the _executorBacklog array is empty and thus doesn't contain the value at all.

Of course a quickfix would be to add to idIsActive an || _executorBacklog.isEmpty, but that might collide with null transactions ID and only masks the actual problem

Sorunome avatar Sep 06 '20 09:09 Sorunome

I noticed that the same executor id is used for multiple queries at once, and, the problem here is, that it is still used after _releaseExecutor is called.

If the executor id is supposed to be the same throughout one transaction - it seems the cleanup does not happen properly, it is way longer lived than just a single transaction

Sorunome avatar Sep 06 '20 09:09 Sorunome

[...] Of course a quickfix would be to add to idIsActive an || _executorBacklog.isEmpty, but that might collide with null transactions ID and only masks the actual problem

If you actually do that you get

I/flutter ( 9489): [IN]: error response (id = 595): "NoSuchMethodError: The method 'runInsert' was called on null.\nReceiver: null\nTried calling: runInsert(\"INSERT OR REPLACE INTO events (client_id, event_id, room_id, sort_order, origin_server_ts, sender, type, unsigned, content, prev_content, state_key, status) VALUES (:client_id, :event_id, :room_id, :sort_order, :origin_server_ts, :sender, :type, :unsigned, :content, :prev_content, :state_key, :status)\", Instance(length:12) of '_GrowableList')" at #0      Object.noSuchMethod (dart:core-patch/object_patch.dart:51:5)
I/flutter ( 9489): #1      _MoorServer._runQuery (package:moor/src/runtime/isolate/server.dart:90:25)
I/flutter ( 9489): <asynchronous suspension>
I/flutter ( 9489): #2      _MoorServer._handleRequest (package:moor/src/runtime/isolate/server.dart:58:14)
I/flutter ( 9489): #3      IsolateCommunication.setRequestHandler.<anonymous closure> (package:moor/src/runtime/isolate/communication.dart:163:31)
I/flutter ( 9489): #4      _RootZone.runUnaryGuarded (dart:async/zone.dart:1384:10)
I/flutter ( 9489): #5      _BufferingStreamSubscription._sendData (dart:async/stream_impl.dar
E/flutter ( 9489): [ERROR:flutter/lib/ui/ui_dart_state.cc(171)] Unhandled Exception: "NoSuchMethodError: The method 'runInsert' was called on null.\nReceiver: null\nTried calling: runInsert(\"INSERT OR REPLACE INTO events (client_id, event_id, room_id, sort_order, origin_server_ts, sender, type, unsigned, content, prev_content, state_key, status) VALUES (:client_id, :event_id, :room_id, :sort_order, :origin_server_ts, :sender, :type, :unsigned, :content, :prev_content, :state_key, :status)\", Instance(length:12) of '_GrowableList')"
E/flutter ( 9489): #0      Object.noSuchMethod (dart:core-patch/object_patch.dart:51:5)
E/flutter ( 9489): #1      _MoorServer._runQuery (package:moor/src/runtime/isolate/server.dart:90:25)
E/flutter ( 9489): <asynchronous suspension>
E/flutter ( 9489): #2      _MoorServer._handleRequest (package:moor/src/runtime/isolate/server.dart:58:14)
E/flutter ( 9489): #3      IsolateCommunication.setRequestHandler.<anonymous closure> (package:moor/src/runtime/isolate/communication.dart:163:31)
E/flutter ( 9489): #4      _RootZone.runUnaryGuarded (dart:async/zone.dart:1384:10)
E/flutter ( 9489): #5      _BufferingStreamSubscription._sendData (dart:async/stream_impl.dart:357:11)
E/flutter ( 9489): #6      _DelayedData.perform (dart:async/stream_impl.dart:611:14)
E/flutter ( 9489): #7      _StreamImplEvents.handleNext (dart:async/stream_impl.dart:730:11)
E/flutter ( 9489): #8      _PendingEvents.schedule.<anonymous closure> (dart:async/stream_impl.dart:687:7)
E/flutter ( 9489): #9      _microtaskLoop (dart:async/schedule_microtask.dart:41:21)
E/flutter ( 9489): #10     _startMicrotaskLoop (dart:async/schedule_microtask.dart:50:5)
E/flutter ( 9489): #11     _runPendingImmediateCallback (dart:isolate-patch/isolate_patch.dart:118:13)
E/flutter ( 9489): #12     _RawReceivePortImpl._handleMessage (dart:isolate-patch/isolate_patch.dart:169:5)

Because the executor got removed from the map already

Sorunome avatar Sep 06 '20 09:09 Sorunome

I finally figured out a minimal example!

        print('Starting test...');
        await database.transaction(() async {
          print('before select 1');
          await database.customSelect('SELECT 1').get();
          print('after select 1');
          // note the lacking await. It is expected that this runs *after* the outer transaction is done
          database.transaction(() async {
            print('before select 3');
            await database.customSelect('SELECT 1').get();
            print('after select 3');
            await Future.delayed(Duration(seconds: 2));
            print('before select 4');
            await database.customSelect('SELECT 1').get();
            print('after select 4');
          });
          await Future.delayed(Duration(seconds: 1));
          print('before select 2');
          await database.customSelect('SELECT 1').get();
          print('after select 2');
        });

Expected output:

I/flutter (10265): Starting test...
I/flutter (10265): before select 1
I/flutter (10265): after select 1
I/flutter (10265): before select 2
I/flutter (10265): after select 2
I/flutter (10265): before select 3
I/flutter (10265): after select 3
I/flutter (10265): before select 4
I/flutter (10265): after select 4

Actual output (note it never outputs "after select 4"):

I/flutter (10265): Starting test...
I/flutter (10265): before select 1
I/flutter (10265): after select 1
I/flutter (10265): before select 3
I/flutter (10265): after select 3
I/flutter (10265): before select 2
I/flutter (10265): after select 2
I/flutter (10265): before select 4

Output when running with VmDatabase (also not really correct, but at least no lockup):

I/flutter (10605): Starting test...
I/flutter (10605): before select 1
I/flutter (10605): after select 1
I/flutter (10605): before select 3
I/flutter (10605): after select 3
I/flutter (10605): before select 2
I/flutter (10605): after select 2
I/flutter (10605): before select 4
I/flutter (10605): after select 4

Other test code which should be roughly equivalend and, when using isolates, does indeed output the expected output

        print('Starting test...');
        (() async {
          await Future.delayed(Duration(milliseconds: 500));
          await database.transaction(() async {
            print('before select 3');
            await database.customSelect('SELECT 1').get();
            print('after select 3');
            await Future.delayed(Duration(seconds: 2));
            print('before select 4');
            await database.customSelect('SELECT 1').get();
            print('after select 4');
          });
        })();
        await database.transaction(() async {
          print('before select 1');
          await database.customSelect('SELECT 1').get();
          print('after select 1');

          await Future.delayed(Duration(seconds: 1));
          print('before select 2');
          await database.customSelect('SELECT 1').get();
          print('after select 2');
        });

Sorunome avatar Sep 06 '20 10:09 Sorunome

As i thought it probably has to do with transaction-within-transaction i also ran the following test which also locked up:

        print('Starting test...');
        await database.transaction(() async {
          print('before select 1');
          await database.customSelect('SELECT 1').get();
          print('after select 1');
          // note the lacking await. It is expected that this runs *after* the outer transaction is done
          Timer(Duration(milliseconds: 500), () async {
            await database.transaction(() async {
              print('before select 3');
              await database.customSelect('SELECT 1').get();
              print('after select 3');
              await Future.delayed(Duration(seconds: 2));
              print('before select 4');
              await database.customSelect('SELECT 1').get();
              print('after select 4');
            });
          });
          await Future.delayed(Duration(seconds: 1));
          print('before select 2');
          await database.customSelect('SELECT 1').get();
          print('after select 2');
        });

I added an additional suspention here, to try to get the code to the other scenario where it worked. This, however, did not work and moor still freezes up

Sorunome avatar Sep 06 '20 11:09 Sorunome

Thanks for investigating this and for providing minimal repros!

However, note that moor doesn't support nested transactions. When you call transaction in a transaction zone, it will simply re-use the outer transaction. This is why, in your first repro, select 3 is happening before select 2. Since you're not awaiting the inner transaction, the outer transaction is committed before select 4 is running.

Moor shouldn't deadlock in that case of course, but not awaiting transactions (or any future inside a transaction) is dangerous for this reason. I added assertions for this scenario, so that this can fail instead. This works in a VmDatabase, but the fourth select is actually running outside a transaction which is even worse. I added an assertion in that implementation too.

If you want to support syntactically-nested transactions that actually run sequentially, you can use:

Zone.root.run(() {
  // note the lacking await. It is expected that this runs *after* the outer transaction is done
  database.transaction(() async {
    print('before select 3');
    await database.customSelect('SELECT 1').get();
    print('after select 3');
    await Future.delayed(Duration(seconds: 2));
    print('before select 4');
    await database.customSelect('SELECT 1').get();
    print('after select 4');
  });
});

That prints what you expect.

I can add an optional parameter to transaction to avoid re-using the outer transaction. It would then behave as your comment describes, running after the outer transaction is done.

simolus3 avatar Sep 06 '20 12:09 simolus3

When you call transaction in a transaction zone, it will simply re-use the outer transaction. This is why, in your first repro, select 3 is happening before select 2. Since you're not awaiting the inner transaction, the outer transaction is committed before select 4 is running.

Aaaah, didn't know of darts Zone execution yet, which is apparently used for transactions. With the timer i had attempted to simulate something like Zone.root.run.

Moor shouldn't deadlock in that case of course, but not awaiting transactions (or any future inside a transaction) is dangerous for this reason.

Of course the code is way more complicated, using Zone.root.run actually seems like a rather clean solution for the specific code we have :)

I can add an optional parameter to transaction to avoid re-using the outer transaction. It would then behave as your comment describes, running after the outer transaction is done.

This sounds like it could also be beneficial to other people. And/or, what would be more applicable in our specific case: Mark the parent transaction that all its child transactions don't re-use the parent.

Thanks a lot for all the debugging help, it was really helpful! And yeah, a more sane error would have helped this a lot :)

Sorunome avatar Sep 06 '20 12:09 Sorunome

Should this be closed, probably?

srix55 avatar Aug 03 '23 13:08 srix55