
CI failed with `database is locked`

link2xt opened this issue 1 year ago • 19 comments

Here is a failure log:

         snapshot = group.get_basic_snapshot()
        assert snapshot.name == "test group"
        group.set_name("new name")
        snapshot = group.get_full_snapshot()
        assert snapshot.name == "new name"
    
>       msg = group.send_message(text="hi")

tests/test_something.py:221: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.tox/py/lib/python3.13/site-packages/deltachat_rpc_client/chat.py:144: in send_message
    msg_id = self._rpc.send_msg(self.account.id, self.id, draft)
.tox/py/lib/python3.13/site-packages/deltachat_rpc_client/rpc.py:42: in __call__
    return future()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <deltachat_rpc_client.rpc.RpcFuture object at 0x7f6be33ff230>

    def __call__(self):
        self.event.wait()
        response = self.rpc.request_results.pop(self.request_id)
        if "error" in response:
>           raise JsonRpcError(response["error"])
E           deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'database is locked\n\nCaused by:\n    Error code 5: The database file is locked'}

.tox/py/lib/python3.13/site-packages/deltachat_rpc_client/rpc.py:28: JsonRpcError
------------------------------ Captured log call -------------------------------

We should at least add more context to this error so we know which SQL query failed. But if it is SQLITE_LOCKED, this should never happen, and it indicates some problem with upgrading a read transaction to a write transaction. Maybe there is some INSERT with a SELECT that should be wrapped into an immediate transaction, to avoid the SELECT taking a read lock that later needs to be upgraded to a write lock, or something like this?
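
(For illustration, a minimal sketch of that immediate-transaction pattern with rusqlite; the function and the exact statements are made up, only the chats_contacts table name comes from the logs below. BEGIN IMMEDIATE takes the write lock up front, so the SELECT never holds a plain read lock that would later have to be upgraded:)

```rust
use rusqlite::{Connection, TransactionBehavior};

// Hypothetical example of the suspected pattern: a SELECT followed by an
// INSERT. With TransactionBehavior::Immediate the transaction acquires the
// write lock immediately, instead of starting as a reader and upgrading.
fn add_to_chat(conn: &mut Connection, chat_id: i64, contact_id: i64) -> rusqlite::Result<()> {
    let tx = conn.transaction_with_behavior(TransactionBehavior::Immediate)?;
    let exists: bool = tx.query_row(
        "SELECT COUNT(*) FROM chats_contacts WHERE chat_id=? AND contact_id=?",
        (chat_id, contact_id),
        |row| row.get::<_, i64>(0).map(|n| n > 0),
    )?;
    if !exists {
        tx.execute(
            "INSERT INTO chats_contacts (chat_id, contact_id) VALUES (?, ?)",
            (chat_id, contact_id),
        )?;
    }
    tx.commit()
}
```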

link2xt avatar Oct 19 '24 15:10 link2xt

5 is SQLITE_BUSY, but indeed it should never happen; probably a write_lock() was forgotten for some write transaction.

iequidoo avatar Oct 19 '24 15:10 iequidoo

Another failure (on ubuntu-latest), this time about remove_contact:

         assert (msg.get_snapshot()).text == "hi"
        group.forward_messages([msg])
    
        group.set_draft(text="test draft")
        draft = group.get_draft()
        assert draft.text == "test draft"
        group.remove_draft()
        assert not group.get_draft()
    
        assert group.get_messages()
        group.get_fresh_message_count()
        group.mark_noticed()
        assert group.get_contacts()
>       group.remove_contact(alice_chat_bob)

tests/test_something.py:235: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.tox/py/lib/python3.7/site-packages/deltachat_rpc_client/chat.py:231: in remove_contact
    self._rpc.remove_contact_from_chat(self.account.id, self.id, contact_id)
.tox/py/lib/python3.7/site-packages/deltachat_rpc_client/rpc.py:42: in __call__
    return future()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <deltachat_rpc_client.rpc.RpcFuture object at 0x7f661c481650>

    def __call__(self):
        self.event.wait()
        response = self.rpc.request_results.pop(self.request_id)
        if "error" in response:
>           raise JsonRpcError(response["error"])
E           deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'database is locked\n\nCaused by:\n    Error code 5: The database file is locked'}

remove_contact sends a message internally, so it could be the same problem.

link2xt avatar Oct 19 '24 20:10 link2xt

Running `while pytest -s tests/test_something.py::test_chat; do :; done` locally in 3 terminals in parallel eventually fails. Probably one terminal is enough, as two of the three failed; running in parallel just makes it fail faster.

link2xt avatar Oct 20 '24 08:10 link2xt

With much more logging I collected these errors:

deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'anyhow: Failed to remove contact from chats_contacts table: Failed to EXECUTE SQL query "DELETE FROM chats_contacts WHERE chat_id=? AND contact_id=?": database is locked: Error code 5: The database file is locked'}

deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'anyhow: Failed to send created message: send_msg_inner: prepare_send_msg: Failed to create send jobs: Failed to update message subject: Failed to EXECUTE SQL query "UPDATE msgs SET subject=? WHERE id=?;": database is locked: Error code 5: The database file is locked'}

deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'anyhow: Failed to create group: Failed to INSERT SQL query "INSERT INTO chats\\n        (type, name, grpid, param, created_timestamp)\\n        VALUES(?, ?, ?, \'U=1\', ?);": database is locked: Error code 5: The database file is locked'}

deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'anyhow: Failed to add contact to chat: Transaction body failed: database is locked: Error code 5: The database file is locked'}

So all sorts of JSON-RPC requests fail, even a plain INSERT.

link2xt avatar Oct 20 '24 11:10 link2xt

The WAL documentation has this list of reasons why SQLITE_BUSY may be returned: https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode. The list seems to be non-exhaustive. Maybe some connection is merging the WAL back into the database, or something like this.

link2xt avatar Oct 20 '24 11:10 link2xt

> probably a write_lock() was forgotten for some write transaction.

We have a very limited API in sql.rs that does not allow just getting a connection without the write lock and writing there. It is possible to execute something like an INSERT or UPDATE inside query_row (and we actually have two RETURNING statements that do so, but inside a transaction), but I don't think this is what happens here.
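
(As an aside, a hypothetical sketch of that RETURNING pattern with rusqlite; the table and column names are made up. The statement writes even though it goes through the row-returning query_row API, which is why it must run inside a write transaction:)

```rust
use rusqlite::Transaction;

// Hypothetical example: an INSERT executed through the row-returning
// query_row API. RETURNING yields a row, but the statement still writes,
// so it must run on a write connection, inside a transaction.
fn insert_token(tx: &Transaction, namespc: i64, token: &str) -> rusqlite::Result<i64> {
    tx.query_row(
        "INSERT INTO tokens (namespc, token) VALUES (?, ?) RETURNING id",
        (namespc, token),
        |row| row.get(0),
    )
}
```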

Currently I plan to try:

  • ~Replacing prepare_cached with prepare to avoid long-lived prepared statements.~ (edit: did not help)
  • Reducing the test case.

link2xt avatar Oct 20 '24 11:10 link2xt

I tried setting `PRAGMA wal_autocheckpoint=0;` next to `busy_timeout = 0`, and it did not help, so it is not some connection doing WAL checkpointing.

link2xt avatar Oct 20 '24 12:10 link2xt

I will try setting `PRAGMA query_only=0` or `PRAGMA query_only=1` whenever a write or read-only connection is handed out, to make sure we never write to a read-only connection.
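
(A minimal sketch of that idea with rusqlite; the function is hypothetical, not the actual sql.rs API. With query_only=1, any write on the connection fails loudly instead of taking locks:)

```rust
use rusqlite::Connection;

// Hypothetical pool hook: tag a connection as read-only or writable before
// handing it out. On a query_only=1 connection, every write statement fails
// with "attempt to write a readonly database" (error code 8).
fn configure_connection(conn: &Connection, writable: bool) -> rusqlite::Result<()> {
    conn.pragma_update(None, "query_only", &(!writable))
}
```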

link2xt avatar Oct 20 '24 12:10 link2xt

> 5 is SQLITE_BUSY, but indeed it should never happen; probably a write_lock() was forgotten for some write transaction.

I made a PR that rules out the possibility of us accidentally writing to a read connection: https://github.com/deltachat/deltachat-core-rust/pull/6076

link2xt avatar Oct 20 '24 13:10 link2xt

I cannot reproduce it anymore, probably because the network conditions changed or something like that. I don't think it is fixed for real, but without the ability to reproduce it I cannot debug it further.

link2xt avatar Oct 20 '24 17:10 link2xt

Now that Pool.get() has a query_only argument anyway since #6076, maybe we should indeed dedicate a single write connection in the Pool and, instead of having the write mutex in Sql, make Pool.get() wait until the write connection is returned. The Pool locks connections behind a mutex anyway, so it can as well have a write_connection: Mutex<Option<Connection>>.

If it does not negatively affect performance, having a separate connection instead of toggling query_only all the time looks a bit cleaner.

Even without a dedicated write connection, the write mutex is better moved inside the Pool, so that it never hands out two query_only=0 connections at the same time (a rough sketch follows below). EDIT: made PR #6084 implementing this.
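
(A rough sketch of that Pool shape, purely illustrative: synchronous std types instead of the async code base, invented names, rusqlite assumed:)

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

use rusqlite::Connection;

// Illustrative only, not the actual sql.rs types.
struct Pool {
    reads: Mutex<VecDeque<Connection>>, // opened with PRAGMA query_only=1
    write: Mutex<Option<Connection>>,   // the single query_only=0 connection
    write_returned: Condvar,
}

impl Pool {
    // A write request blocks until the single write connection is back,
    // so two writers can never run concurrently; the Mutex<Option<..>>
    // plays the role of the old write mutex in Sql.
    fn get(&self, query_only: bool) -> Connection {
        if query_only {
            self.reads.lock().unwrap().pop_front().expect("pool is empty")
        } else {
            let mut slot = self.write.lock().unwrap();
            loop {
                if let Some(conn) = slot.take() {
                    return conn;
                }
                slot = self.write_returned.wait(slot).unwrap();
            }
        }
    }

    fn put_write(&self, conn: Connection) {
        *self.write.lock().unwrap() = Some(conn);
        self.write_returned.notify_one();
    }
}
```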

See discussion at https://github.com/deltachat/deltachat-core-rust/pull/6076#discussion_r1807836810

link2xt avatar Oct 21 '24 10:10 link2xt

> I tried setting `PRAGMA wal_autocheckpoint=0;` next to `busy_timeout = 0`, and it did not help, so it is not some connection doing WAL checkpointing.

Still, it seems we need to be prepared for this. Just in case, citing https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode here so we don't forget about it:

> When the last connection to a particular database is closing, that connection will acquire an exclusive lock for a short time while it cleans up the WAL and shared-memory files. If a separate attempt is made to open and query the database while the first connection is still in the middle of its cleanup process, the second connection might get an SQLITE_BUSY error.

iequidoo avatar Oct 21 '24 18:10 iequidoo

But we are never closing connections, so how can this happen? Connections returned to the pool are not closed.

link2xt avatar Oct 21 '24 18:10 link2xt

You're right, but then there's another problem: the WAL swells indefinitely. E.g. my WAL is now ~200M, which is much more than 1000 pages (the SQLITE_DEFAULT_WAL_AUTOCHECKPOINT macro); at the default 4096-byte page size, 1000 pages is only about 4M.

iequidoo avatar Oct 21 '24 19:10 iequidoo

Strange, the largest WAL I have here is 8.7M.

link2xt avatar Oct 21 '24 19:10 link2xt

My DC Desktop has been running since Sep 21, maybe this is the reason?

iequidoo avatar Oct 21 '24 19:10 iequidoo

I think I found the reason: we run PRAGMA incremental_vacuum via query_row_optional, which uses a read-only connection. With the recent changes it will always fail:

sqlite> PRAGMA query_only=1;
sqlite> PRAGMA incremental_vacuum;
Runtime error: attempt to write a readonly database (8)

Here is the code that should run from call_write or similar: https://github.com/deltachat/deltachat-core-rust/blob/61fd0d400fd2dbd40ba9494d8d8a65a0c5749d9f/src/sql.rs#L718-L725
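
(For illustration, a minimal sketch of running the pragma on a writable connection with rusqlite; names are illustrative, this is not the actual sql.rs code:)

```rust
use rusqlite::Connection;

// Run PRAGMA incremental_vacuum on a writable (query_only=0) connection.
// The pragma writes to the database file, so on a query_only=1 connection
// it fails with SQLITE_READONLY (error code 8), as seen in the CI log.
fn run_incremental_vacuum(conn: &Connection) -> rusqlite::Result<()> {
    let mut stmt = conn.prepare("PRAGMA incremental_vacuum")?;
    // The pragma may return rows; step through all of them so the
    // vacuum runs to completion.
    let mut rows = stmt.query([])?;
    while rows.next()?.is_some() {}
    Ok(())
}
```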

This is also what happens in recent CI failures:

11.66 [events-ac4] INFO src/sql.rs:843: Start housekeeping...
11.66 [events-ac4] INFO src/sql.rs:882: 3 files in use.
11.66 [events-ac4] WARNING src/sql.rs:769: Failed to run incremental vacuum: attempt to write a readonly database: Error code 8: Attempt to write a readonly database.
11.66 [events-ac4] INFO src/sql.rs:824: Housekeeping done.

We run housekeeping once on just-created contexts, so it makes sense that this affects the tests.

link2xt avatar Oct 21 '24 20:10 link2xt

We can already close this issue, as query_only prevents locking the database with a read connection, but for PRAGMA incremental_vacuum to work I made PR #6087.

link2xt avatar Oct 21 '24 20:10 link2xt

Regarding the WAL growth, I opened issue https://github.com/deltachat/deltachat-core-rust/issues/6089

link2xt avatar Oct 21 '24 21:10 link2xt

#6087 is merged, but there was another failure: https://github.com/deltachat/deltachat-core-rust/pull/6087#issuecomment-2430157225

link2xt avatar Oct 22 '24 23:10 link2xt

Let's close this until it fails again; at least now failures should be greatly reduced, as incremental_vacuum no longer happens on a read connection immediately after account creation.

link2xt avatar Oct 25 '24 19:10 link2xt