deltachat-core-rust
CI failed with `database is locked`
Here is a failure log:
snapshot = group.get_basic_snapshot()
assert snapshot.name == "test group"
group.set_name("new name")
snapshot = group.get_full_snapshot()
assert snapshot.name == "new name"
> msg = group.send_message(text="hi")
tests/test_something.py:221:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.tox/py/lib/python3.13/site-packages/deltachat_rpc_client/chat.py:144: in send_message
msg_id = self._rpc.send_msg(self.account.id, self.id, draft)
.tox/py/lib/python3.13/site-packages/deltachat_rpc_client/rpc.py:42: in __call__
return future()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <deltachat_rpc_client.rpc.RpcFuture object at 0x7f6be33ff230>
def __call__(self):
self.event.wait()
response = self.rpc.request_results.pop(self.request_id)
if "error" in response:
> raise JsonRpcError(response["error"])
E deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'database is locked\n\nCaused by:\n Error code 5: The database file is locked'}
.tox/py/lib/python3.13/site-packages/deltachat_rpc_client/rpc.py:28: JsonRpcError
------------------------------ Captured log call -------------------------------
We should at least add more context to this error so we know which SQL query failed. But if it is an SQLITE_LOCKED, this should never happen, and it indicates there is some problem with upgrading a read transaction to a write transaction. Maybe there is some INSERT with a SELECT that should be wrapped into an immediate transaction, to avoid the SELECT taking a read lock that later needs to be upgraded to a write lock, or something like this?
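For illustration, wrapping such a SELECT-then-INSERT into an immediate transaction would look roughly like this with rusqlite; this is a sketch with a made-up helper and query, not actual sql.rs code:

```rust
use rusqlite::{params, Connection, TransactionBehavior};

/// Hypothetical example, not a real deltachat function.
fn add_contact(conn: &mut Connection, chat_id: i64, contact_id: i64) -> rusqlite::Result<()> {
    // A default (deferred) transaction takes a shared read lock at the SELECT
    // and only tries to upgrade it to a write lock at the INSERT; if another
    // writer is active at that point, the upgrade fails with SQLITE_BUSY.
    // BEGIN IMMEDIATE takes the write lock up front instead.
    let tx = conn.transaction_with_behavior(TransactionBehavior::Immediate)?;
    let already_member: bool = tx.query_row(
        "SELECT EXISTS(SELECT 1 FROM chats_contacts WHERE chat_id=? AND contact_id=?)",
        params![chat_id, contact_id],
        |row| row.get(0),
    )?;
    if !already_member {
        tx.execute(
            "INSERT INTO chats_contacts (chat_id, contact_id) VALUES (?, ?)",
            params![chat_id, contact_id],
        )?;
    }
    tx.commit()
}
```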
5 is SQLITE_BUSY, but it indeed should never happen; probably `write_lock()` is forgotten for a write transaction somewhere or so.
Another failure (on ubuntu-latest), this time about remove_contact:
assert (msg.get_snapshot()).text == "hi"
group.forward_messages([msg])
group.set_draft(text="test draft")
draft = group.get_draft()
assert draft.text == "test draft"
group.remove_draft()
assert not group.get_draft()
assert group.get_messages()
group.get_fresh_message_count()
group.mark_noticed()
assert group.get_contacts()
> group.remove_contact(alice_chat_bob)
tests/test_something.py:235:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.tox/py/lib/python3.7/site-packages/deltachat_rpc_client/chat.py:231: in remove_contact
self._rpc.remove_contact_from_chat(self.account.id, self.id, contact_id)
.tox/py/lib/python3.7/site-packages/deltachat_rpc_client/rpc.py:42: in __call__
return future()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <deltachat_rpc_client.rpc.RpcFuture object at 0x7f661c481650>
def __call__(self):
self.event.wait()
response = self.rpc.request_results.pop(self.request_id)
if "error" in response:
> raise JsonRpcError(response["error"])
E deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'database is locked\n\nCaused by:\n Error code 5: The database file is locked'}
It sends a message internally, so it could be the same problem.
Running `while pytest -s tests/test_something.py::test_chat ; do :; done` locally in 3 terminals in parallel eventually fails. Probably one terminal is enough, as two of them failed; running in parallel just makes it fail faster.
With much more logging I collected these errors:
deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'anyhow: Failed to remove contact from chats_contacts table: Failed to EXECUTE SQL query "DELETE FROM chats_contacts WHERE chat_id=? AND contact_id=?": database is locked: Error code 5: The database file is locked'}
deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'anyhow: Failed to send created message: send_msg_inner: prepare_send_msg: Failed to create send jobs: Failed to update message subject: Failed to EXECUTE SQL query "UPDATE msgs SET subject=? WHERE id=?;": database is locked: Error code 5: The database file is locked'}
deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'anyhow: Failed to create group: Failed to INSERT SQL query "INSERT INTO chats\\n (type, name, grpid, param, created_timestamp)\\n VALUES(?, ?, ?, \'U=1\', ?);": database is locked: Error code 5: The database file is locked'}
deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'anyhow: Failed to add contact to chat: Transaction body failed: database is locked: Error code 5: The database file is locked'}
So all sorts of JSON-RPC requests fail, even a plain INSERT.
The WAL documentation has this list of reasons why SQLITE_BUSY may be returned:
https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode
The list seems to be non-exhaustive. Maybe some connection merges the WAL back into the database (a checkpoint) or something like this.
> probably `write_lock()` is forgotten for a write transaction somewhere or so.
We have a very limited API in sql.rs that does not allow just getting a connection without the write lock and starting to write there. It is possible to execute something like an INSERT or UPDATE inside `query_row` (and we actually have two RETURNING statements that do so, but in a transaction), but I don't think this is what happens here.
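For illustration, the RETURNING pattern mentioned above looks roughly like this (hypothetical table and query, not one of the two actual statements):

```rust
use rusqlite::Connection;

/// Hypothetical example of a write going through the row-query API: the
/// UPDATE ... RETURNING is executed by query_row(), so it is only safe on a
/// connection that is allowed to write, inside a write transaction.
fn bump_counter(conn: &Connection) -> rusqlite::Result<i64> {
    conn.query_row(
        "UPDATE counters SET value=value+1 WHERE name='msgs' RETURNING value",
        [],
        |row| row.get(0),
    )
}
```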
Currently I plan to try:
- ~~Replacing `prepare_cached` with `prepare` to avoid long-lived prepared statements.~~ (edit: did not help)
- Reducing the test case
I tried setting `PRAGMA wal_autocheckpoint=0;` next to `busy_timeout = 0` and it did not help, so it is not about some connection doing WAL checkpointing.
Will try to set `PRAGMA query_only=0` or `PRAGMA query_only=1` when handing out a write or a read-only connection, to make sure we never write into a read-only connection.
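Roughly, the idea is something like this (a sketch; the real connection handling in sql.rs looks different):

```rust
use rusqlite::Connection;

/// Sketch: mark a pooled connection as read-only or writable before handing
/// it out, so that an accidental write on a read connection fails loudly
/// with SQLITE_READONLY instead of taking locks it should not take.
fn set_query_only(conn: &Connection, read_only: bool) -> rusqlite::Result<()> {
    if read_only {
        conn.execute_batch("PRAGMA query_only=1;")
    } else {
        conn.execute_batch("PRAGMA query_only=0;")
    }
}
```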
> 5 is SQLITE_BUSY, but it indeed should never happen; probably `write_lock()` is forgotten for a write transaction somewhere or so.
I made a PR that rules out the possibility of us accidentally writing to a read connection: https://github.com/deltachat/deltachat-core-rust/pull/6076
I cannot reproduce it anymore, probably because the network conditions changed or something like that. I don't think it is fixed for real, but without the ability to reproduce it I cannot debug it further.
Now that we have a `query_only` argument to `Pool.get()` anyway since #6076, maybe we should indeed dedicate a single write connection in the `Pool` and, instead of having the write mutex in `Sql`, make `Pool.get()` wait until the write connection is returned. The `Pool` locks connections behind a mutex anyway, so it can as well have a `write_connection: Mutex<Option<Connection>>`.
If it does not negatively affect performance, having a separate connection instead of toggling `query_only` all the time looks a bit cleaner.
Even without a dedicated write connection, the write mutex is better moved inside the `Pool` so that it never hands out two `query_only=0` connections at the same time. EDIT: made PR #6084 implementing it.
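A minimal sketch of that idea, assuming a tokio async mutex around rusqlite connections (the actual Pool in sql.rs is more involved):

```rust
use rusqlite::Connection;
use tokio::sync::{Mutex, MutexGuard};

/// Sketch: the pool owns exactly one writable connection behind an async
/// mutex, so two query_only=0 connections can never be out at the same time;
/// callers of get_write() wait until the previous writer has returned it.
struct Pool {
    /// Read-only connections (query_only=1), handed out freely.
    read_connections: Mutex<Vec<Connection>>,
    /// The single write connection (query_only=0).
    write_connection: Mutex<Connection>,
}

impl Pool {
    async fn get_write(&self) -> MutexGuard<'_, Connection> {
        self.write_connection.lock().await
    }
}
```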
See discussion at https://github.com/deltachat/deltachat-core-rust/pull/6076#discussion_r1807836810
> I tried setting `PRAGMA wal_autocheckpoint=0;` next to `busy_timeout = 0` and it did not help, so it is not about some connection doing WAL checkpointing.
Still, it seems we need to be prepared for this. Just in case, citing https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode here so we don't forget about it:
> When the last connection to a particular database is closing, that connection will acquire an exclusive lock for a short time while it cleans up the WAL and shared-memory files. If a separate attempt is made to open and query the database while the first connection is still in the middle of its cleanup process, the second connection might get an SQLITE_BUSY error.
But we are never closing connections, so how can this happen? Connections returned to the pool are not closed.
You're right, but then there's another problem: the WAL swells indefinitely. E.g. my WAL is now ~200M, which is much more than 1000 pages (the SQLITE_DEFAULT_WAL_AUTOCHECKPOINT macro).
Strange, the largest WAL I have here is 8.7M.
My DC Desktop has been running since Sep 21, maybe this is the reason?
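For reference, a WAL checkpoint can also be forced manually; a minimal rusqlite sketch, where TRUNCATE additionally resets the WAL file to zero bytes if the checkpoint completes:

```rust
use rusqlite::Connection;

/// Sketch: request a full checkpoint and truncate the WAL file afterwards.
/// The pragma reports (busy, log, checkpointed) counters, which are ignored
/// here.
fn truncate_wal(conn: &Connection) -> rusqlite::Result<()> {
    conn.execute_batch("PRAGMA wal_checkpoint(TRUNCATE);")
}
```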
I think I found the reason: we run `PRAGMA incremental_vacuum` via `query_row_optional`, which uses a read-only connection. With the recent changes it will always fail:
sqlite> PRAGMA query_only=1;
sqlite> PRAGMA incremental_vacuum;
Runtime error: attempt to write a readonly database (8)
Here is the code; it should run from `call_write` or similar:
https://github.com/deltachat/deltachat-core-rust/blob/61fd0d400fd2dbd40ba9494d8d8a65a0c5749d9f/src/sql.rs#L718-L725
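That is, the vacuum has to go through the write connection. A minimal sketch of the direction (not the actual change):

```rust
use rusqlite::Connection;

/// Sketch: PRAGMA incremental_vacuum writes to the database file, so it must
/// run on a connection with query_only=0, i.e. one obtained via call_write()
/// or an equivalent write-connection helper, never on a read-only pooled
/// connection.
fn incremental_vacuum(conn: &Connection) -> rusqlite::Result<()> {
    conn.execute_batch("PRAGMA incremental_vacuum;")
}
```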
This is also what happens in recent CI failures:
11.66 [events-ac4] INFO src/sql.rs:843: Start housekeeping...
11.66 [events-ac4] INFO src/sql.rs:882: 3 files in use.
11.66 [events-ac4] WARNING src/sql.rs:769: Failed to run incremental vacuum: attempt to write a readonly database: Error code 8: Attempt to write a readonly database.
11.66 [events-ac4] INFO src/sql.rs:824: Housekeeping done.
We run housekeeping once on just-created contexts, so it makes sense that this affects the tests.
We can already close this issue, as `query_only` prevents locking the database with a read connection, but for `PRAGMA incremental_vacuum` to work I made PR #6087.
Regarding the WAL, I opened an issue: https://github.com/deltachat/deltachat-core-rust/issues/6089
#6087 is merged, but there was another failure: https://github.com/deltachat/deltachat-core-rust/pull/6087#issuecomment-2430157225
Let's close this until it fails again; at least now failures should be greatly reduced, as `incremental_vacuum` no longer happens on a read connection immediately after account creation.