Shutting off the machine at the wrong time results in "The database disk image is malformed", with the UI stuck and no useful indication that anything is wrong
(based on issue https://github.com/deltachat/deltachat-desktop/issues/4842 )
Sometimes, Delta Chat Desktop no longer syncs on one of my devices. This seems like it could be a core bug. The UI will sometimes say "Updating..." for a while, but the messages shown in the main contact list never update; they're stuck roughly 10 days in the past, and no new messages show up. No individual chat will load either: it just says "Select a chat or create a new chat" and nothing ever happens.
Here's what the main contact list looks like while stuck:

During investigation, I obtained a --devmode terminal output log which shows the following error: 7.2s [w] core/event: 2 src/scheduler.rs:731: Failed fetch_idle: fetch_move_delete: fetch_new_messages: database disk image is malformed: Error code 11: The database disk image is malformed
- Operating System (Linux/Mac/Windows/iOS/Android): Linux
- Core Version: v1.156.2, v1.159.4
- Client Version: 1.54.2, 1.58.2
Expected behavior
Shutting down the device at an unfortunate time won't leave Delta Chat unable to sync after the device starts up again.
Actual behavior
Shutting down the device at an unfortunate time apparently will leave Delta Chat unable to sync after the device starts up again.
Steps to reproduce the problem
- Use Delta Chat
- Hard shut off your device with power loss at just the wrong moment
- I use btrfs. This means any sort of checksum failure due to filesystem corruption should lead to an I/O error on open()/fopen() and not allow any reading of the file, and I think there should also be a dmesg message. Since that didn't happen, I don't think filesystem corruption is the likely cause here (although I did have that on the PinePhone in the past, likely due to how unreliable SD cards are, so I thought I would point that out to be completely transparent).
Screenshots
see above
Logs
see above
This doesn't look like a Core bug, but rather like an SQLite bug (unlikely) or some problem with how your device implements sync requests, e.g. they don't work as write barriers (https://www.sqlite.org/howtocorrupt.html mentions this along with other possible causes; you can check them). Btrfs doesn't necessarily detect such corruption through checksumming, because AFAIU checksums are calculated separately for allocated fs blocks, and if blocks are written in the wrong order by your device, that won't help. Which device do you use?
The PinePhone uses an SD card, so I assume "Disk drives that do not honor sync requests" potentially applies. But the contents shouldn't be corrupted, since Btrfs would catch that (it did in the past, so that's not theoretical).
It's unlikely that Btrfs checksums can protect against reordered writes, because they are calculated for blocks independently. You would rather get some fs metadata mismatch, but if just two data (not metadata) blocks are written in the wrong order, I'd not expect the fs to detect anything unless you use an fs with full data journaling (e.g. ext4 has such a mode, Btrfs doesn't). Do you have Btrfs logs of the previous corruption? Which kind of file/software did that happen to, SQLite or something else?
Oh, it can really happen to any large file you copy onto it; it's probably just SD cards occasionally flipping bits and being SD cards. I wouldn't recommend using SD cards with anything but a filesystem with checksums. It leads to a dmesg message pointing out the checksum error, and the open() fails with EIO or some similar errno value. If you write, say, 50 GiB of files, the chance of one of them having a bit flip isn't that small on an SD card; I tried multiple card vendors too.
So this time it's definitely not bit flipping. I'd suggest you try another fs, say ext4 in the default data=ordered mode, and see what happens (I'd expect the same corruption). Btw, how often does this reproduce? Then you can mount the fs in data=journal mode and run more tests. https://serverfault.com/a/914466 explains how data journaling helps maintain data consistency without write barriers working/enabled. But you should make sure the journal is big enough; 102400 blocks is the maximum size, apparently.
At least some experiments may help us understand how to use Delta Chat on such devices, or falsify assumptions about possible causes of the problem.
The Delta-Chat-getting-stuck event only really reproduces every 2-3 weeks or so, so sadly it's not very often. I use this device a lot, so I can't easily switch out the filesystem. I had ext4 at some point long in the past, but I wasn't using Delta Chat yet around that time.
I guess you switched to Btrfs because ext4 lacks data checksumming? Then Btrfs is better, but apparently also not ideal. What you can do is somehow script Btrfs auto-snapshotting (with rotation of old snapshots) and, in case of a hard shutdown, just roll back to a sufficiently old snapshot where all data has already been synced by your SD card. You'll lose some amount of recently fsync()ed data, but your SD card apparently doesn't guarantee the opposite anyway, and at least you won't get such a silent data corruption (it's effectively silent: SQLite happens to detect it, but other software doesn't necessarily). Still, I'm not sure that will guarantee fs metadata consistency, because if Btrfs moves some metadata around at the moment of a power loss, it may get corrupted, but at least not silently, and I guess Btrfs should avoid that (I'm not familiar with the Btrfs source code, but usually doing so is a bad idea).
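For illustration, a minimal Rust sketch of such a snapshot-rotation helper might look like the following (a cron-driven shell script would work just as well); the subvolume path, snapshot directory and retention count are made-up assumptions, and it simply shells out to btrfs-progs:

```rust
use std::process::Command;
use std::time::{SystemTime, UNIX_EPOCH};

const SUBVOLUME: &str = "/home";            // subvolume to protect (assumption)
const SNAP_DIR: &str = "/home/.snapshots";  // where read-only snapshots are kept (assumption)
const KEEP: usize = 20;                     // how many snapshots to retain (assumption)

fn main() -> std::io::Result<()> {
    std::fs::create_dir_all(SNAP_DIR)?;

    // Create a new read-only snapshot named after the current unix time.
    let stamp = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs();
    let created = Command::new("btrfs")
        .args(["subvolume", "snapshot", "-r", SUBVOLUME])
        .arg(format!("{SNAP_DIR}/snap-{stamp}"))
        .status()?;
    assert!(created.success(), "creating the snapshot failed");

    // Rotation: delete the oldest snapshots beyond the retention count.
    let mut snaps: Vec<_> = std::fs::read_dir(SNAP_DIR)?
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .filter(|p| {
            p.file_name()
                .map_or(false, |n| n.to_string_lossy().starts_with("snap-"))
        })
        .collect();
    snaps.sort(); // equal-width unix-time suffixes sort chronologically
    while snaps.len() > KEEP {
        let oldest = snaps.remove(0);
        Command::new("btrfs")
            .args(["subvolume", "delete"])
            .arg(&oldest)
            .status()?;
    }
    Ok(())
}
```

Run periodically (e.g. from a systemd timer), this keeps a rolling window of read-only snapshots to roll back to after a bad power loss.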
I guess you switched to Btrfs because ext4 lacks data checksumming
Correct! My guess would be that btrfs likely avoids every kind of silent corruption, other than perhaps actually guaranteeing that an fsync() goes through if the SD card doesn't reliably report it, which I doubt it does.
I have an interesting update. I have since seen this happen multiple times, but it seems like the database image actually being malformed was a fluke. Instead, most commonly the core seems to be stuck (and I mean, for all practical purposes, really stuck: I've once had it sit for 40 minutes and it kept going on and on), with the log filled with an endless stream of entries like these:
2025-07-01T18:42:20.964Z core/event INFO "" 1 "src/imap.rs:627: Not moving the message fcada117-13f8-43bf-8431-ed95d77f6aee@localhost that we have seen before."
2025-07-01T18:42:20.972Z core/event INFO "" 1 "src/imap.rs:627: Not moving the message 5587968d-9a3f-4e2c-9b16-2b8df13cde05@localhost that we have seen before."
2025-07-01T18:42:20.977Z core/event INFO "" 1 "src/imap.rs:627: Not moving the message 0fcd52b4-725a-4b34-b558-19f6a0f055e3@localhost that we have seen before."
2025-07-01T18:42:20.984Z core/event INFO "" 1 "src/imap.rs:627: Not moving the message 10d11731-a6e4-4884-9b32-407aa1819837@localhost that we have seen before."
2025-07-01T18:42:20.995Z core/event INFO "" 1 "src/imap.rs:627: Not moving the message 8ec5ac8b-d9a9-44ba-8b92-0faa92a88f38@localhost that we have seen before."
2025-07-01T18:42:21.005Z core/event INFO "" 1 "src/imap.rs:627: Not moving the message 79db4b74-1d14-4f36-91d4-2d115953280f@localhost that we have seen before."
2025-07-01T18:42:21.014Z core/event INFO "" 1 "src/imap.rs:627: Not moving the message 5b150551-6507-4b3a-a9cb-d6401c537deb@localhost that we have seen before."
2025-07-01T18:42:21.026Z core/event INFO "" 1 "src/imap.rs:627: Not moving the message 36fdfb8b-b65f-4665-aa10-d828cc3536b2@localhost that we have seen before."
2025-07-01T18:42:21.035Z core/event INFO "" 1 "src/imap.rs:627: Not moving the message 02b132cb-2e85-49bf-98da-431679e12d85@localhost that we have seen before."
Sometimes it wakes up again after a while (I've seen it stay in this state for only about 10 minutes), but sometimes it either doesn't, or it recovers so late that for all practical purposes it makes more sense to nuke the install and restore from a backup.
Btw, #6877 removes this logic, and Core will try to move all messages from the Inbox and Spam folders as if it were seeing them for the first time. But I don't think that will resolve the problem. It seems that all messages in the folder are prefetched again from scratch. Which folder is it?
Okay, here are the details of my setup:
- I'm guessing what it's fetching from would be the "DeltaChat" folder; I always have the checkbox set to only use the "DeltaChat" folder.
- I don't use any deeper subfolders or any regular mail activity inside "DeltaChat", only outside it in the other folders (not nested in "DeltaChat") that it's not supposed to be watching.
- The IMAP account is shared with a classic mail account, but the Delta Chat traffic uses a separate alias and goes to the subfolder via a server-side rule that moves the messages into that subfolder if Delta Chat doesn't already do so at some point.
- Any sort of auto-deletion in Delta Chat is disabled, since the warning text so clearly tells me not to use it for just a subfolder. Instead, I clean up the folder with Thunderbird sometimes. Multiple Delta Chat clients share this setup and account, so messages are retained on the server until I delete the old ones, usually after around 20-30 days.
I wonder if there is still something useful I could investigate, or perhaps some way to work around this?
The problem is not so much that it re-downloads message headers at some point, but that it then seems to spend a long time downloading and comparing them, simply because there are so many messages. So I feel like this is made more of an issue mainly because it can take multiple hours. Afterwards the client usually wakes up again, so it's a sort of transient and, in theory, rather benign issue.
2025-07-01T18:42:20.964Z core/event INFO "" 1 "src/imap.rs:627: Not moving the message fcada117-13f8-43bf-8431-ed95d77f6aee@localhost that we have seen before."
Do you have any logs preceding these lines? Maybe there's something that may help us understand why all messages in the folder are prefetched again (e.g. the folder's UIDVALIDITY changed).
I would be happy to e-mail the full log file of a fresh launch! It seems like it's still stuck in this state.
I think you can send the full log to relevant members of the "chatmail core" or "Chatmail adventures" Delta Chat group (I'm @iequidoo there as well; maybe somebody else is interested too).
There is another reference: https://sqlite.org/forum/info/e42b145171b6fdb32b3033f350c4f1692a8e0b7f015ceec624b595b6055777bd Generally, SD cards are not supported by SQLite, it seems, and the filesystem cannot do anything about it. Even if WAL mode is switched off and there is a single file, database corruption is possible. So the original issue of database corruption is a "won't fix" and can only be closed as not planned.
Regarding "Not moving the message" log messages that come from here: https://github.com/chatmail/core/blob/7f3648f8ae0e98f28c4a3eccc428e712b12af39a/src/imap.rs#L625-L628 It does not mean we would move the message otherwise, if it is an old message it is likely already in its target folder. This usually happens if fetching messages timed out and we have not updated the lowest seen UID. Since we have merged https://github.com/chatmail/core/pull/6997 (core 2.2.0) lowest UID is updated as long as some messages are fetched successfully even if connection times out while fetching new messages.
While I understand you can't easily do much about this, I wonder if this is also an issue for Android and iOS? Perhaps this is a horrible suggestion, but depending on how large the database is (e.g. whether the media files are stored inside the database or outside), one easy workaround could be to keep a copy from the last time the app started, to go back to. At least when Delta Chat operates in the mode where messages are left on the server, there probably wouldn't be any data loss at all, and in the other mode, it would be less loss than having nothing to go back to.
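For what it's worth, a hedged sketch of that backup-copy idea, run before the app opens the database (the accounts/<id>/dc.db layout, the example path, and the timestamped backup name are assumptions for illustration, not how Delta Chat actually works):

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::{SystemTime, UNIX_EPOCH};

fn backup_profile_db(account_dir: &Path) -> io::Result<()> {
    let db = account_dir.join("dc.db"); // assumed database filename
    if !db.exists() {
        return Ok(()); // nothing to back up yet
    }
    let stamp = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs();
    // A plain copy is enough for a closed database; if WAL mode were active
    // and the app running, the -wal/-shm files would have to be handled too.
    fs::copy(&db, account_dir.join(format!("dc.db.bak-{stamp}")))?;
    Ok(())
}

fn main() -> io::Result<()> {
    // Hypothetical profile path; adjust to the real account directory.
    backup_profile_db(Path::new("/home/user/.config/DeltaChat/accounts/example"))
}
```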
I wonder if this is also an issue for Android and iOS?
Most Android phones use UFS storage. It may be possible to install the app to an SD card, but I am not sure new phones allow it, and it may require the app to configure this in the manifest. Most phones don't even have an SD card slot. On iOS, I think it is never possible to install the app to an SD card.
Since we merged #6997 (core 2.2.0), the lowest UID is updated as long as some messages are fetched successfully, even if the connection times out while fetching new messages.
The problem is that no messages are fetched successfully in @ell1e's log; the timeout occurs right away, so we won't update the lowest UID even after #6997. And before that, prefetching takes ~5 minutes; maybe some old backup was restored. That's why I created #7031.
Most Android phones use UFS storage
I guess this might be best discussed on the forum, but I do wonder if that makes a difference for fsync handling. I started a discussion here.
one easy workaround could be to keep a copy from the last time the app started, to go back to. At least when Delta Chat operates in the mode where messages are left on the server, there probably wouldn't be any data loss at all, and in the other mode, it would be less loss than having nothing to go back to.
This should be solved at the fs level; adding crutches to Delta Chat or even to SQLite won't solve it reliably anyway, and you may have other software affected. But I'm not aware of special filesystems for such SD cards, and I'm not even sure that ext4 with data journaling will solve it.
I would recommend responding on the forum thread (since I guess this issue is now primarily about the refetching of messages). But I don't think this is solvable at the FS level.
For what it's worth, in 2.9.0 it still seems to fail with syncing. (No idea if the commit made it into that version or not.)
The commit is only in an unmerged PR, #7031. It won't fix the issue completely, because the reason for the timeouts in the log is still unclear, but at least there should be progress in syncing messages over time; I expect that 500 messages can be prefetched and at least some of them downloaded before a timeout occurs.
Sadly, this is still happening and requires me to nuke the entire profile every two weeks or so. The log files themselves also seem like a problem, since when Delta Chat is stuck in this state it easily produces 30 MiB+ of log data in under an hour.
Could you share the new log just in case? I think the full log isn't needed; when a timeout or another network error occurs (which should be a WARNING in the log), the reason should either be clear or not clear at all. Also, you can try #7031 (rebased on main) to see how much it helps; for Desktop it's easy, just replace the deltachat-rpc-server binary.
#7031 is merged, so this can be rechecked after the new release is done. If timeouts continue to happen, we can create a separate issue for that, but I expect that Delta Chat shouldn't get stuck when fetching new messages anymore, even if timeouts happen.
Closing this, as 2.17 is released with #7031 merged. For 2.17+, better to open a new issue with a new log if this continues to be a problem.