Mailpile
Mailpile copied to clipboard
Message not found - location pointers lost from Metadata index
I am experiencing some problems as described below and would like to determine if this is isolated to my Maipile instance or if others have seen the same symptoms.
My "production" Mailpile instance is from Github master, runs on Debian Stretch, indexes 60,000+ messages and receives incoming messages from three IMAP servers. Normally it runs 24 hours a day for weeks or months at a time.
Symptoms
Sometimes, when clicking on an apparently normal line in the search results window, the message body does not display. (Normal meaning that the sender, subject and date fields contain valid data, and when the mouse is hovered over the subject, a valid snippet of the message body is displayed.) In place of the message body, in the search results window, a "Message not found" error is displayed like this example:
This has happened at least three times over the last few months with this Mailpile instance. When it happens, large numbers of messages (but not all) show the same behaviour. Mailpile continues to accept incoming emails and at least some of these are displayed normally.
Recovery
I've recovered by quitting Mailpile, restoring the homedir
from a 1- to 2- week old backup (i.e. before the symptoms were seen), then restarting. This works because recent messages are still on the IMAP servers.
I've also tried restarting, then doing rescan all
at the CLI interface. This appears to work (after a long wait !!!) but complete verification is difficult. Specifically, I don't know if messages that have been deleted from the IMAP server (which I do manually every few months) are recovered.
At best this is annoying and at worst it's possible that some messages are inaccessible or completely lost. This will shake the faith of new Mailpile users.
Details
The last time this happened I saved the homedir
and decrypted the metadata index mailpile.idx
. Analysis shows that the messages displayed as described above have blank values in the second field of the metadata index entry. This is the MSG_PTRS field, which points to the mailbox and the name of the file containing the email body. This is consistent with the "Message not found" error: in order to display the body of the message, Mailpile would need to read the file containing the message, but there's no file name in the MSG_PTRS field.
The complete mailpile.idx is written from beginning to end infrequently. When Mailpile is started, it appends an #-----APPENDED SECTION-----
entry to mailpile.idx
. Then, when a metadata index entry is changed in memory, the modified version of the entry is appended to mailpile.idx
. Consequently mailpile.idx
contains all the recent history of changes to index entries. If Mailpile is shut down and restarted, mailpile.idx
is read and the last version of an entry overwrites all previous versions with the same Message Index ID. (See Data structures: Metadata index )
Some metadata index entries have blank MSG_PTRS fields for valid reasons. These index entries typically have a subject field which is blank or (missing message)
and a snippet field containing {G}
or {M}
. This is not the case for the errors described here. A grep of the decrypted mailpile.idx
counted about 16,000 entries with blank MSG_PTRS fields that did not have {G}
or {M}
in the snippet field. All of these appeared after the last #-----APPENDED SECTION-----
entry in mailpile.idx
.
A check of several entries having the described symptoms showed that in each, a version containing a valid MSG_PTRS field existed but was later overwritten by a version with the same MSG_MID but a blank MSG_PTRS field; other fields of the entry appear to be unchanged; the version with the valid MSG_PTRS field preceded the last #-----APPENDED SECTION-----
entry. A check of the mail
subdirectory of the homedir
showed that the original MSG_PTRS fields were valid; the files still existed even after these symptoms occurred; when the files were decrypted they appeared to contain the correct messages. The files were also listed in the appropriate pickled-mailbox...
files.
A spot check of 16 messages having blank MSG_PTRS fields and a non-trivial snippet field in the original mailpile
.idx showed that, 24 hours after a rescan all
command, their mailpile.idx
entries had a non-blank MSG_PTRS field, its value being the same as the one before the value was lost. After the rescan all
there remained about 600 entries with blank MSG_PTRS fields and non-trivial snippet, but for a spot check showed that there was a later entry with a non-blank MSG-PTRS field. These spot checks suggest that the rescan all
enabled a complete recovery, but do not prove that.
At best this problem is a serious nuisance (it takes a long time to a rescan all on 60,000 messages) and there is still the possibility of data loss
Any comments?
Long time I did not used Mailpile, but I did have the exact same symptoms. Can remember for sure if the messages got deleted from imap, but I don't think so.
I cannot remember if I was able to recover them with a full rescan, but I think that no, and anyway, the bug was too frequent and the rescan too long for that use.
Thanks for the comments @JocelynDelalande.
I can't figure out what could cause this problem. But, I've added some code to search.py
which will do a traceback.print_stack()
when an entry is written to INDEX
that has a snippet but no file pointer. I'm redirecting all the stderr
output to a file. If the problem happens again I'll have a lot more to work with than I do now. Could take months though!
Bingo! The problem occurred again (actually on 2021-03-23 but I've only now had time to look at the results). I had added code to do a traceback when a metadata entry was written with a blank location but a non-trivial "snippet'. This produced about 20000 tracebacks which all look like this:
14040 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
14041 self.__bootstrap_inner()
14042 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
14043 self.run()
14044 File "/home/jack/Projects/MPJack/Code/Mailpile/mailpile/mail_source/__init__.py", line 1094, in run
14045 self.sync_mail()
14046 File "/home/jack/Projects/MPJack/Code/Mailpile/mailpile/mail_source/__init__.py", line 319, in sync_mail
14047 stop_after=this_batch)
14048 File "/home/jack/Projects/MPJack/Code/Mailpile/mailpile/mail_source/__init__.py", line 1010, in rescan_mailbox
14049 **scan_mailbox_args)
14050 File "/home/jack/Projects/MPJack/Code/Mailpile/mailpile/search.py", line 565, in scan_mailbox
14051 self._remove_location(session, msg_ptr)
14052 File "/home/jack/Projects/MPJack/Code/Mailpile/mailpile/search.py", line 386, in _remove_location
14053 self.set_msg_at_idx_pos(msg_idx_pos, msg_info)
14054 File "/home/jack/Projects/MPJack/Code/Mailpile/mailpile/search.py", line 1717, in set_msg_at_idx_pos
14055 traceback.print_stack()
The decision to delete the filepointer (location) is made in mailpile.search.scan_mailbox() in this code:
553 # Figure out which messages exist at all, and remove stale pointers.
554 # This happens last and in a locked section, because other threads may
555 # have added messages while we were busy with other things.
556 with self._lock:
557 messages = sorted(mbox.keys()) # Update this, in case it changed
558 existing_ptrs = set()
559 for ui in range(0, len(messages)):
560 msg_ptr = mbox.get_msg_ptr(mailbox_idx, messages[ui])
561 existing_ptrs.add(msg_ptr)
562 for msg_ptr in self.PTRS.keys():
563 if (msg_ptr[:MBX_ID_LEN] == mailbox_idx and
564 msg_ptr not in existing_ptrs):
565 self._remove_location(session, msg_ptr)
566 updated += 1
I have never worked on this part of the code so I don't know why it is causing the problem.
Best option is to repeat the process by adding some code to print some variables at this point and wait for the problem to happen again.
Update: This problem occurred again on 2021-04-16. There were about 16554 metadata index lines having a snippet but no message location. Unfortunately, due to an error in my traceback code, I got no useful information from this occurrence.
A rescan all
appeared to recover the data, but I have not verified that with certainty. I fixed my error (I hope) and am awaiting the next occurrence. As of the date of this post, there has been none. That's 25 days so far.
Of course, the Heisenberg Uncertainty Principle may come into play here. Is my traceback code affecting the program in a way which prevents the problem from being manifested?
Patience!!
Update: The problem occurred again on 2021-06-03. Once again I shut down, saved the homedir, restarted and ran rescan all
which appeared to recover all the emails (though have not proved that).
This time debug code produced some useful information. Debug information was printed from search.py
function scan_mailbox
whenever Mailpile deleted a message location pointer for a message that had MSG_PTRS
(pointer to the file containing the message) with length < 20 (i.e. 1 or 0 message location pointers ) and MSG_BODY
(the "snippet") with length > 3. Clearly the lone message pointer to a non-empty message should not be deleted. But this occurred 21293 times. In all cases the message was from the same mailbox, which was the INBOX
of one of my three email accounts. That mailbox now has about 21500 emails so it looks like most or all of the emails in that INBOX
were affected.. The code block below shows where the debug output was placed. The debug output showed that messages
had zero length. The list messages
is set twice at line 476 and 556 in this function with the statement messages = sorted(mbox.keys())
. Debug code also captured the length of messages
before the line 556 update. In all cases the length was 0 before and after the line 556 update.
The next step is to put debug code in mailbox_opener()
and mbox.update_toc()
to try to determine why one or the other produces the zero-length mbox
.
These details are provided in the hope that they may be useful to someone with a better understanding of the code than mine.
Location of the debug code:
# Figure out which messages exist at all, and remove stale pointers.
# This happens last and in a locked section, because other threads may
# have added messages while we were busy with other things.
with self._lock:
existing_ptrs = set()
for ui in range(0, len(messages)):
msg_ptr = mbox.get_msg_ptr(mailbox_idx, messages[ui])
existing_ptrs.add(msg_ptr)
for msg_ptr in self.PTRS.keys():
if (msg_ptr[:MBX_ID_LEN] == mailbox_idx and
msg_ptr not in existing_ptrs):
# ********* Debug output placed here **********
self._remove_location(session, msg_ptr)
updated += 1
The traceback (occurred 21293 times)
File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
self.__bootstrap_inner()
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/home/jack/Projects/MPJack/Code/Mailpile/mailpile/mail_source/__init__.py", line 1094, in run
self.sync_mail()
File "/home/jack/Projects/MPJack/Code/Mailpile/mailpile/mail_source/__init__.py", line 319, in sync_mail
stop_after=this_batch)
File "/home/jack/Projects/MPJack/Code/Mailpile/mailpile/mail_source/__init__.py", line 1010, in rescan_mailbox
**scan_mailbox_args)
File "/home/jack/Projects/MPJack/Code/Mailpile/mailpile/search.py", line 590, in scan_mailbox
traceback.print_stack()
Update:
I inserted a statement at line 557 of search.py so that the code that deletes message location pointers is skipped if the mbox
is zero length (which can't be correct). I've run Mailpile for one month and the "Message not found" problem has not occurred in that time so the kludgy fix is likely working (though I've gone for more than a month between "Message not found" occurrences before). I've also logged a lot (60 MB) of debug printout in an effort to find out why the mbox
is zero length. Next step is to find some time to analyze the debug output! Not sure when I will get to it.
if not lazy:
# Figure out which messages exist at all, and remove stale pointers.
# This happens last and in a locked section, because other threads may
# have added messages while we were busy with other things.
with self._lock:
messages = sorted(mbox.keys()) # Update this, in case it changed
if(len(messages) > 0): # *** DEBUG kludge!
existing_ptrs = set()
(Almost a year later.) The above mentioned patch appears to prevent the wholesale deletion of message pointers so that emails are not lost. But it also appears to prevent messages from being deleted.
It appears that the problem may be caused by concurrent access to messages
by multiple threads. In an attempt to find a workaround and further narrow down the problem I added code to prevent some threads from being launched, specifically those which do tag automation. After logging several months of debug data, it appears that the len(messages) == 0 condition occurs just as frequently even with those threads disabled. So I am no wiser and I am out of ideas.