Mailpile icon indicating copy to clipboard operation
Mailpile copied to clipboard

Message not found - location pointers lost from Metadata index

Open JackDca opened this issue 3 years ago • 7 comments

I am experiencing some problems as described below and would like to determine if this is isolated to my Maipile instance or if others have seen the same symptoms.

My "production" Mailpile instance is from Github master, runs on Debian Stretch, indexes 60,000+ messages and receives incoming messages from three IMAP servers. Normally it runs 24 hours a day for weeks or months at a time.

Symptoms

Sometimes, when clicking on an apparently normal line in the search results window, the message body does not display. (Normal meaning that the sender, subject and date fields contain valid data, and when the mouse is hovered over the subject, a valid snippet of the message body is displayed.) In place of the message body, in the search results window, a "Message not found" error is displayed like this example:

MessageNotFound

This has happened at least three times over the last few months with this Mailpile instance. When it happens, large numbers of messages (but not all) show the same behaviour. Mailpile continues to accept incoming emails and at least some of these are displayed normally.

Recovery

I've recovered by quitting Mailpile, restoring the homedir from a 1- to 2- week old backup (i.e. before the symptoms were seen), then restarting. This works because recent messages are still on the IMAP servers.

I've also tried restarting, then doing rescan all at the CLI interface. This appears to work (after a long wait !!!) but complete verification is difficult. Specifically, I don't know if messages that have been deleted from the IMAP server (which I do manually every few months) are recovered.

At best this is annoying and at worst it's possible that some messages are inaccessible or completely lost. This will shake the faith of new Mailpile users.

Details

The last time this happened I saved the homedir and decrypted the metadata index mailpile.idx. Analysis shows that the messages displayed as described above have blank values in the second field of the metadata index entry. This is the MSG_PTRS field, which points to the mailbox and the name of the file containing the email body. This is consistent with the "Message not found" error: in order to display the body of the message, Mailpile would need to read the file containing the message, but there's no file name in the MSG_PTRS field.

The complete mailpile.idx is written from beginning to end infrequently. When Mailpile is started, it appends an #-----APPENDED SECTION----- entry to mailpile.idx. Then, when a metadata index entry is changed in memory, the modified version of the entry is appended to mailpile.idx. Consequently mailpile.idx contains all the recent history of changes to index entries. If Mailpile is shut down and restarted, mailpile.idx is read and the last version of an entry overwrites all previous versions with the same Message Index ID. (See Data structures: Metadata index )

Some metadata index entries have blank MSG_PTRS fields for valid reasons. These index entries typically have a subject field which is blank or (missing message) and a snippet field containing {G} or {M}. This is not the case for the errors described here. A grep of the decrypted mailpile.idx counted about 16,000 entries with blank MSG_PTRS fields that did not have {G} or {M} in the snippet field. All of these appeared after the last #-----APPENDED SECTION----- entry in mailpile.idx.

A check of several entries having the described symptoms showed that in each, a version containing a valid MSG_PTRS field existed but was later overwritten by a version with the same MSG_MID but a blank MSG_PTRS field; other fields of the entry appear to be unchanged; the version with the valid MSG_PTRS field preceded the last #-----APPENDED SECTION----- entry. A check of the mail subdirectory of the homedir showed that the original MSG_PTRS fields were valid; the files still existed even after these symptoms occurred; when the files were decrypted they appeared to contain the correct messages. The files were also listed in the appropriate pickled-mailbox... files.

A spot check of 16 messages having blank MSG_PTRS fields and a non-trivial snippet field in the original mailpile.idx showed that, 24 hours after a rescan all command, their mailpile.idx entries had a non-blank MSG_PTRS field, its value being the same as the one before the value was lost. After the rescan all there remained about 600 entries with blank MSG_PTRS fields and non-trivial snippet, but for a spot check showed that there was a later entry with a non-blank MSG-PTRS field. These spot checks suggest that the rescan all enabled a complete recovery, but do not prove that.

At best this problem is a serious nuisance (it takes a long time to a rescan all on 60,000 messages) and there is still the possibility of data loss

Any comments?

JackDca avatar Mar 05 '21 16:03 JackDca

Long time I did not used Mailpile, but I did have the exact same symptoms. Can remember for sure if the messages got deleted from imap, but I don't think so.

I cannot remember if I was able to recover them with a full rescan, but I think that no, and anyway, the bug was too frequent and the rescan too long for that use.

JocelynDelalande avatar Mar 11 '21 11:03 JocelynDelalande

Thanks for the comments @JocelynDelalande.

I can't figure out what could cause this problem. But, I've added some code to search.py which will do a traceback.print_stack() when an entry is written to INDEX that has a snippet but no file pointer. I'm redirecting all the stderr output to a file. If the problem happens again I'll have a lot more to work with than I do now. Could take months though!

JackDca avatar Mar 11 '21 15:03 JackDca

Bingo! The problem occurred again (actually on 2021-03-23 but I've only now had time to look at the results). I had added code to do a traceback when a metadata entry was written with a blank location but a non-trivial "snippet'. This produced about 20000 tracebacks which all look like this:

  14040   File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
  14041     self.__bootstrap_inner()
  14042   File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
  14043     self.run()
  14044   File "/home/jack/Projects/MPJack/Code/Mailpile/mailpile/mail_source/__init__.py", line 1094, in run
  14045     self.sync_mail()
  14046   File "/home/jack/Projects/MPJack/Code/Mailpile/mailpile/mail_source/__init__.py", line 319, in sync_mail
  14047     stop_after=this_batch)
  14048   File "/home/jack/Projects/MPJack/Code/Mailpile/mailpile/mail_source/__init__.py", line 1010, in rescan_mailbox
  14049     **scan_mailbox_args)
  14050   File "/home/jack/Projects/MPJack/Code/Mailpile/mailpile/search.py", line 565, in scan_mailbox
  14051     self._remove_location(session, msg_ptr)
  14052   File "/home/jack/Projects/MPJack/Code/Mailpile/mailpile/search.py", line 386, in _remove_location
  14053     self.set_msg_at_idx_pos(msg_idx_pos, msg_info)
  14054   File "/home/jack/Projects/MPJack/Code/Mailpile/mailpile/search.py", line 1717, in set_msg_at_idx_pos
  14055     traceback.print_stack()

The decision to delete the filepointer (location) is made in mailpile.search.scan_mailbox() in this code:

553            # Figure out which messages exist at all, and remove stale pointers.
554            # This happens last and in a locked section, because other threads may
555            # have added messages while we were busy with other things.
556            with self._lock:
557                messages = sorted(mbox.keys())  # Update this, in case it changed
558                existing_ptrs = set()
559                for ui in range(0, len(messages)):
560                    msg_ptr = mbox.get_msg_ptr(mailbox_idx, messages[ui])
561                    existing_ptrs.add(msg_ptr)
562                for msg_ptr in self.PTRS.keys():
563                    if (msg_ptr[:MBX_ID_LEN] == mailbox_idx and
564                            msg_ptr not in existing_ptrs):
565                        self._remove_location(session, msg_ptr)
566                        updated += 1

I have never worked on this part of the code so I don't know why it is causing the problem.

Best option is to repeat the process by adding some code to print some variables at this point and wait for the problem to happen again.

JackDca avatar Apr 05 '21 15:04 JackDca

Update: This problem occurred again on 2021-04-16. There were about 16554 metadata index lines having a snippet but no message location. Unfortunately, due to an error in my traceback code, I got no useful information from this occurrence.

A rescan all appeared to recover the data, but I have not verified that with certainty. I fixed my error (I hope) and am awaiting the next occurrence. As of the date of this post, there has been none. That's 25 days so far.

Of course, the Heisenberg Uncertainty Principle may come into play here. Is my traceback code affecting the program in a way which prevents the problem from being manifested?

Patience!!

JackDca avatar May 11 '21 13:05 JackDca

Update: The problem occurred again on 2021-06-03. Once again I shut down, saved the homedir, restarted and ran rescan all which appeared to recover all the emails (though have not proved that).

This time debug code produced some useful information. Debug information was printed from search.py function scan_mailbox whenever Mailpile deleted a message location pointer for a message that had MSG_PTRS (pointer to the file containing the message) with length < 20 (i.e. 1 or 0 message location pointers ) and MSG_BODY (the "snippet") with length > 3. Clearly the lone message pointer to a non-empty message should not be deleted. But this occurred 21293 times. In all cases the message was from the same mailbox, which was the INBOX of one of my three email accounts. That mailbox now has about 21500 emails so it looks like most or all of the emails in that INBOX were affected.. The code block below shows where the debug output was placed. The debug output showed that messages had zero length. The list messages is set twice at line 476 and 556 in this function with the statement messages = sorted(mbox.keys()). Debug code also captured the length of messages before the line 556 update. In all cases the length was 0 before and after the line 556 update.

The next step is to put debug code in mailbox_opener() and mbox.update_toc() to try to determine why one or the other produces the zero-length mbox.

These details are provided in the hope that they may be useful to someone with a better understanding of the code than mine.

Location of the debug code:

            # Figure out which messages exist at all, and remove stale pointers.
            # This happens last and in a locked section, because other threads may
            # have added messages while we were busy with other things.
            with self._lock:
                existing_ptrs = set()
                for ui in range(0, len(messages)):
                    msg_ptr = mbox.get_msg_ptr(mailbox_idx, messages[ui])
                    existing_ptrs.add(msg_ptr)
                for msg_ptr in self.PTRS.keys():
                    if (msg_ptr[:MBX_ID_LEN] == mailbox_idx and
                            msg_ptr not in existing_ptrs):
                        # ********* Debug output placed here **********
                        self._remove_location(session, msg_ptr)
                        updated += 1

The traceback (occurred 21293 times)

File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
  self.__bootstrap_inner()
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
  self.run()
File "/home/jack/Projects/MPJack/Code/Mailpile/mailpile/mail_source/__init__.py", line 1094, in run
  self.sync_mail()
File "/home/jack/Projects/MPJack/Code/Mailpile/mailpile/mail_source/__init__.py", line 319, in sync_mail
  stop_after=this_batch)
File "/home/jack/Projects/MPJack/Code/Mailpile/mailpile/mail_source/__init__.py", line 1010, in rescan_mailbox
  **scan_mailbox_args)
File "/home/jack/Projects/MPJack/Code/Mailpile/mailpile/search.py", line 590, in scan_mailbox
  traceback.print_stack()

JackDca avatar Jun 16 '21 15:06 JackDca

Update:

I inserted a statement at line 557 of search.py so that the code that deletes message location pointers is skipped if the mbox is zero length (which can't be correct). I've run Mailpile for one month and the "Message not found" problem has not occurred in that time so the kludgy fix is likely working (though I've gone for more than a month between "Message not found" occurrences before). I've also logged a lot (60 MB) of debug printout in an effort to find out why the mbox is zero length. Next step is to find some time to analyze the debug output! Not sure when I will get to it.

        if not lazy:
            # Figure out which messages exist at all, and remove stale pointers.
            # This happens last and in a locked section, because other threads may
            # have added messages while we were busy with other things.
            with self._lock:
                messages = sorted(mbox.keys())  # Update this, in case it changed
                if(len(messages) > 0):                    # *** DEBUG kludge!
                    existing_ptrs = set()


JackDca avatar Jul 17 '21 14:07 JackDca

(Almost a year later.) The above mentioned patch appears to prevent the wholesale deletion of message pointers so that emails are not lost. But it also appears to prevent messages from being deleted.

It appears that the problem may be caused by concurrent access to messages by multiple threads. In an attempt to find a workaround and further narrow down the problem I added code to prevent some threads from being launched, specifically those which do tag automation. After logging several months of debug data, it appears that the len(messages) == 0 condition occurs just as frequently even with those threads disabled. So I am no wiser and I am out of ideas.

JackDca avatar Jun 24 '22 14:06 JackDca