
Lazy and partial message cache

Open ChristophWurst opened this issue 3 years ago • 11 comments

Is your feature request related to a problem? Please describe.

As a user of the Mail app I notice that the first use experience is slow. That is because the app first indexes all my emails before I'm able to access a mailbox.

From a technical PoV we do this because experience has shown that IMAP search is not always reliable, especially if one wants to sort messages by their date. This feature depends on IMAP capabilities that are not always available. As a consequence, Horde falls back to a client-side pagination algorithm that fetches a full mailbox, sorts locally and then fetches the details of the calculated page. This trickled down as slow performance for our app.

Moreover, every connection to IMAP has a latency penalty for our web app, as a new connection needs to be established, authentication happens, etc. A classic desktop client can leave the connection open.

Describe the solution you'd like

Relax the way the message cache works. Do not index all messages at once before we give users access to the mailbox.

Without concrete technical ideas in mind, we need to match some acceptance criteria:

  1. There must be an efficient way to fetch the latest x threads in a mailbox (not just messages).
  2. There must be an efficient way to build message threads without having all message data available locally. We can not use IMAP threading because that is limited to a single mailbox and we want to thread across mailboxes. As in, even combine messages from Inbox and Sent so that threads appear like a conversation in a chat.
  3. There must be an efficient way to access the input data we need for the importance classifier training.
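To make the first criterion concrete, here is a minimal sketch of "latest x threads" over locally cached rows. It assumes each cached message already carries a thread identifier and a sent timestamp; the `CachedMessage` shape and the `latest_threads` helper are hypothetical and not the Mail app's actual schema or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedMessage:
    # Hypothetical shape of a cached row; not the Mail app's actual schema.
    uid: int
    thread_id: str
    sent_at: int  # Unix timestamp


def latest_threads(messages, limit):
    """Return the ids of the `limit` most recently active threads, newest first."""
    newest = {}
    for msg in messages:
        # A thread is as recent as its most recent message.
        if msg.sent_at > newest.get(msg.thread_id, -1):
            newest[msg.thread_id] = msg.sent_at
    ranked = sorted(newest, key=lambda tid: newest[tid], reverse=True)
    return ranked[:limit]
```

The hard part is not this ranking but the input: with a partial cache, the thread a message belongs to can depend on messages that have not been fetched yet.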

Describe alternatives you've considered

N/a

Additional context

No response

ChristophWurst avatar Dec 15 '22 07:12 ChristophWurst

This feature depends on IMAP capabilities that are not always available

What about integrating Nextcloud with only one officially supported IMAP server, with an optimized configuration that works well with Nextcloud email app? Maybe with an admin webui to manage email accounts directly in Nextcloud?

There must be an efficient way to build message threads without having all message data available locally. We can not use IMAP threading because that is limited to a single mailbox and we want to thread across mailboxes

An external search engine may help here. If not Elasticsearch (license issues etc.), something like that.

alpianon avatar Jan 09 '23 21:01 alpianon

2. We can not use IMAP threading because that is limited to a single mailbox and we want to thread across mailboxes

This depends on the server configuration. With the Dovecot virtual plugin you can set up an \All mailbox and then all messages in a thread can be fetched.

But when a thread is spread over a long period and thousands of unrelated messages are in between, it takes a long time.

For faster fetching you can call STATUS or SELECT, which return the number of messages in a mailbox. Then use SORT, SEARCH, etc. on the last N messages only. For example, if there are 5000 messages and you want the last 100: SORT (REVERSE DATE) UTF-8 4901:*

It's always best to fetch more IDs than the page size because message IDs are not related to the sent date. This is still not foolproof when someone moves old messages to other folders and the old messages get a new, higher ID.
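As a rough illustration of the "last N with a safety margin" idea, the sketch below computes the sequence set from the message count and builds the command text. The helper names and the `overfetch` factor are made up for this example, and SORT is only usable when the server advertises that capability.

```python
def last_n_sequence_set(message_count, page_size, overfetch=2.0):
    """Build an IMAP sequence set covering the newest messages of a mailbox.

    message_count is the value reported by STATUS (MESSAGES) or SELECT (EXISTS).
    overfetch is a hypothetical safety margin, because sequence order is
    arrival order and not sent-date order.
    """
    if message_count <= 0:
        return None  # empty mailbox, nothing to request
    wanted = max(1, int(page_size * overfetch))
    first = max(1, message_count - wanted + 1)
    return f"{first}:*"


def sort_command(sequence_set):
    # RFC 5256 syntax: SORT (<sort criteria>) <charset> <search keys>
    return f"SORT (REVERSE DATE) UTF-8 {sequence_set}"
```

With 5000 messages and a page of 100, an `overfetch` of 1.0 yields `4901:*` and the default of 2.0 yields `4801:*`; the client then sorts the result and keeps the first page.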

More complex is sorting by FROM, SUBJECT or SIZE because then all messages should be analyzed.

the-djmaze avatar Jan 10 '23 08:01 the-djmaze

With the Dovecot virtual plugin you can set up an \All mailbox and then all messages in a thread can be fetched.

But when a thread is spread over a long period and thousands of unrelated messages are in between, it takes a long time.

Actually, if one uses Dovecot with the fts-elastic plugin, speed is not a problem, even with hundreds of thousands of messages in between. But it currently cannot search in virtual folders https://github.com/filiphanes/fts-elastic/issues/19 :disappointed: while the Solr plugin apparently works. I need to try it.

alpianon avatar Jan 10 '23 09:01 alpianon

But when a thread is spread over a long period and thousands of unrelated messages are in between, it takes a long time.

I did test those scenarios and it's actually not a problem for the threading algorithm itself. It's relatively fast. The problem is rather that you need to fetch a lot of data to run the algorithm.

ChristophWurst avatar Jan 10 '23 09:01 ChristophWurst

There must be an efficient way to fetch the latest x threads in a mailbox (not just messages)

To me that is still the biggest blocker. Finding the x latest messages is solvable with search. Finding out if those messages belong to threads and loading that data when the message/thread is opened is a lot more complex unfortunately.

ChristophWurst avatar Jan 16 '23 11:01 ChristophWurst

Finding out if those messages belong to threads and loading that data when the message/thread is opened is a lot more complex unfortunately.

There are two headers in a MIME message for this:

  • References
  • In-Reply-To

https://www.rfc-editor.org/rfc/rfc5322#section-3.6.4

Although they are optional, they should be there. If not, the sender might not want the message to be referenced (or does, but something went wrong).
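For readers unfamiliar with these headers, here is a small sketch of how they can link messages into threads. It groups message ids with a plain union-find, which is a deliberate simplification and not the threading algorithm the Mail app uses; `thread_links` and `group_threads` are hypothetical names.

```python
import re
from email import message_from_string

# A msg-id is an angle-bracketed token; References may hold several of them.
MSG_ID = re.compile(r"<[^<>\s]+>")


def thread_links(raw_headers):
    """Return (own_id, referenced_ids) from a raw RFC 5322 header block."""
    msg = message_from_string(raw_headers)
    own = MSG_ID.findall(msg.get("Message-ID", ""))
    refs = MSG_ID.findall(msg.get("References", ""))
    refs += MSG_ID.findall(msg.get("In-Reply-To", ""))
    return (own[0] if own else None), refs


def group_threads(raw_messages):
    """Group messages that reference each other, directly or transitively."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    owned = []
    for raw in raw_messages:
        own, refs = thread_links(raw)
        if own is None:
            continue  # no Message-ID, cannot be linked
        owned.append(own)
        for ref in refs:
            parent[find(ref)] = find(own)

    threads = {}
    for own in owned:
        threads.setdefault(find(own), []).append(own)
    return sorted(sorted(ids) for ids in threads.values())
```

Note that a referenced id joins the group even when that message is not in the input, which is exactly the cross-mailbox case: the parent may live in Sent while the reply is in Inbox.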

The complex part is finding all thread messages in all mailboxes. But is that important? Mostly when you reply you are quoting the parent message and the recipient will receive his text and your comments.

the-djmaze avatar Jan 17 '23 09:01 the-djmaze

Thank you @the-djmaze. I am aware of the headers. I wrote the threading algorithm for this app.

Mostly when you reply you are quoting the parent message and the recipient will receive his text and your comments.

Fair point but along threads you will lose attachments, can't verify signed messages once they are quoted and so on. So I think there are good reasons to still show the thread as conversation, even though most text is preserved in replies.

ChristophWurst avatar Jan 17 '23 09:01 ChristophWurst

There must be an efficient way to fetch the latest x threads in a mailbox (not just messages)

To me that is still the biggest blocker. Finding the x latest messages is solvable with search. Finding out if those messages belong to threads and loading that data when the message/thread is opened is a lot more complex unfortunately.

Not sure if you are aware, so I wanted to mention what I think the Dovecot solution for this is, which is virtual folders (https://doc.dovecot.org/configuration_manual/virtual_plugin/). See the examples for a conversation view, "which shows all threads that have messages in INBOX, but shows all messages in the thread regardless of in what mailbox they physically exist in". I don't know about other IMAP servers, but it may be worth having this as an option at least for Dovecot users, since it should be much more efficient.

chbusold avatar Mar 03 '23 08:03 chbusold

That is nice, but like you say, specific to the IMAP server. We can't generally rely on a \all mailbox and therefore would have to implement threading twice.

ChristophWurst avatar Mar 03 '23 08:03 ChristophWurst

To reduce the amount of data we have to write to the database cache, it could be an interesting idea to remove the recipients table.

Pro:

  • The average message has at least two entries, one for the sender and one for the recipient. Messages to groups have one row for the sender and one for each recipient. We can save at least two INSERT statements per cached message.

Con:

  • When showing messages we have to go to IMAP to fetch the recipients. This roundtrip can cost 150ms because that is a typical time it takes for IMAP to log in.
  • Searches in recipients are potentially slower because they are performed on IMAP, not the indexed, local database.
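The write savings can be estimated with simple arithmetic. The sketch below assumes one row per message in a messages table plus one row per sender and per recipient in the recipients table, as described above; the function names and the mailbox size in the example are illustrative only.

```python
def rows_per_message(recipient_count, with_recipients_table=True):
    """Rows written to the cache for one message under the stated assumptions."""
    message_rows = 1
    if not with_recipients_table:
        return message_rows
    sender_rows = 1
    return message_rows + sender_rows + recipient_count


def rows_saved(message_count, recipients_per_message):
    """Rows no longer written when the recipients table is dropped."""
    with_table = rows_per_message(recipients_per_message, True)
    without_table = rows_per_message(recipients_per_message, False)
    return message_count * (with_table - without_table)
```

For a mailbox of 100,000 messages with a single recipient each, this comes to 200,000 fewer rows, to be weighed against the extra IMAP roundtrip when a message is displayed.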

ChristophWurst avatar Aug 11 '23 06:08 ChristophWurst

I'm trying to escape from Gmail, but I still have my personal account, where I need to look through 800K emails. I tried to connect it via IMAP, but I couldn't use it to check the latest emails because everything was stuck.

There should be a way to show at least the unread emails within a few seconds, for example an option like in Thunderbird (if I'm right, otherwise it's in mailcow's SOGo client) to sync only the last x days.
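On the IMAP side, such a request maps to standard SEARCH keys. The sketch below only builds the criteria string; `recent_unseen_criteria` is a hypothetical helper, and whether the Mail app could use it this way is an open question.

```python
from datetime import date, timedelta

# Month names are spelled out so the result does not depend on the locale.
_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def recent_unseen_criteria(days, today=None):
    """Build IMAP SEARCH keys for unread messages of the last `days` days."""
    today = today or date.today()
    since = today - timedelta(days=days)
    # IMAP dates use the form dd-Mon-yyyy, e.g. 09-Sep-2024.
    stamp = f"{since.day:02d}-{_MONTHS[since.month - 1]}-{since.year}"
    return f"UNSEEN SINCE {stamp}"
```

SINCE compares against the server's internal (arrival) date; SENTSINCE would compare against the Date header instead. The string can be passed to a SEARCH or UID SEARCH command, for instance through Python's `imaplib` `search` method.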

turboyz avatar Oct 09 '24 21:10 turboyz

@alpianon search in virtual mailbox in fts-elastic is fixed

filiphanes avatar Dec 02 '24 17:12 filiphanes