
Multithread the mail-message processing

Open rodpayne opened this issue 1 year ago • 3 comments

I have been experimenting with multithreading the mail-message processing. Each mail message in a batch is processed "in parallel" so that when one thread is waiting for a DNS timeout or other I/O, another one can keep on processing. I tried multiprocessing too, but could not work out how to share the cache files instead of duplicating (and diluting) them between the processes. On my system at least, the CPU is not a bottleneck, so full multiprocessing does not provide much more benefit.

Let me know what you think. There is probably more cleanup to be done. Maybe a little more restructuring to handle saving the results in the thread. This may also play into the goal of saving the results before moving or deleting the mail message.

rodpayne avatar Apr 10 '24 05:04 rodpayne

The code is working fine, at least for my use case (graph mailbox). I have run the following benchmarks to look at the performance.

Significant Options:

[general]
n_procs = 4
dns_timeout = 10.0
…
[msgraph]
…
[mailbox]
reports_folder = Inbox/OneDaySample
batch_size = 50
archive_folder = Archive/OneDaySample

The one-day sample has 428 mail messages with 513,330 reports from 4/10/2024.

Version 8.10.3 + #509 (multithreading change)

Run with batch_size = 50 and dns_timeout = 10.0
Elapsed time: 02:08:41

Rerun with batch_size = 50 and dns_timeout = 6.0
Elapsed time: 01:09:03 * (Can't explain the outlier.)
Elapsed time: 00:39:08
Elapsed time: 00:37:26
Elapsed time: 00:38:25

Rerun with batch_size = 500 and dns_timeout = 6.0
Elapsed time: 00:22:54
Elapsed time: 00:20:04

Version 8.10.3 w/o #509 (cache improvements only)

Run with batch_size = 50
Also, effectively, dns_timeout = 6.0 because of a bug in propagating the setting.
Elapsed time: 01:08:15

Rerun with batch_size = 500
Elapsed time: 00:46:39

Version 8.6.4 (before cache and multithreading changes)

With batch_size = 50
Elapsed time: 21:57:12 (Yes, almost a day to process a day's mail messages.)
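
For a rough sense of scale, the elapsed times listed above can be converted to seconds and compared as ratios. The helper names `to_seconds` and `speedup` are made up for this sketch. The figures come from one or a few runs each, so the ratios are only indicative.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def speedup(before, after):
    """How many times faster 'after' is than 'before'."""
    return to_seconds(before) / to_seconds(after)


# Elapsed times copied from the benchmark runs above.
baseline_8_6_4 = "21:57:12"     # 8.6.4, batch_size = 50
cache_only_50 = "01:08:15"      # 8.10.3 w/o #509, batch_size = 50
cache_only_500 = "00:46:39"     # 8.10.3 w/o #509, batch_size = 500
threaded_500_best = "00:20:04"  # 8.10.3 + #509, batch_size = 500

print("cache changes alone: %.1fx" % speedup(baseline_8_6_4, cache_only_50))
print("threads on top (batch 500): %.2fx" % speedup(cache_only_500, threaded_500_best))
```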

rodpayne avatar Apr 13 '24 16:04 rodpayne

Hi,

Sorry, I'm just getting around to addressing PRs. Can you rebase this PR and fix the conflicts?

seanthegeek avatar May 22 '24 12:05 seanthegeek

I have been off on sick leave. I should be back to work in the next few weeks, and I will look at it then.

rodpayne avatar Jun 20 '24 18:06 rodpayne

I am currently testing the rebase of my changes. It is running without error, but it will take some more looking to make sure I didn't disturb any of the other changes.

rodpayne avatar Aug 13 '24 20:08 rodpayne

Codecov Report

Attention: Patch coverage is 15.15152% with 112 lines in your changes missing coverage. Please review.

Project coverage is 60.62%. Comparing base (d6128ea) to head (066138c). Report is 36 commits behind head on master.

Files with missing lines    Patch %   Lines
parsedmarc/__init__.py      3.47%     111 missing
parsedmarc/mail/graph.py    0.00%     1 missing
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #509      +/-   ##
==========================================
+ Coverage   59.88%   60.62%   +0.73%     
==========================================
  Files          12       12              
  Lines        1578     1671      +93     
==========================================
+ Hits          945     1013      +68     
- Misses        633      658      +25     


codecov[bot] avatar Aug 14 '24 20:08 codecov[bot]

@rodpayne I'm curious about your thoughts on this https://github.com/nhairs/parsedmarc-fork/issues/1

seanthegeek avatar Aug 25 '24 01:08 seanthegeek

> @rodpayne I'm curious about your thoughts on this nhairs/parsedmarc-fork#1

I am fine with Option A. I posted some comments in that discussion.

rodpayne avatar Aug 28 '24 19:08 rodpayne

Closing per the discussion here

seanthegeek avatar Aug 31 '24 16:08 seanthegeek