parsedmarc
Multithread the mail-message processing
I have been experimenting with multithreading the mail-message processing. Each mail message in a batch is processed "in parallel" so that when one thread is waiting for a DNS timeout or other I/O, another one can keep on processing. I tried multiprocessing too, but could not work out how to share the cache files instead of duplicating (and diluting) them between the processes. On my system at least, the CPU is not a bottleneck, so full multiprocessing does not provide much more benefit.
Let me know what you think. There is probably more cleanup to be done. Maybe a little more restructuring to handle saving the results in the thread. This may also play into the goal of saving the results before moving or deleting the mail message.
The code is working fine, at least for my use case (graph mailbox). I have run the following benchmarks to look at the performance.
Significant Options:
[general]
n_procs = 4
dns_timeout = 10.0
…
[msgraph]
…
[mailbox]
reports_folder = Inbox/OneDaySample
batch_size = 50
archive_folder = Archive/OneDaySample
The one-day sample has 428 mail messages with 513,330 reports from 4/10/2024.
Version 8.10.3 + #509 (multithreading change)
Run with batch_size = 50 and dns_timeout = 10.0
Elapsed time: 02:08:41.
Rerun with batch_size = 50 and dns_timeout = 6.0
Elapsed time: 01:09:03 * (Can't explain the outlier.)
Elapsed time: 00:39:08
Elapsed time: 00:37:26
Elapsed time: 00:38:25
Rerun with batch_size = 500 and dns_timeout = 6.0
Elapsed time: 00:22:54
Elapsed time: 00:20:04
Version 8.10.3 w/o #509 (cache improvements only)
Run with batch_size = 50
Also, effectively, dns_timeout = 6.0 because of a bug in propagating the setting.
Elapsed time: 01:08:15
Rerun with batch_size = 500
Elapsed time: 00:46:39
Version 8.6.4 (before cache and multithreading changes)
With batch_size = 50
Elapsed time: 21:57:12 (Yes, almost a day to process a day's mail messages.)
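To put the elapsed times above side by side, here is a small sketch that converts them to speedup factors relative to the 8.6.4 run. Where a configuration was run several times, it uses the fastest run, which is my own choice for the comparison; the helper name `parse_elapsed` is made up for this example.

```python
from datetime import timedelta


def parse_elapsed(text):
    """Convert an 'HH:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# Elapsed times copied from the benchmark runs listed above
# (fastest run per configuration; an assumption made for this comparison).
runs = {
    "8.6.4, batch_size=50": "21:57:12",
    "8.10.3 w/o #509, batch_size=50": "01:08:15",
    "8.10.3 w/o #509, batch_size=500": "00:46:39",
    "8.10.3 + #509, batch_size=50, dns_timeout=6.0": "00:37:26",
    "8.10.3 + #509, batch_size=500, dns_timeout=6.0": "00:20:04",
}

baseline = parse_elapsed(runs["8.6.4, batch_size=50"])

for label, elapsed in runs.items():
    # Dividing two timedeltas yields a plain float ratio.
    speedup = baseline / parse_elapsed(elapsed)
    print(f"{label}: {elapsed} ({speedup:.1f}x vs. 8.6.4)")
```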
Hi,
Sorry, I'm just getting around to addressing PRs. Can you rebase this PR and fix the conflicts?
I have been off on sick leave. I should be back to work in the next few weeks, and I will look at it then.
I am currently testing the rebase of my changes. It is running without error, but it will take some more looking to make sure I didn't disturb any of the other changes.
Codecov Report
Attention: Patch coverage is 15.15152% with 112 lines in your changes missing coverage. Please review.
Project coverage is 60.62%. Comparing base (d6128ea) to head (066138c). Report is 36 commits behind head on master.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| parsedmarc/__init__.py | 3.47% | 111 Missing :warning: |
| parsedmarc/mail/graph.py | 0.00% | 1 Missing :warning: |
Additional details and impacted files
| Coverage Diff | master | #509 | +/- |
|---|---|---|---|
| Coverage | 59.88% | 60.62% | +0.73% |
| Files | 12 | 12 | |
| Lines | 1578 | 1671 | +93 |
| Hits | 945 | 1013 | +68 |
| Misses | 633 | 658 | +25 |
@rodpayne I'm curious about your thoughts on this https://github.com/nhairs/parsedmarc-fork/issues/1
I am fine with Option A. I posted some comments in that discussion.
Closing per the discussion here