did
did copied to clipboard
Create a Public Inbox Plugin
Public Inbox is a mailing-list archive project notably used by the Linux Foundation to host the lore.kernel.org ML archive.
This plugins allows to fetch from those archives the threads that the user started or was involved in and integrate them into the report.
This is a draft PR because there's a couple of things I'm unsure of and would like some feedback on:
- It's fairly slow at the moment. This is in part due to the fact that for each mail from the user, we fetch the entire thread, and multiple times if the user participated multiple times. I think a local cache could help with that. Is there any policy wrt caching? If so, how should it be done?
- Other plugins seem to create unit tests. However, Given that this plugin depends fairly heavily on the archive itself, we would need to mock it somehow, which is pretty involved. Is the testing mandatory? If so, do we need a mock, or is there some other preferred strategy?
- I've deviated a bit from the output the other plugins follow to show the mail thread and a link to it. I'd assume the preferred way to do that now would be to use the markdown formatting? If it's disabled, should we remove the link entirely, or something else?
Thanks!
Public Inbox is a mailing-list archive project notably used by the Linux Foundation to host the lore.kernel.org ML archive.
That's nice
This plugins allows to fetch from those archives the threads that the user started or was involved in and integrate them into the report.
This is a draft PR because there's a couple of things I'm unsure of and would like some feedback on:
- It's fairly slow at the moment. This is in part due to the fact that for each mail from the user, we fetch the entire thread, and multiple times if the user participated multiple times. I think a local cache could help with that. Is there any policy wrt caching? If so, how should it be done?
- Other plugins seem to create unit tests. However, Given that this plugin depends fairly heavily on the archive itself, we would need to mock it somehow, which is pretty involved. Is the testing mandatory? If so, do we need a mock, or is there some other preferred strategy?
I think you can create a test fetching your own emails on a single day on the kernel mailing list; shouldn't overload the mail archive server and should be pretty quick if you choose a day with a single email and no long threads on it.
- I've deviated a bit from the output the other plugins follow to show the mail thread and a link to it. I'd assume the preferred way to do that now would be to use the markdown formatting? If it's disabled, should we remove the link entirely, or something else?
Makes sense to me using the markdown format for adding links.
Public Inbox is a mailing-list archive project notably used by the Linux Foundation to host the lore.kernel.org ML archive.
That's nice
Yeah, thanks for working on this improvement!
- It's fairly slow at the moment. This is in part due to the fact that for each mail from the user, we fetch the entire thread, and multiple times if the user participated multiple times. I think a local cache could help with that. Is there any policy wrt caching? If so, how should it be done?
So far it seems no did
plugin uses caching. I'd say we could start simple: Either create the first proof-of-concept without caching at all or implement simple cache in memory (just store and index all threads which have been already fetched before).
- Other plugins seem to create unit tests. However, Given that this plugin depends fairly heavily on the archive itself, we would need to mock it somehow, which is pretty involved. Is the testing mandatory? If so, do we need a mock, or is there some other preferred strategy?
I think you can create a test fetching your own emails on a single day on the kernel mailing list; shouldn't overload the mail archive server and should be pretty quick if you choose a day with a single email and no long threads on it.
Agreed: Choosing a narrow since/until interval on a public public-inbox instance should be completely sufficient I'd say.
- I've deviated a bit from the output the other plugins follow to show the mail thread and a link to it. I'd assume the preferred way to do that now would be to use the markdown formatting? If it's disabled, should we remove the link entirely, or something else?
Makes sense to me using the markdown format for adding links.
Sounds good to me to provide the rich output including links in the markdown
format and use something concise (just the subject?) for the text output.
I've just updated the branch taking your comments into account:
- I've added caching, reducing the runtime on my machine, with last week data, from 2 minutes to 17s. It's been mostly about reducing the network requests to a minimum, so hopefully we don't do them twice.
- I've added markdown support too. I still like the old output format so I kept it, but let me know if it's an issue
- I've added a couple of unit tests to make sure we get a somewhat sane list of messages.
Thanks much for the changes! Looks very good. I'm proposing just a couple of rather minor adjustments in 339d788. Could you please have a look if the changes are ok? Included is the
--verbose
mode for showing the full url of the message. I would like to keep this formatting (e.g. one-line per item by default) consistent across all plugins.
It looks great to me, thanks!
@mripard thanks for this PR. Really useful!
I used this plugin for a while without issues, but I just hit one today while running did --since 2024-01-08 --until 2024-01-14
.
My configuration is:
[general]
email = "Stefano Garzarella" <[email protected]>
width = 160
plugins = ~/.did/plugins
[ml]
type = public_inbox
url = https://lore.kernel.org
...
For now I have public_inbox.py
as an external plugin in ~/.did/plugins
, but it is a link to the file contained in this PR.
This is the error:
$ ~/repos/did/bin/did --since 2024-01-08 --until 2024-01-14 > W02.md
Traceback (most recent call last):
File "/home/stefano/.did/plugins/public_inbox.py", line 110, in __get_mbox_from_content
content = gzip.decompress(content)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/gzip.py", line 627, in decompress
if _read_gzip_header(fp) is None:
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/gzip.py", line 456, in _read_gzip_header
raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'No')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/stefano/repos/did/bin/did", line 42, in <module>
did.cli.main()
File "/usr/lib/python3.12/site-packages/did/cli.py", line 238, in main
user_stats.check()
File "/usr/lib/python3.12/site-packages/did/stats.py", line 158, in check
stat.check()
File "/usr/lib/python3.12/site-packages/did/stats.py", line 158, in check
stat.check()
File "/usr/lib/python3.12/site-packages/did/stats.py", line 78, in check
self.fetch()
File "/home/stefano/.did/plugins/public_inbox.py", line 225, in fetch
self.stats = [
^
File "/home/stefano/.did/plugins/public_inbox.py", line 190, in get_all_threads
self.threads_cache[(since, until)] = self.__fetch_all_threads(since, until)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/stefano/.did/plugins/public_inbox.py", line 185, in __fetch_all_threads
mbox = self.__get_mbox_from_content(resp.content)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/stefano/.did/plugins/public_inbox.py", line 111, in __get_mbox_from_content
except BadGzipFile:
^^^^^^^^^^^
NameError: name 'BadGzipFile' is not defined
As rapid fix, I applied the following patch, but I'm not sure if it is the right thing to do:
diff --git a/did/plugins/public_inbox.py b/did/plugins/public_inbox.py
index ec8e333..d3df993 100644
--- a/did/plugins/public_inbox.py
+++ b/did/plugins/public_inbox.py
@@ -106,7 +106,10 @@ class PublicInbox(object):
item(self._get_message_url(msg), level=2, options=opt)
def __get_mbox_from_content(self, content: bytes) -> mailbox.mbox:
- content = gzip.decompress(content)
+ try:
+ content = gzip.decompress(content)
+ except gzip.BadGzipFile:
+ return {}
with tempfile.NamedTemporaryFile() as tmp:
tmp.write(content)
@stefano-garzarella Hi, I'm glad it's helpful. I've pushed a new version that should fix it, plus a unit test to cover the failing case (we were ignoring the HTTP request status code and assuming the result was a GZIP archive).
I've also squashed the patch @psss did earlier.
/packit test
Hi, is there anything I can help with to get this merged?
Sorry for late response, just too many things on my plate... I checked the latest version but I get the following error when running the test:
def __get_thread_root(self, msg: Message) -> Message:
log.debug("Looking for thread root of message %s" % msg.id())
if msg.is_thread_root():
log.debug("Message is thread root already. Returning.")
return msg
parent_id = msg.parent_id()
if parent_id not in self.messages_cache:
root = self.__fetch_thread_root(msg)
log.debug("Found root message %s for message %s" % (root.id(), msg.id()))
return root
while True:
log.debug("Parent is %s" % parent_id)
> assert parent_id in self.messages_cache
E AssertionError
../../did/plugins/public-inbox.py:158: AssertionError
Could you please have a look at it?
Also, I'm not sure it's a good idea to call the plugin file public-inbox.py
as it cannot be directly imported because of the dash in the name. What about just substituting the underscore with dash when importing the plugin modules?
Sorry for late response, just too many things on my plate... I checked the latest version but I get the following error when running the test:
def __get_thread_root(self, msg: Message) -> Message: log.debug("Looking for thread root of message %s" % msg.id()) if msg.is_thread_root(): log.debug("Message is thread root already. Returning.") return msg parent_id = msg.parent_id() if parent_id not in self.messages_cache: root = self.__fetch_thread_root(msg) log.debug("Found root message %s for message %s" % (root.id(), msg.id())) return root while True: log.debug("Parent is %s" % parent_id) > assert parent_id in self.messages_cache E AssertionError ../../did/plugins/public-inbox.py:158: AssertionError
Could you please have a look at it?
I think I fixed it already but wasn't sure whether I should push it or wait for it to be merged and then send it. I guess I have my answer :)
Also, I'm not sure it's a good idea to call the plugin file
public-inbox.py
as it cannot be directly imported because of the dash in the name. What about just substituting the underscore with dash when importing the plugin modules?
I don't have an opinion here, but also I'm not sure how to do that in Python. Would you have any pointers?
I think I fixed it already but wasn't sure whether I should push it or wait for it to be merged and then send it. I guess I have my answer :)
:)
I don't have an opinion here, but also I'm not sure how to do that in Python. Would you have any pointers?
It would be just slightly modifying the way how plugins are imported. The relevant code should be around the load_components()
function:
https://github.com/psss/did/blob/070a535d9250200ddfb8c43f9ae81f999fe51492/did/utils.py#L115
I just pushed a new version with public_inbox.py
, and the bug you mentioned fixed with a unit test to cover the failing condition
Is there anything else I can do to help get this merged?
Thanks for the suggestions, I've added your suggestions or reworked the code to address your concerns.
Thanks for addressing the comments! Now looks good. I've just modified the first commit summary and removed the prefix, see the recommendations for some more context.
@mripard I just received this error today:
did --format markdown --since 2024-05-06 --until 2024-06-12
...
Traceback (most recent call last):
File "/usr/bin/did", line 42, in <module>
did.cli.main()
File "/usr/lib/python3.12/site-packages/did/cli.py", line 239, in main
user_stats.check()
File "/usr/lib/python3.12/site-packages/did/stats.py", line 158, in check
stat.check()
File "/usr/lib/python3.12/site-packages/did/stats.py", line 158, in check
stat.check()
File "/usr/lib/python3.12/site-packages/did/stats.py", line 78, in check
self.fetch()
File "/usr/lib/python3.12/site-packages/did/plugins/public_inbox.py", line 231, in fetch
self.stats = [
^
File "/usr/lib/python3.12/site-packages/did/plugins/public_inbox.py", line 207, in get_all_threads
root = self.__get_thread_root(msg)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/did/plugins/public_inbox.py", line 153, in __get_thread_root
log.debug("Found root message %s for message %s" % (root.id(), msg.id()))
^^^^^^^
AttributeError: 'NoneType' object has no attribute 'id'
Any idea what is wrong?
Sorry for the late answer.
I don't have an immediate answer. Could you share your config (or at least the lore part) so I can reproduce it?
Sorry for the late answer.
I don't have an immediate answer. Could you share your config (or at least the lore part) so I can reproduce it?
Sure, my config contains:
[general]
email = "Stefano Garzarella" <[email protected]>
width = 160
[ml]
type = public-inbox
url = https://lore.kernel.org
...
I had the issue running did --format markdown --since 2024-05-06 --until 2024-06-12
.
For now, I'm using the following workaround:
diff --git a/did/plugins/public_inbox.py b/did/plugins/public_inbox.py
index 888f41c..70e1233 100644
--- a/did/plugins/public_inbox.py
+++ b/did/plugins/public_inbox.py
@@ -150,6 +150,9 @@ class PublicInbox(object):
parent_id = msg.parent_id()
if parent_id not in self.messages_cache:
root = self.__fetch_thread_root(msg)
+ if root is None:
+ return None
+
log.debug("Found root message %s for message %s" % (root.id(), msg.id()))
return root
@@ -164,6 +167,9 @@ class PublicInbox(object):
parent_id = parent.parent_id()
if parent_id not in self.messages_cache:
root = self.__fetch_thread_root(msg)
+ if root is None:
+ return None
+
log.debug(
"Found root message %s for message %s" %
(root.id(), msg.id()))
@@ -205,6 +211,9 @@ class PublicInbox(object):
if not msg.is_thread_root():
root = self.__get_thread_root(msg)
+ if root is None:
+ continue
+
root_id = root.id()
if root_id in found:
log.debug("Root message already encountered... Skip.")
@stefano-garzarella @mripard it seems that issue fixed at https://github.com/mripard/did/pull/1 happens again. I see that the latest code that landed does not check if root is None
before attempting to get root.id()
.
@stefano-garzarella IMO your code is not a workaround but the proper fix since the problem is AFAIU when the thread is broke by some MUA or the list and there's no root message in the thread to reference anymore.
@stefano-garzarella IMO your code is not a workaround but the proper fix since the problem is AFAIU when the thread is broke by some MUA or the list and there's no root message in the thread to reference anymore.
@martinezjavier I'm not sure if continue
is the best thing to do in get_all_threads()
if we can't find root.
I've spent some more time looking into it. The offending mail you first reported this issue for is: https://lore.kernel.org/all/AS2P194MB2170A39251AC72302E8F6F459ACB2@AS2P194MB2170.EURP194.PROD.OUTLOOK.COM/ and it looks like the culprit is due to root message not being archived by lore, and thus while the message has a correct in-reply-to chain, lore doesn't give us the root message.
I've started to work on a proper fix (that includes fixing the type hints too, since a significant part were wrong / poorly handled), and indeed, I think that yielding the current message directly when we don't have a root makes it a bit nicer than just ignoring it.
I've started to work on a proper fix (that includes fixing the type hints too, since a significant part were wrong / poorly handled), and indeed, I think that yielding the current message directly when we don't have a root makes it a bit nicer than just ignoring it.
yeah, that would be much better! Thanks for working on a proper fix!