did icon indicating copy to clipboard operation
did copied to clipboard

Create a Public Inbox Plugin

Open mripard opened this issue 1 year ago • 12 comments

Public Inbox is a mailing-list archive project notably used by the Linux Foundation to host the lore.kernel.org ML archive.

This plugins allows to fetch from those archives the threads that the user started or was involved in and integrate them into the report.

This is a draft PR because there's a couple of things I'm unsure of and would like some feedback on:

  • It's fairly slow at the moment. This is in part due to the fact that for each mail from the user, we fetch the entire thread, and multiple times if the user participated multiple times. I think a local cache could help with that. Is there any policy wrt caching? If so, how should it be done?
  • Other plugins seem to create unit tests. However, Given that this plugin depends fairly heavily on the archive itself, we would need to mock it somehow, which is pretty involved. Is the testing mandatory? If so, do we need a mock, or is there some other preferred strategy?
  • I've deviated a bit from the output the other plugins follow to show the mail thread and a link to it. I'd assume the preferred way to do that now would be to use the markdown formatting? If it's disabled, should we remove the link entirely, or something else?

Thanks!

mripard avatar Sep 26 '23 11:09 mripard

Public Inbox is a mailing-list archive project notably used by the Linux Foundation to host the lore.kernel.org ML archive.

That's nice

This plugins allows to fetch from those archives the threads that the user started or was involved in and integrate them into the report.

This is a draft PR because there's a couple of things I'm unsure of and would like some feedback on:

  • It's fairly slow at the moment. This is in part due to the fact that for each mail from the user, we fetch the entire thread, and multiple times if the user participated multiple times. I think a local cache could help with that. Is there any policy wrt caching? If so, how should it be done?
  • Other plugins seem to create unit tests. However, Given that this plugin depends fairly heavily on the archive itself, we would need to mock it somehow, which is pretty involved. Is the testing mandatory? If so, do we need a mock, or is there some other preferred strategy?

I think you can create a test fetching your own emails on a single day on the kernel mailing list; shouldn't overload the mail archive server and should be pretty quick if you choose a day with a single email and no long threads on it.

  • I've deviated a bit from the output the other plugins follow to show the mail thread and a link to it. I'd assume the preferred way to do that now would be to use the markdown formatting? If it's disabled, should we remove the link entirely, or something else?

Makes sense to me using the markdown format for adding links.

sandrobonazzola avatar Sep 28 '23 12:09 sandrobonazzola

Public Inbox is a mailing-list archive project notably used by the Linux Foundation to host the lore.kernel.org ML archive.

That's nice

Yeah, thanks for working on this improvement!

  • It's fairly slow at the moment. This is in part due to the fact that for each mail from the user, we fetch the entire thread, and multiple times if the user participated multiple times. I think a local cache could help with that. Is there any policy wrt caching? If so, how should it be done?

So far it seems no did plugin uses caching. I'd say we could start simple: Either create the first proof-of-concept without caching at all or implement simple cache in memory (just store and index all threads which have been already fetched before).

  • Other plugins seem to create unit tests. However, Given that this plugin depends fairly heavily on the archive itself, we would need to mock it somehow, which is pretty involved. Is the testing mandatory? If so, do we need a mock, or is there some other preferred strategy?

I think you can create a test fetching your own emails on a single day on the kernel mailing list; shouldn't overload the mail archive server and should be pretty quick if you choose a day with a single email and no long threads on it.

Agreed: Choosing a narrow since/until interval on a public public-inbox instance should be completely sufficient I'd say.

  • I've deviated a bit from the output the other plugins follow to show the mail thread and a link to it. I'd assume the preferred way to do that now would be to use the markdown formatting? If it's disabled, should we remove the link entirely, or something else?

Makes sense to me using the markdown format for adding links.

Sounds good to me to provide the rich output including links in the markdown format and use something concise (just the subject?) for the text output.

psss avatar Nov 10 '23 13:11 psss

I've just updated the branch taking your comments into account:

  • I've added caching, reducing the runtime on my machine, with last week data, from 2 minutes to 17s. It's been mostly about reducing the network requests to a minimum, so hopefully we don't do them twice.
  • I've added markdown support too. I still like the old output format so I kept it, but let me know if it's an issue
  • I've added a couple of unit tests to make sure we get a somewhat sane list of messages.

mripard avatar Dec 13 '23 14:12 mripard

Thanks much for the changes! Looks very good. I'm proposing just a couple of rather minor adjustments in 339d788. Could you please have a look if the changes are ok? Included is the --verbose mode for showing the full url of the message. I would like to keep this formatting (e.g. one-line per item by default) consistent across all plugins.

It looks great to me, thanks!

mripard avatar Jan 04 '24 08:01 mripard

@mripard thanks for this PR. Really useful!

I used this plugin for a while without issues, but I just hit one today while running did --since 2024-01-08 --until 2024-01-14.

My configuration is:

[general]
email = "Stefano Garzarella" <[email protected]>
width = 160
plugins = ~/.did/plugins

[ml]
type = public_inbox
url = https://lore.kernel.org

...

For now I have public_inbox.py as an external plugin in ~/.did/plugins, but it is a link to the file contained in this PR.

This is the error:

$ ~/repos/did/bin/did --since 2024-01-08 --until 2024-01-14 > W02.md
Traceback (most recent call last):
  File "/home/stefano/.did/plugins/public_inbox.py", line 110, in __get_mbox_from_content
    content = gzip.decompress(content)
              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/gzip.py", line 627, in decompress
    if _read_gzip_header(fp) is None:
       ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/gzip.py", line 456, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'No')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/stefano/repos/did/bin/did", line 42, in <module>
    did.cli.main()
  File "/usr/lib/python3.12/site-packages/did/cli.py", line 238, in main
    user_stats.check()
  File "/usr/lib/python3.12/site-packages/did/stats.py", line 158, in check
    stat.check()
  File "/usr/lib/python3.12/site-packages/did/stats.py", line 158, in check
    stat.check()
  File "/usr/lib/python3.12/site-packages/did/stats.py", line 78, in check
    self.fetch()
  File "/home/stefano/.did/plugins/public_inbox.py", line 225, in fetch
    self.stats = [
                 ^
  File "/home/stefano/.did/plugins/public_inbox.py", line 190, in get_all_threads
    self.threads_cache[(since, until)] = self.__fetch_all_threads(since, until)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefano/.did/plugins/public_inbox.py", line 185, in __fetch_all_threads
    mbox = self.__get_mbox_from_content(resp.content)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefano/.did/plugins/public_inbox.py", line 111, in __get_mbox_from_content
    except BadGzipFile:
           ^^^^^^^^^^^
NameError: name 'BadGzipFile' is not defined

As rapid fix, I applied the following patch, but I'm not sure if it is the right thing to do:

diff --git a/did/plugins/public_inbox.py b/did/plugins/public_inbox.py
index ec8e333..d3df993 100644
--- a/did/plugins/public_inbox.py
+++ b/did/plugins/public_inbox.py
@@ -106,7 +106,10 @@ class PublicInbox(object):
                 item(self._get_message_url(msg), level=2, options=opt)
 
     def __get_mbox_from_content(self, content: bytes) -> mailbox.mbox:
-        content = gzip.decompress(content)
+        try:
+            content = gzip.decompress(content)
+        except gzip.BadGzipFile:
+            return {}
 
         with tempfile.NamedTemporaryFile() as tmp:
             tmp.write(content)

stefano-garzarella avatar Jan 22 '24 11:01 stefano-garzarella

@stefano-garzarella Hi, I'm glad it's helpful. I've pushed a new version that should fix it, plus a unit test to cover the failing case (we were ignoring the HTTP request status code and assuming the result was a GZIP archive).

I've also squashed the patch @psss did earlier.

mripard avatar Jan 29 '24 12:01 mripard

/packit test

lukaszachy avatar Feb 01 '24 09:02 lukaszachy

Hi, is there anything I can help with to get this merged?

mripard avatar Feb 26 '24 10:02 mripard

Sorry for late response, just too many things on my plate... I checked the latest version but I get the following error when running the test:

    def __get_thread_root(self, msg: Message) -> Message:
        log.debug("Looking for thread root of message %s" % msg.id())
        if msg.is_thread_root():
            log.debug("Message is thread root already. Returning.")
            return msg
    
        parent_id = msg.parent_id()
        if parent_id not in self.messages_cache:
            root = self.__fetch_thread_root(msg)
            log.debug("Found root message %s for message %s" % (root.id(), msg.id()))
            return root
    
        while True:
            log.debug("Parent is %s" % parent_id)
>           assert parent_id in self.messages_cache
E           AssertionError

../../did/plugins/public-inbox.py:158: AssertionError

Could you please have a look at it?

Also, I'm not sure it's a good idea to call the plugin file public-inbox.py as it cannot be directly imported because of the dash in the name. What about just substituting the underscore with dash when importing the plugin modules?

psss avatar Mar 07 '24 08:03 psss

Sorry for late response, just too many things on my plate... I checked the latest version but I get the following error when running the test:

    def __get_thread_root(self, msg: Message) -> Message:
        log.debug("Looking for thread root of message %s" % msg.id())
        if msg.is_thread_root():
            log.debug("Message is thread root already. Returning.")
            return msg
    
        parent_id = msg.parent_id()
        if parent_id not in self.messages_cache:
            root = self.__fetch_thread_root(msg)
            log.debug("Found root message %s for message %s" % (root.id(), msg.id()))
            return root
    
        while True:
            log.debug("Parent is %s" % parent_id)
>           assert parent_id in self.messages_cache
E           AssertionError

../../did/plugins/public-inbox.py:158: AssertionError

Could you please have a look at it?

I think I fixed it already but wasn't sure whether I should push it or wait for it to be merged and then send it. I guess I have my answer :)

Also, I'm not sure it's a good idea to call the plugin file public-inbox.py as it cannot be directly imported because of the dash in the name. What about just substituting the underscore with dash when importing the plugin modules?

I don't have an opinion here, but also I'm not sure how to do that in Python. Would you have any pointers?

mripard avatar Mar 07 '24 09:03 mripard

I think I fixed it already but wasn't sure whether I should push it or wait for it to be merged and then send it. I guess I have my answer :)

:)

I don't have an opinion here, but also I'm not sure how to do that in Python. Would you have any pointers?

It would be just slightly modifying the way how plugins are imported. The relevant code should be around the load_components() function:

https://github.com/psss/did/blob/070a535d9250200ddfb8c43f9ae81f999fe51492/did/utils.py#L115

psss avatar Mar 07 '24 09:03 psss

I just pushed a new version with public_inbox.py, and the bug you mentioned fixed with a unit test to cover the failing condition

mripard avatar Mar 07 '24 12:03 mripard

Is there anything else I can do to help get this merged?

mripard avatar Apr 23 '24 06:04 mripard

Thanks for the suggestions, I've added your suggestions or reworked the code to address your concerns.

mripard avatar Apr 29 '24 08:04 mripard

Thanks for addressing the comments! Now looks good. I've just modified the first commit summary and removed the prefix, see the recommendations for some more context.

psss avatar Apr 29 '24 13:04 psss

@mripard I just received this error today:

did --format markdown --since 2024-05-06 --until 2024-06-12
...
Traceback (most recent call last):
  File "/usr/bin/did", line 42, in <module>
    did.cli.main()
  File "/usr/lib/python3.12/site-packages/did/cli.py", line 239, in main
    user_stats.check()
  File "/usr/lib/python3.12/site-packages/did/stats.py", line 158, in check
    stat.check()
  File "/usr/lib/python3.12/site-packages/did/stats.py", line 158, in check
    stat.check()
  File "/usr/lib/python3.12/site-packages/did/stats.py", line 78, in check
    self.fetch()
  File "/usr/lib/python3.12/site-packages/did/plugins/public_inbox.py", line 231, in fetch
    self.stats = [
                 ^
  File "/usr/lib/python3.12/site-packages/did/plugins/public_inbox.py", line 207, in get_all_threads
    root = self.__get_thread_root(msg)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/did/plugins/public_inbox.py", line 153, in __get_thread_root
    log.debug("Found root message %s for message %s" % (root.id(), msg.id()))
                                                        ^^^^^^^
AttributeError: 'NoneType' object has no attribute 'id'

Any idea what is wrong?

stefano-garzarella avatar May 13 '24 09:05 stefano-garzarella

Sorry for the late answer.

I don't have an immediate answer. Could you share your config (or at least the lore part) so I can reproduce it?

mripard avatar Jun 04 '24 12:06 mripard

Sorry for the late answer.

I don't have an immediate answer. Could you share your config (or at least the lore part) so I can reproduce it?

Sure, my config contains:

[general]
email = "Stefano Garzarella" <[email protected]>
width = 160

[ml]
type = public-inbox
url = https://lore.kernel.org
...

I had the issue running did --format markdown --since 2024-05-06 --until 2024-06-12.

For now, I'm using the following workaround:

diff --git a/did/plugins/public_inbox.py b/did/plugins/public_inbox.py
index 888f41c..70e1233 100644
--- a/did/plugins/public_inbox.py
+++ b/did/plugins/public_inbox.py
@@ -150,6 +150,9 @@ class PublicInbox(object):
         parent_id = msg.parent_id()
         if parent_id not in self.messages_cache:
             root = self.__fetch_thread_root(msg)
+            if root is None:
+                return None
+
             log.debug("Found root message %s for message %s" % (root.id(), msg.id()))
             return root
 
@@ -164,6 +167,9 @@ class PublicInbox(object):
             parent_id = parent.parent_id()
             if parent_id not in self.messages_cache:
                 root = self.__fetch_thread_root(msg)
+                if root is None:
+                    return None
+
                 log.debug(
                     "Found root message %s for message %s" %
                     (root.id(), msg.id()))
@@ -205,6 +211,9 @@ class PublicInbox(object):
 
             if not msg.is_thread_root():
                 root = self.__get_thread_root(msg)
+                if root is None:
+                    continue
+
                 root_id = root.id()
                 if root_id in found:
                     log.debug("Root message already encountered... Skip.")

stefano-garzarella avatar Jun 04 '24 12:06 stefano-garzarella

@stefano-garzarella @mripard it seems that issue fixed at https://github.com/mripard/did/pull/1 happens again. I see that the latest code that landed does not check if root is None before attempting to get root.id().

@stefano-garzarella IMO your code is not a workaround but the proper fix since the problem is AFAIU when the thread is broke by some MUA or the list and there's no root message in the thread to reference anymore.

martinezjavier avatar Jul 15 '24 05:07 martinezjavier

@stefano-garzarella IMO your code is not a workaround but the proper fix since the problem is AFAIU when the thread is broke by some MUA or the list and there's no root message in the thread to reference anymore.

@martinezjavier I'm not sure if continue is the best thing to do in get_all_threads() if we can't find root.

stefano-garzarella avatar Jul 15 '24 09:07 stefano-garzarella

I've spent some more time looking into it. The offending mail you first reported this issue for is: https://lore.kernel.org/all/AS2P194MB2170A39251AC72302E8F6F459ACB2@AS2P194MB2170.EURP194.PROD.OUTLOOK.COM/ and it looks like the culprit is due to root message not being archived by lore, and thus while the message has a correct in-reply-to chain, lore doesn't give us the root message.

I've started to work on a proper fix (that includes fixing the type hints too, since a significant part were wrong / poorly handled), and indeed, I think that yielding the current message directly when we don't have a root makes it a bit nicer than just ignoring it.

mripard avatar Jul 15 '24 09:07 mripard

I've started to work on a proper fix (that includes fixing the type hints too, since a significant part were wrong / poorly handled), and indeed, I think that yielding the current message directly when we don't have a root makes it a bit nicer than just ignoring it.

yeah, that would be much better! Thanks for working on a proper fix!

stefano-garzarella avatar Jul 15 '24 09:07 stefano-garzarella