
Add parser/reader for .mbox files

Open minosvasilias opened this issue 2 years ago • 1 comments

  • Adds parser for .mbox files (MboxParser)
    • This is useful for indexing dumps of mail inboxes, such as those provided by Gmail (takeout.google.com)
    • Can be used as part of SimpleDirectoryReader
    • Supports max_count arg, which sets the maximum number of messages to parse per mbox file. Useful for test runs without using too many tokens.
    • Supports message_format arg, overriding default string formatting for messages. See MboxParser class for default value.
    • Dependencies: beautifulsoup4 (to strip messages of HTML syntax)
  • Adds reader for .mbox files (MboxReader)
    • Iterates through a directory reading exclusively .mbox files
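
For readers who want a feel for what the two parser options do, here is a rough standard-library sketch. It is not the MboxParser code from this PR: `parse_mbox`, `strip_html`, and `DEFAULT_FORMAT` are hypothetical names, the template is an assumed stand-in for the real default in the MboxParser class, and `html.parser` is swapped in for beautifulsoup4 so the snippet has no third-party dependency.

```python
import mailbox
from html.parser import HTMLParser
from typing import List

# Assumed template for illustration only; see MboxParser for the real default.
DEFAULT_FORMAT = "From: {_from}\nTo: {_to}\nSubject: {_subject}\n\n{_content}"


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops tags (stand-in for beautifulsoup4)."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def strip_html(text: str) -> str:
    extractor = _TextExtractor()
    extractor.feed(text)
    extractor.close()  # flush any text still buffered by the parser
    return "".join(extractor.chunks).strip()


def parse_mbox(
    path: str, max_count: int = 0, message_format: str = DEFAULT_FORMAT
) -> List[str]:
    """Return one formatted string per message; max_count=0 means no limit."""
    results: List[str] = []
    box = mailbox.mbox(path, create=False)
    try:
        for index, msg in enumerate(box):
            if max_count > 0 and index >= max_count:
                break
            content = ""
            # Use the first text part (plain or HTML) as the message body.
            for part in msg.walk():
                if part.get_content_maintype() == "text":
                    payload = part.get_payload(decode=True) or b""
                    charset = part.get_content_charset() or "utf-8"
                    content = payload.decode(charset, errors="replace")
                    break
            results.append(
                message_format.format(
                    _from=msg["from"],
                    _to=msg["to"],
                    _subject=msg["subject"],
                    _content=strip_html(content),
                )
            )
    finally:
        box.close()
    return results
```

Calling `parse_mbox("inbox.mbox", max_count=10)` would stop after ten messages, which is the cheap test-run case described above; passing a different `message_format` changes only how each message is rendered.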

Linting

⚠️ One linting error suppressed: mbox_parser.py:66

error: Argument "factory" to "mbox" has incompatible type "Callable[[BinaryIO, bool], Message]"; expected "Optional[Callable[[IO[Any]], mboxMessage]]"  [arg-type]

Would appreciate advice on whether or not this can be worked around differently.
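
One possible workaround, offered as a sketch rather than a confirmed fix: give the factory the signature the type stub asks for by wrapping the parsed message in `mailbox.mboxMessage`. The names `mbox_message_factory` and `open_mbox` are hypothetical. The trade-off is that messages become `mboxMessage` objects, so any code relying on a different message class or email policy would need adjusting.

```python
import email
import mailbox
from typing import IO, Any


def mbox_message_factory(fp: IO[Any]) -> mailbox.mboxMessage:
    """Parse one message from a binary file object and wrap it as mboxMessage.

    The signature mirrors the expected type from the linting error:
    Callable[[IO[Any]], mboxMessage].
    """
    return mailbox.mboxMessage(email.message_from_binary_file(fp))


def open_mbox(path: str) -> mailbox.mbox:
    # create=False avoids silently creating an empty mailbox on a bad path.
    return mailbox.mbox(path, factory=mbox_message_factory, create=False)
```

If keeping the original factory matters more than the return type, the narrower alternative is to leave the code unchanged and scope the suppression to this one error code with `# type: ignore[arg-type]`.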

Testing

  • Tested in a FastAPI environment using a Gmail .mbox dump.
  • MboxReaderDemo notebook added

Future concerns:

  • May benefit from testing with dumps from other e-mail providers
  • May benefit from further params such as filtering messages by date range
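
As a starting point for the date-range idea, here is a minimal sketch of how such a filter could look. Nothing here exists in the PR: `message_datetime` and `filter_by_date` are hypothetical helpers, and treating dates without a timezone as UTC is an assumption made only to keep comparisons well defined.

```python
from datetime import datetime, timezone
from email.message import Message
from email.utils import parsedate_to_datetime
from typing import Iterable, Iterator, Optional


def message_datetime(msg: Message) -> Optional[datetime]:
    """Return the parsed Date header, or None if it is missing or malformed."""
    raw = msg.get("Date")
    if raw is None:
        return None
    try:
        parsed = parsedate_to_datetime(str(raw))
    except (TypeError, ValueError):
        return None
    if parsed.tzinfo is None:
        # Assumption: treat naive timestamps as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def filter_by_date(
    messages: Iterable[Message], start: datetime, end: datetime
) -> Iterator[Message]:
    """Yield messages with start <= Date < end; both bounds must be tz-aware."""
    for msg in messages:
        sent = message_datetime(msg)
        if sent is not None and start <= sent < end:
            yield msg
```

Messages without a usable Date header are skipped in this sketch; a real parameter would need to decide whether to skip or keep them.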

minosvasilias • Feb 05 '23 13:02

Thanks for the review! Will address comments and provide example screenshots once I'm off my day job. 👍

minosvasilias • Feb 06 '23 10:02

For an example of this working, here are some requests querying an index built from a single Gmail .mbox file with 1000 parsed messages. Some minor info redacted.

[Image: download]

minosvasilias • Feb 06 '23 19:02

thanks @minosvasilias! do you have a screenshot using the reader itself in python code / jupyter notebook? was looking for something like that to give users an example of the usage (i can also get from the jupyter notebook)

jerryjliu • Feb 06 '23 23:02

i'm cutting a release soon, may just land this now

jerryjliu • Feb 06 '23 23:02

Awesome, thank you! 🚀

Here is the requested notebook screenshot: [Image: Screenshot 2023-02-07 at 01 15 37]

minosvasilias • Feb 07 '23 00:02