llama_index Add parser/reader for .mbox files

Adds parser for .mbox files (MboxParser)
- This is useful for indexing dumps of mail inboxes, such as those provided by Gmail (takeout.google.com)
- Can be used as part of SimpleDirectoryReader
- Supports max_count arg, which sets the maximum number of messages to parse per mbox file. Useful for test runs that should not use too many tokens.
- Supports message_format arg, overriding default string formatting for messages. See MboxParser class for default value.
- Dependencies: beautifulsoup4 (to strip messages of HTML syntax)
Adds reader for .mbox files (MboxReader)
- Iterates through a directory reading exclusively .mbox files
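
For readers who want a feel for what the parser does, here is a minimal sketch of the idea using only the standard library. It is not the MboxParser code from this PR: the function name `parse_mbox`, the `DEFAULT_FORMAT` string and its field names are hypothetical, and `html.parser` stands in for beautifulsoup4 so the snippet has no third-party dependency. It only illustrates how max_count and message_format could shape the output.

```python
import mailbox
from html.parser import HTMLParser
from typing import List

# Hypothetical default; the real default value lives in the MboxParser class.
DEFAULT_FORMAT = (
    "Date: {date}\nFrom: {sender}\nTo: {to}\n"
    "Subject: {subject}\nContent: {content}"
)


class _TextExtractor(HTMLParser):
    """Collects text nodes; a stand-in for beautifulsoup4's HTML stripping."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def strip_html(html: str) -> str:
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return "".join(extractor.chunks).strip()


def parse_mbox(
    path: str, max_count: int = 0, message_format: str = DEFAULT_FORMAT
) -> List[str]:
    """Return one formatted string per message; max_count=0 means no limit."""
    results: List[str] = []
    box = mailbox.mbox(path)
    try:
        for msg in box:
            if max_count > 0 and len(results) >= max_count:
                break
            # Use the first text part of a multipart message, else the message itself.
            part = msg
            if msg.is_multipart():
                for candidate in msg.walk():
                    if candidate.get_content_maintype() == "text":
                        part = candidate
                        break
            payload = part.get_payload(decode=True) or b""
            content = strip_html(payload.decode("utf-8", errors="replace"))
            results.append(
                message_format.format(
                    date=msg["date"],
                    sender=msg["from"],
                    to=msg["to"],
                    subject=msg["subject"],
                    content=content,
                )
            )
    finally:
        box.close()
    return results
```

Calling `parse_mbox("inbox.mbox", max_count=10)` on a local file would return at most ten formatted strings, which keeps token usage low during test runs.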

Linting

⚠️ One linting error suppressed: mbox_parser.py:66

error: Argument "factory" to "mbox" has incompatible type "Callable[[BinaryIO, bool], Message]"; expected "Optional[Callable[[IO[Any]], mboxMessage]]"  [arg-type]

Would appreciate advice on whether or not this can be worked around differently.
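
One possible way around the suppression, offered as a sketch rather than a verified fix: skip the factory argument altogether and re-parse each message after mailbox.mbox has yielded it. The helper name `iter_messages` is hypothetical, and this assumes the factory was only being used to obtain messages parsed with the modern email policy. Since nothing is passed as factory, that particular arg-type error has no call site left; the cost is an extra serialize and parse per message.

```python
import mailbox
from email import message_from_bytes, policy
from email.message import Message
from typing import Iterator


def iter_messages(path: str) -> Iterator[Message]:
    """Yield each message in the mbox, re-parsed with the default email policy."""
    box = mailbox.mbox(path)  # no factory argument, so nothing to suppress here
    try:
        for raw in box:
            # raw is a mailbox.mboxMessage; round-trip it through bytes to get
            # a message object built under policy.default.
            yield message_from_bytes(raw.as_bytes(), policy=policy.default)
    finally:
        box.close()
```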

Testing

Tested in a FastAPI environment using a Gmail .mbox dump.
MboxReaderDemo notebook added

Future concerns:

May benefit from testing with dumps from other e-mail providers
May benefit from further params such as filtering messages by date range
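
As a rough idea of what date-range filtering could look like, the sketch below checks a message's Date header against a window. Nothing like this exists in the PR: the function name `in_date_range` and its parameters are hypothetical, both bounds are assumed to be timezone-aware datetimes, and headers without timezone information are assumed to be UTC.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def in_date_range(date_header: Optional[str], start: datetime, end: datetime) -> bool:
    """Return True if the Date header parses and falls within [start, end]."""
    if not date_header:
        return False
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        # Unparseable header; treat the message as out of range.
        return False
    if sent.tzinfo is None:
        # Assumption: headers without timezone information are treated as UTC.
        sent = sent.replace(tzinfo=timezone.utc)
    return start <= sent <= end
```

A parser could call this with `msg["date"]` for each message and skip those that return False.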

Feb 05 '23 13:02 minosvasilias

Thanks for the review! Will address comments and provide example screenshots once I'm off my day job. 👍

Feb 06 '23 10:02 minosvasilias

For an example of this working, here are some requests prompting an index of a single Gmail .mbox file with 1000 parsed messages. Some minor info redacted.

[image: download]

Feb 06 '23 19:02 minosvasilias

thanks @minosvasilias! do you have a screenshot using the reader itself in python code / jupyter notebook? was looking for something like that to give users an example of the usage (i can also get from the jupyter notebook)

Feb 06 '23 23:02 jerryjliu

i'm cutting a release soon, may just land this now

Feb 06 '23 23:02 jerryjliu

Awesome, thank you! 🚀

Here is the requested notebook screenshot: Screenshot 2023-02-07 at 01 15 37

Feb 07 '23 00:02 minosvasilias
