llama_index
Add parser/reader for .mbox files
- Adds parser for `.mbox` files (`MboxParser`)
  - This is useful to index dumps of mail inboxes, such as those provided by Gmail (takeout.google.com)
  - Can be used as part of `SimpleDirectoryReader`
  - Supports `max_count` arg, determining the maximum number of messages to parse per mbox file. Useful for test runs without using too many tokens.
  - Supports `message_format` arg, overriding the default string formatting for messages. See the `MboxParser` class for the default value.
  - Dependencies: `beautifulsoup4` (to strip messages of HTML syntax)
- Adds reader for `.mbox` files (`MboxReader`)
  - Iterates through a directory reading exclusively `.mbox` files
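To illustrate the behavior described in the list above, here is a minimal, standard-library-only sketch. It is not the code from this PR: the names `parse_mbox`, `read_mbox_dir`, `strip_html`, and `DEFAULT_MESSAGE_FORMAT` are hypothetical, the default template is invented for illustration, treating `max_count=0` as "no limit" is an assumption, and `html.parser` stands in for `beautifulsoup4` so the example has no third-party dependencies.

```python
import mailbox
from html.parser import HTMLParser
from pathlib import Path
from typing import List

# Hypothetical default template; the PR's real default lives in MboxParser.
DEFAULT_MESSAGE_FORMAT = (
    "Date: {date}\nFrom: {from_}\nTo: {to}\nSubject: {subject}\nContent: {content}"
)


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops tags (stdlib stand-in for beautifulsoup4)."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def strip_html(text: str) -> str:
    """Remove HTML tags and collapse runs of whitespace."""
    extractor = _TextExtractor()
    extractor.feed(text)
    extractor.close()
    return " ".join(" ".join(extractor.chunks).split())


def _extract_text(msg: mailbox.mboxMessage) -> str:
    """Return the first text/plain or text/html part of a message, decoded."""
    for part in msg.walk():
        if part.get_content_type() in ("text/plain", "text/html"):
            payload = part.get_payload(decode=True)
            if payload:
                charset = part.get_content_charset() or "utf-8"
                try:
                    return payload.decode(charset, errors="replace")
                except LookupError:  # unknown charset name in the message
                    return payload.decode("utf-8", errors="replace")
    return ""


def parse_mbox(path, max_count: int = 0,
               message_format: str = DEFAULT_MESSAGE_FORMAT) -> List[str]:
    """Format up to max_count messages from one mbox file (0 = no limit, assumed)."""
    results: List[str] = []
    box = mailbox.mbox(str(path), create=False)
    try:
        for index, msg in enumerate(box):
            if max_count > 0 and index >= max_count:
                break
            results.append(message_format.format(
                date=str(msg.get("date", "")),
                from_=str(msg.get("from", "")),
                to=str(msg.get("to", "")),
                subject=str(msg.get("subject", "")),
                content=strip_html(_extract_text(msg)),
            ))
    finally:
        box.close()
    return results


def read_mbox_dir(directory, **kwargs) -> List[str]:
    """Parse every .mbox file in a directory, ignoring all other files."""
    docs: List[str] = []
    for path in sorted(Path(directory).iterdir()):
        if path.suffix == ".mbox":
            docs.extend(parse_mbox(path, **kwargs))
    return docs
```

A call such as `read_mbox_dir("dumps", max_count=10, message_format="{subject}: {content}")` (with `"dumps"` as a placeholder directory) would cap each file at ten messages and apply the custom template.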
Linting
⚠️ One linting error suppressed: mbox_parser.py:66
error: Argument "factory" to "mbox" has incompatible type "Callable[[BinaryIO, bool], Message]"; expected "Optional[Callable[[IO[Any]], mboxMessage]]" [arg-type]
Would appreciate advice on whether or not this can be worked around differently.
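One possible way to sidestep the `factory` argument entirely, sketched here under assumptions: let `mailbox.mbox` use its default factory and re-parse each message with `email.message_from_bytes` and `policy.default`. The helper name `iter_policy_messages` is hypothetical, this is not the code at `mbox_parser.py:66`, and it has not been run through mypy, so whether it satisfies the type checker in this repository's configuration is untested.

```python
import mailbox
from email import message_from_bytes, policy
from email.message import EmailMessage
from typing import Iterator


def iter_policy_messages(path: str) -> Iterator[EmailMessage]:
    """Yield policy-aware messages without passing a custom factory to mbox."""
    box = mailbox.mbox(path, create=False)  # default factory yields mboxMessage
    try:
        for raw in box:
            # Re-parse the raw bytes so the modern EmailMessage API
            # (get_body, get_content) is available on each message.
            yield message_from_bytes(raw.as_bytes(), policy=policy.default)
    finally:
        box.close()
```

The trade-off is that each message is serialized and parsed a second time, which costs some extra work per message compared with a custom factory.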
Testing
- Tested in a FastAPI environment using a Gmail `.mbox` dump.
- `MboxReaderDemo` notebook added
Future concerns:
- May benefit from testing with dumps from other e-mail providers
- May benefit from further params such as filtering messages by date range
Thanks for the review! Will address comments and provide example screenshots once i'm off my day job. 👍
For an example of this working, here are some requests prompting an index of a single Gmail `.mbox` file with 1000 parsed messages. Some minor info redacted.
thanks @minosvasilias! do you have a screenshot using the reader itself in python code / jupyter notebook? was looking for something like that to give users an example of the usage (i can also get from the jupyter notebook)
i'm cutting a release soon, may just land this now
Awesome, thank you! 🚀
Here is the requested notebook screenshot: