google-takeout-to-sqlite Add Gmail takeout mbox import (v2)

Open maxhawkins opened this issue 2 years ago • 7 comments

WIP

This PR builds on #5 to continue implementing Gmail import support.

Building on @UtahDave's work, these commits add a few performance and bug fixes:

- Decreased memory overhead for import by manually parsing mbox headers.
- Fixed an error where some messages in the mbox would yield a row with NULL in all columns.
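
For readers who want a feel for the memory fix, here is a minimal sketch of streaming an mbox file one message at a time and parsing only the headers. It is not the code from this PR: the helper names `iter_raw_messages` and `parse_headers` are hypothetical, and it assumes every line starting with `From ` is a message separator (that is, that such lines inside bodies are escaped as `>From `).

```python
import io
from email import policy
from email.parser import BytesParser


def iter_raw_messages(fileobj):
    """Yield the raw bytes of each message from a binary mbox stream.

    Only one message is buffered at a time, so memory use is bounded by
    the largest single message rather than by the whole mailbox.
    Assumption: any line starting with b"From " is a separator line.
    """
    buf = []
    started = False
    for line in fileobj:
        if line.startswith(b"From "):
            if started:
                yield b"".join(buf)
            buf = []
            started = True
            continue  # the separator line itself is not part of the message
        if started:
            buf.append(line)
    if started:
        yield b"".join(buf)


def parse_headers(raw):
    """Parse only the header block; the body is left as an unparsed string."""
    return BytesParser(policy=policy.compat32).parsebytes(raw, headersonly=True)


sample = (
    b"From a@example.com Wed Jul 28 07:07:00 2021\n"
    b"Subject: first\n\nbody one\n\n"
    b"From b@example.com Wed Jul 28 07:08:00 2021\n"
    b"Subject: second\n\nbody two\n"
)
subjects = [
    parse_headers(raw)["Subject"]
    for raw in iter_raw_messages(io.BytesIO(sample))
]
print(subjects)
```

In real use the stream would be a file opened in binary mode, for example `open(path, "rb")`, so the mailbox is never loaded in full.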

I will send more commits to fix any errors I encounter as I run the importer on my personal takeout data.

Jul 28 '21 07:07 maxhawkins

Just added two more fixes:

- Added parsing for RFC 2047 encoded Unicode headers.
- The body is now stored as TEXT rather than BLOB, regardless of the order in which messages are parsed.
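
As an illustration of the RFC 2047 fix, the standard library can decode encoded-word headers such as `=?utf-8?q?...?=`. This is only a sketch of the general technique; the helper name `decode_mime_header` is hypothetical and the PR may do this differently.

```python
from email.header import decode_header, make_header


def decode_mime_header(value):
    """Decode an RFC 2047 encoded header value to a plain str.

    Falls back to the raw value if the declared charset is unknown
    or the bytes do not decode cleanly.
    """
    if value is None:
        return None
    try:
        return str(make_header(decode_header(value)))
    except (LookupError, UnicodeError):
        return value


print(decode_mime_header("=?utf-8?q?Caf=C3=A9?="))  # Café
print(decode_mime_header("Plain subject"))           # Plain subject
```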

I was able to run this on my Takeout export and everything seems to work fine. @simonw let me know if this looks good to merge.

Aug 07 '21 00:08 maxhawkins

I added parsing of text/html emails using BeautifulSoup.

Around half of the emails in my archive don't include a text/plain payload, so adding HTML parsing makes a good chunk of them searchable.
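
The fallback can be sketched roughly as follows: prefer a text/plain part, otherwise strip the tags from a text/html part. The PR uses BeautifulSoup for the HTML step; to keep this example dependency-free it swaps in the standard library's `html.parser` as a stand-in. The names `html_to_text` and `searchable_body` are hypothetical.

```python
from email import message_from_bytes
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def _decode_part(part):
    """Decode a leaf MIME part to str, or return None if it has no payload."""
    payload = part.get_payload(decode=True)
    if payload is None:
        return None
    charset = part.get_content_charset() or "utf-8"
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:  # unknown charset name
        return payload.decode("utf-8", errors="replace")


def searchable_body(msg):
    """Prefer text/plain; otherwise fall back to text extracted from HTML."""
    html = None
    for part in msg.walk():
        if part.is_multipart():
            continue
        ctype = part.get_content_type()
        if ctype == "text/plain":
            text = _decode_part(part)
            if text is not None:
                return text
        elif ctype == "text/html" and html is None:
            html = _decode_part(part)
    return html_to_text(html) if html is not None else None


raw = (
    b"Content-Type: text/html; charset=utf-8\n\n"
    b"<html><style>p{}</style><p>Hello <b>world</b></p></html>"
)
print(searchable_body(message_from_bytes(raw)))  # Hello world
```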

Aug 10 '21 23:08 maxhawkins

@maxhawkins how hard would it be to add an entry to the table that includes the HTML version of the email, if it exists? I just tried your PR branch on a very small mbox file, and it worked great. My use case is a research project, and I need access to more than just the plain-text body.

Dec 29 '21 18:12 Btibert3

> @maxhawkins how hard would it be to add an entry to the table that includes the HTML version of the email, if it exists? I just tried your PR branch on a very small mbox file, and it worked great. My use case is a research project, and I need access to more than just the plain-text body.

Shouldn't be hard. The easiest way is probably to remove the `if body.content_type == "text/html"` clause from `utils.py:254` and just return content directly without parsing.
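
An alternative that keeps both versions is to store the raw HTML in its own column next to the text body. The following is a rough sketch of that idea using only the standard library, not the code in this branch: the `emails` table, the `body_html` column, and the helpers `extract_html` and `store_message` are all hypothetical names.

```python
import sqlite3
from email import message_from_bytes


def extract_html(msg):
    """Return the raw, unparsed HTML of the first text/html part, or None."""
    for part in msg.walk():
        if part.get_content_type() != "text/html":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            return payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset name
            return payload.decode("utf-8", errors="replace")
    return None


def store_message(conn, msg):
    """Insert one row holding the message id and the raw HTML (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS emails (message_id TEXT, body_html TEXT)"
    )
    conn.execute(
        "INSERT INTO emails (message_id, body_html) VALUES (?, ?)",
        (msg["Message-Id"], extract_html(msg)),
    )


raw = (
    b"Message-Id: <1@example.com>\n"
    b"Content-Type: text/html; charset=utf-8\n\n"
    b"<p>Hi</p>"
)
conn = sqlite3.connect(":memory:")
store_message(conn, message_from_bytes(raw))
row = conn.execute("SELECT message_id, body_html FROM emails").fetchone()
print(row)
```

Messages without an HTML part would simply get NULL in the HTML column.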

Dec 31 '21 19:12 maxhawkins
