google-takeout-to-sqlite
Add Gmail takeout mbox import (v2)
WIP
This PR builds on #5 to continue implementing gmail import support.
Building on @UtahDave's work, these commits add a few performance and bug fixes:
- Decreased memory overhead for import by manually parsing mbox headers.
- Fixed error where some messages in the mbox would yield a row with NULL in all columns.
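The memory saving comes from streaming the mbox rather than loading it whole. A minimal sketch of that idea, assuming the standard mbox convention that each message begins with a `From ` separator line (the function name is illustrative, not the PR's actual code, and it skips `>From ` unescaping):

```python
import email


def iter_mbox_messages(path):
    """Yield parsed messages one at a time from an mbox file.

    Illustrative sketch only: buffers a single message at a time instead
    of indexing the whole file, and ignores ">From " quote-unescaping.
    """
    buf = []
    with open(path, "rb") as f:
        for line in f:
            if line.startswith(b"From "):  # mbox message separator
                if buf:
                    yield email.message_from_bytes(b"".join(buf))
                buf = []
            else:
                buf.append(line)
    if buf:  # final message has no trailing separator
        yield email.message_from_bytes(b"".join(buf))
```

Peak memory is then bounded by the largest single message rather than the whole archive.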
I will send more commits to fix any errors I encounter as I run the importer on my personal takeout data.
Just added two more fixes:
- Added parsing for RFC 2047-encoded Unicode headers.
- The body is now stored as TEXT rather than a BLOB, regardless of the order in which messages are parsed.
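RFC 2047 headers pack non-ASCII text into ASCII-safe "encoded words" like `=?utf-8?q?...?=`. The stdlib can decode these; a small sketch (helper name is mine, not the PR's):

```python
from email.header import decode_header


def decode_rfc2047(value):
    """Decode an RFC 2047 header value into a Unicode string.

    decode_header() splits the value into (data, charset) chunks;
    encoded words come back as bytes, plain chunks as str.
    """
    parts = []
    for data, charset in decode_header(value):
        if isinstance(data, bytes):
            parts.append(data.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(data)
    return "".join(parts)
```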
I was able to run this on my Takeout export and everything seems to work fine. @simonw let me know if this looks good to merge.
I added parsing of text/html emails using BeautifulSoup.
Around half of the emails in my archive don't include a text/plain payload so adding html parsing makes a good chunk of them searchable.
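The extraction can be sketched roughly as follows, assuming BeautifulSoup with the stdlib `html.parser` backend (function name and details are illustrative, not the PR's exact code):

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Extract searchable plain text from an HTML email body."""
    soup = BeautifulSoup(html, "html.parser")
    # Script and style contents aren't meaningful for search.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Collapse whitespace so the stored text is compact.
    return " ".join(soup.get_text(separator=" ").split())
```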
@maxhawkins how hard would it be to add an entry to the table that includes the HTML version of the email, if it exists? I just attempted the PR branch on a very small mbox file, and it worked great. My use case is a research project and I need to access more than just the body plain text.
Shouldn't be hard. The easiest way is probably to remove the `if body.content_type == "text/html"` clause from utils.py:254 and just return the content directly without parsing.
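For keeping both versions rather than replacing one with the other, a variant could pull the plain-text and raw HTML payloads into separate values and store each in its own column. A hypothetical helper sketching that idea (the real code in utils.py differs):

```python
from email.message import Message


def get_bodies(message: Message):
    """Return (plain_text, raw_html) from an email message.

    Hypothetical sketch: takes the first text/plain and first text/html
    part found, decoding each with its declared charset.
    """
    text, html = None, None
    for part in message.walk():
        ctype = part.get_content_type()
        charset = part.get_content_charset() or "utf-8"
        if ctype == "text/plain" and text is None:
            text = part.get_payload(decode=True).decode(charset, errors="replace")
        elif ctype == "text/html" and html is None:
            html = part.get_payload(decode=True).decode(charset, errors="replace")
    return text, html
```

Either value may be `None` if the message lacks that MIME part, so the table columns would need to allow NULL.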