google-takeout-to-sqlite

Add Gmail takeout mbox import (v2)

Open maxhawkins opened this issue 2 years ago • 7 comments

WIP

This PR builds on #5 to continue implementing gmail import support.

Building on @UtahDave's work, these commits add a few performance and bug fixes:

  • Decreased memory overhead for import by manually parsing mbox headers.
  • Fixed error where some messages in the mbox would yield a row with NULL in all columns.
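For readers following along, here is a minimal sketch of the first idea. It is not the code from this PR. It streams an mbox file line by line and parses only each message's header block, so a whole message never has to sit in memory. The function name `iter_mbox_headers` is hypothetical, and the sketch assumes that body lines beginning with "From " have been escaped by the exporter (for example as ">From "), so an unescaped "From " line always marks a new message.

```python
import io
from email.parser import BytesHeaderParser


def iter_mbox_headers(fileobj):
    """Yield a header-only Message for each entry in a binary mbox stream.

    Assumption: every message begins with a separator line starting with
    b"From " and body lines that start that way have been escaped.
    """
    parser = BytesHeaderParser()
    header_lines = []
    in_headers = False
    for line in fileobj:
        if line.startswith(b"From "):
            # Separator line: flush a pending header block that had no body.
            if in_headers and header_lines:
                yield parser.parsebytes(b"".join(header_lines))
            header_lines = []
            in_headers = True
        elif in_headers:
            if line.strip() == b"":
                # A blank line ends the header block; body lines are skipped.
                yield parser.parsebytes(b"".join(header_lines))
                header_lines = []
                in_headers = False
            else:
                header_lines.append(line)
    # Flush a trailing header block if the file ended without a blank line.
    if in_headers and header_lines:
        yield parser.parsebytes(b"".join(header_lines))


sample = (
    b"From a@example.com Mon Jan  1 00:00:00 2021\n"
    b"Subject: one\n"
    b"From: a@example.com\n"
    b"\n"
    b"body one\n"
    b"\n"
    b"From b@example.com Mon Jan  1 00:00:00 2021\n"
    b"Subject: two\n"
    b"\n"
    b"body two\n"
)
subjects = [m["Subject"] for m in iter_mbox_headers(io.BytesIO(sample))]
```

Because the generator only ever buffers one header block, memory use stays roughly constant regardless of how large the attachments in the archive are.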

I will send more commits to fix any errors I encounter as I run the importer on my personal takeout data.

maxhawkins avatar Jul 28 '21 07:07 maxhawkins

Just added two more fixes:

  • Added parsing for RFC 2047 encoded Unicode headers
  • Body is now stored as TEXT rather than a BLOB regardless of what order the messages are parsed in.
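To illustrate both points, here is a small sketch using only the standard library. It is not the PR's implementation, and the helper names `decode_mime_header`, `body_to_text`, and `stored_type` are hypothetical. It shows decoding RFC 2047 encoded-words with `email.header`, and shows that SQLite stores a Python `bytes` value as a BLOB even in a TEXT column, which is why a body has to be decoded to `str` before insertion.

```python
import sqlite3
from email.header import decode_header, make_header


def decode_mime_header(raw_value):
    """Decode RFC 2047 encoded-words (=?charset?q?...?=) into a plain str."""
    return str(make_header(decode_header(raw_value)))


def body_to_text(payload, charset="utf-8"):
    """Decode a bytes payload so it is bound to SQLite as TEXT, not BLOB."""
    if isinstance(payload, bytes):
        return payload.decode(charset, errors="replace")
    return payload


def stored_type(value):
    """Return the storage class SQLite used for a value in a TEXT column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (body TEXT)")
    conn.execute("INSERT INTO t VALUES (?)", (value,))
    (kind,) = conn.execute("SELECT typeof(body) FROM t").fetchone()
    conn.close()
    return kind


subject = decode_mime_header("=?utf-8?q?Caf=C3=A9?=")
raw_kind = stored_type(b"hello")
text_kind = stored_type(body_to_text(b"hello"))
```

If some rows are inserted as `bytes` and others as `str`, the same column ends up with mixed storage classes, which is one way the stored type could depend on parsing order.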

I was able to run this on my Takeout export and everything seems to work fine. @simonw let me know if this looks good to merge.

maxhawkins avatar Aug 07 '21 00:08 maxhawkins

I added parsing of text/html emails using BeautifulSoup.

Around half of the emails in my archive don't include a text/plain payload, so adding HTML parsing makes a good chunk of them searchable.
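As a rough illustration of what extracting searchable text from an HTML body involves, the sketch below uses the standard library's `html.parser` as a stand-in for BeautifulSoup, so it runs without extra dependencies. It is not the code in this PR, and `html_to_text` is a hypothetical name. With BeautifulSoup, a roughly comparable call would be `BeautifulSoup(html, "html.parser").get_text()`.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    # Join fragments with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(extractor.parts).split())


text = html_to_text(
    "<html><head><style>p{}</style></head>"
    "<body><p>Hello <b>world</b></p><p>Bye&amp;co</p></body></html>"
)
```

Joining fragments with a space is a simplification: a word split across inline tags would gain a space, which a full-text index may or may not tolerate.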

maxhawkins avatar Aug 10 '21 23:08 maxhawkins

@maxhawkins how hard would it be to add an entry to the table that includes the HTML version of the email, if it exists? I just tried your PR branch on a very small mbox file, and it worked great. My use case is a research project and I need to access more than just the plain-text body.

Btibert3 avatar Dec 29 '21 18:12 Btibert3

> @maxhawkins how hard would it be to add an entry to the table that includes the HTML version of the email, if it exists? I just tried your PR branch on a very small mbox file, and it worked great. My use case is a research project and I need to access more than just the plain-text body.

Shouldn't be hard. The easiest way is probably to remove the `if body.content_type == "text/html"` clause from `utils.py:254` and just return content directly without parsing.
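An alternative that keeps both versions could look something like the sketch below. It does not reflect the actual contents of `utils.py`; it is a standalone illustration built on the standard `email` package, and the function name `extract_bodies` is hypothetical. It returns the first text/plain and first text/html part of a message so that each could be written to its own column.

```python
from email import message_from_bytes


def extract_bodies(raw_message):
    """Return the first text/plain and text/html bodies of a raw message.

    Missing versions are returned as None. Attachments are skipped.
    """
    msg = message_from_bytes(raw_message)
    bodies = {"text/plain": None, "text/html": None}
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype not in bodies or bodies[ctype] is not None:
            continue
        if part.get_content_disposition() == "attachment":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            bodies[ctype] = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to UTF-8 with replacement.
            bodies[ctype] = payload.decode("utf-8", errors="replace")
    return bodies


raw = (
    b"MIME-Version: 1.0\n"
    b'Content-Type: multipart/alternative; boundary="b"\n'
    b"\n"
    b"--b\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"\n"
    b"hello\n"
    b"--b\n"
    b"Content-Type: text/html; charset=utf-8\n"
    b"\n"
    b"<p>hello</p>\n"
    b"--b--\n"
)
bodies = extract_bodies(raw)
```

Storing the raw HTML alongside the extracted text keeps the searchable column small while leaving the original markup available for analysis.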

maxhawkins avatar Dec 31 '21 19:12 maxhawkins