email-parser icon indicating copy to clipboard operation
email-parser copied to clipboard

Fixed a lot of parsing errors

Open terhechte opened this issue 4 years ago • 2 comments

Hey! I recently ran email-parser on a batch of ~650.000 emails from 2004 - 2021. Initially roughly half of those emails were marked as invalid. I slowly went through the issues and fixed one after the other in order to be able to parse as many emails a possible. All the changes can be found in this PR. For me, it was important that I'm able to parse the majority of my mails. With these changes, all mails except for ~100 are parsed properly (and those are broken beyond repair, I had a brief look). However, I'm not sure if all of my changes are sensible additions. I added one commit for each fix so that you can easily figure out which ones are interesting. I also added tests (in the last commit) that involve most of the fixed issues.

Finally, I've added two feature flags in order to improve parsing:

  • allow-duplicate-headers: Gmail seems to add multiple To fields if forwarding from one Gmail account to another. Similarly, I had many emails with multiple Message-ID fields and so on. I've encapsulated this into a flag. If active, multiple headers are allowed, and in case the value is a list of items (e.g. Reply-To) they're all merged together. If inactive, the previous behavior of not allowing multiple headers stays as before.
  • decode-mime-body: I do need mime support for the parsing of Subject fields, but I'm not interested in bodies. Given that parsing of bodies might be expensive, I added a flag to specifically disable the parsing of bodies if mime is active.

I also added one dependency. This could also be made into a feature flag I guess. This library converts timezone abbreviations (such as GMT) to proper timezone information. I build that yesterday for the specific purpose of parsing timezone information in email-parser.

Cheers & Thanks for this nice library!

terhechte avatar Sep 29 '21 14:09 terhechte

Hey! Thank you very much for this PR! Your library crate for timezones is nice! I'm quite surprised by the share of invalid emails in your dataset. Is this a public dataset I could download somewhere? I made some review comments mostly about cargo features. The thing is the performance of parsing valid emails should not be affected by the workarounds required by non-compliant emails unless the compatibility-fixes is explicitly enabled.

Ah no, these are my personal emails. I did a gmail download with the intention of generating statistics on them. I can probably dig out a couple of the worst offenders and remove some personal information from them (e.g. the body) to share.

Somehow I missed the compatibility-fixes feature flag. I suppose it would be nice to have a brief explanation of all the existing feature flags. I also wasn't sure what keywords and trace do.

Regarding the PermissiveEmail I see the appeal of that. I'll see how much time I can invest to implement this. Right now the parser works good enough for the problem I'm trying to solve so I'll probably focus on that first and afterwards look into this PR again :)

terhechte avatar Oct 01 '21 06:10 terhechte

Somehow I missed the compatibility-fixes feature flag. I suppose it would be nice to have a brief explanation of all the existing feature flags. I also wasn't sure what keywords and trace do.

That would indeed be nice. keywords enables parsing for the keywords header and trace was supposed to parse all the trace-related headers. Unfortunately, the RFC is insanely vague and wrong about these headers. So I have no idea how I am supposed to implement them.

Regarding the PermissiveEmail I see the appeal of that. I'll see how much time I can invest to implement this. Right now the parser works good enough for the problem I'm trying to solve so I'll probably focus on that first and afterwards look into this PR again :)

I will do it if you can't :)

Mubelotix avatar Oct 02 '21 09:10 Mubelotix