Scrape email content

Open pghosh opened this issue 7 years ago • 4 comments

Background

Action sites send emails to subscribers with action and event information. These emails come in different formats. This task is to identify the best way to scrape the email body and save it as text so that further analysis can be done.

Acceptance Criteria

Make a call to Scraper.scrape for a given email and save content as raw text.

pghosh, Mar 08 '17 04:03

Design thoughts

Scraper.scrape is the entry point for all scraping. Depending on the type of email (e.g. simple HTML vs. pictures vs. plain text), we can have multiple method definitions if required, plus a delegator that routes each email to the right one.
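A rough sketch of what that delegator could look like, using only the Python standard library. Everything except the name `Scraper.scrape` is an assumption for illustration: the language, the helper names (`_from_plain`, `_from_html`, `_TextExtractor`), and the choice to prefer the plain-text part when both exist. Picture-only emails are not handled here.

```python
from email import message_from_string, policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


class Scraper:
    @staticmethod
    def scrape(raw_email: str) -> str:
        """Entry point: parse the email and delegate on the body's content type."""
        msg = message_from_string(raw_email, policy=policy.default)
        # Assumption: prefer the plain-text part, fall back to HTML.
        part = msg.get_body(preferencelist=("plain", "html"))
        if part is None:
            return ""
        handlers = {
            "text/plain": Scraper._from_plain,
            "text/html": Scraper._from_html,
        }
        handler = handlers.get(part.get_content_type())
        return handler(part.get_content()) if handler else ""

    @staticmethod
    def _from_plain(text: str) -> str:
        return text.strip()

    @staticmethod
    def _from_html(markup: str) -> str:
        parser = _TextExtractor()
        parser.feed(markup)
        parser.close()  # flush any buffered text
        return "\n".join(parser.chunks)
```

New formats would then only need a new entry in `handlers`, and the caller keeps calling `Scraper.scrape` and saving the returned string as raw text.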

pghosh, Mar 09 '17 17:03

I'm currently working on this. I'll probably have a PR soon-ish for small stuff, but the more involved bits might have to wait until next weekend. If someone else wants to knock it out in the meantime, more power to you. :)

eenblam, Mar 19 '17 04:03

What are our example use cases for handling these differently? Anything less straightforward than something like this?

  1. Identify attachments
  2. Write attachment to disk
  3. Persist email_body, email_header, (attachment_path, attachment_headers); attachments flagged as unverified until validation succeeds
  4. Push attachments onto queue for security validation; external service?

I'm glossing over the actual parsing; I just want to make sure I'm not overlooking something.
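To make the four steps concrete, here is a minimal sketch with the standard library. The names `AttachmentRecord` and `store_attachments` are made up for illustration, an in-process `queue.Queue` stands in for whatever validation queue or external service we end up with, and persistence of the body and headers (step 3) is reduced to returning records.

```python
import hashlib
import queue
from dataclasses import dataclass
from email import message_from_bytes, policy
from pathlib import Path


@dataclass
class AttachmentRecord:
    """Hypothetical record for one stored attachment."""
    path: Path
    headers: dict
    verified: bool = False  # stays False until validation succeeds


def store_attachments(raw_email: bytes, out_dir: Path,
                      validation_queue: queue.Queue) -> list:
    msg = message_from_bytes(raw_email, policy=policy.default)
    out_dir.mkdir(parents=True, exist_ok=True)
    records = []
    # Step 1: identify attachments.
    for part in msg.iter_attachments():
        payload = part.get_payload(decode=True) or b""
        # Step 2: write to disk under a content hash rather than the
        # sender-supplied filename, which should not be trusted as a path.
        path = out_dir / hashlib.sha256(payload).hexdigest()
        path.write_bytes(payload)
        # Step 3: keep the path and headers, flagged as unverified.
        record = AttachmentRecord(
            path=path,
            headers={name: str(value) for name, value in part.items()},
        )
        records.append(record)
        # Step 4: hand the record to the validation queue.
        validation_queue.put(record)
    return records
```

Nothing here validates the attachment contents; it only makes sure nothing is marked verified before a separate validation step has looked at it.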

Regarding number 4: what are our plans to ensure safe handling and storage of attachments? To avoid forwarding malicious attachments? I'm always happy to learn more about security, but my appsec-fu is weak here, and I'd rather not accidentally forward FinSpy to a bunch of activists.

Similarly, we need to escape JS embedded in the email body. This should be implemented sooner rather than later.
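As a starting point, and assuming the stored body is later rendered into an HTML page, escaping at output time with the standard library's `html.escape` would look something like this. The function name `escape_body_for_display` is hypothetical, and this covers only the HTML-text context, not attributes built from untrusted URLs or other injection paths.

```python
import html


def escape_body_for_display(body: str) -> str:
    """Escape &, <, > and quotes so embedded markup renders as inert text."""
    return html.escape(body, quote=True)
```

With this, a body containing `<script>` comes out as `&lt;script&gt;` and is displayed rather than executed.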

https://zeltser.com/analyzing-malicious-documents/

eenblam, Mar 19 '17 19:03