
Weird/broken emails

Open • kajto3 opened this issue 4 years ago • 1 comment

Sometimes the scraper returns weird-looking, broken emails like these:

  • &ssf=f2602a32dcc003e106302a076e00c549c516a88f&ssg=81bbd28f-fc86-d194-9e27-3378752fe5b6&ssh=e0058443-aa6a-d194-9e28-defdea71a2bf&ssi=126f6acb-89ae-d194-927f-c69af1fdd7a6&ssj=99e89daf-a030-d194-9e28-85fb595c317d&[email protected]
  • //lists.wikimedia.org/hyperkitty/list/[email protected]
  • //lists.wikimedia.org/hyperkitty/list/[email protected]
  • //[email protected]

Unfortunately, I can't provide the exact links these emails were scraped from, because I'm using tons of links (scraped from Google), but I think most of them come from Wikipedia.

kajto3 avatar Aug 22 '21 21:08 kajto3

As far as I can tell, those are technically valid email addresses, though it does seem like we need to limit the local part to 64 bytes. In the meantime, you may want to filter the emails after you get them from the library. Or maybe open a PR for "common email addresses" that adds a flag to disallow slashes and other uncommon symbols.
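
A minimal sketch of that kind of post-filtering, assuming the scraped results are available as plain strings. The names `is_common_email` and `filter_common_emails`, and the exact set of allowed characters, are illustrative assumptions and not part of the library:

```python
import re

# Assumption: a "common" local part uses only letters, digits, and . _ % + -
# This is deliberately stricter than what the email RFCs allow.
_COMMON_LOCAL = re.compile(r"[A-Za-z0-9._%+-]+")

# The local part (the text before the "@") is limited to 64 bytes.
MAX_LOCAL_BYTES = 64


def is_common_email(address: str) -> bool:
    """Return True if the address looks like an everyday email address."""
    # Split on the last "@" so that any earlier text stays in the local part.
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return False
    if len(local.encode("utf-8")) > MAX_LOCAL_BYTES:
        return False
    # Rejects slashes, "&", "=", and other uncommon symbols.
    return _COMMON_LOCAL.fullmatch(local) is not None


def filter_common_emails(addresses):
    """Keep only the addresses that pass is_common_email (hypothetical helper)."""
    return {a for a in addresses if is_common_email(a)}
```

Passing the scraper's output through `filter_common_emails` would drop results like the URL fragments and query strings above, at the cost of also rejecting some unusual but valid addresses.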

kichik avatar Aug 23 '21 02:08 kichik