email-scraper
Weird/broken emails
Sometimes the scraper returns weird-looking, broken emails like these:
- &ssf=f2602a32dcc003e106302a076e00c549c516a88f&ssg=81bbd28f-fc86-d194-9e27-3378752fe5b6&ssh=e0058443-aa6a-d194-9e28-defdea71a2bf&ssi=126f6acb-89ae-d194-927f-c69af1fdd7a6&ssj=99e89daf-a030-d194-9e28-85fb595c317d&[email protected]
- //lists.wikimedia.org/hyperkitty/list/[email protected]
- //lists.wikimedia.org/hyperkitty/list/[email protected]
- //[email protected]
Unfortunately, I can't provide the exact links these emails were scraped from, because I'm scraping a large number of links taken from Google results, but I think most of them come from Wikipedia.
As far as I can tell, those are technically valid email addresses, though it does seem like we need to limit the local part to 64 bytes. In the meantime, you may want to filter the emails after you get them from the library. Alternatively, you could open a PR that adds a "common email addresses" flag to disallow slashes and other uncommon symbols.
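A minimal sketch of that kind of post-filtering, assuming the scraped results are available as plain strings. The names `COMMON_EMAIL`, `is_common_email` and `filter_common_emails` are hypothetical and not part of the library, and the allowed character set is only one guess at what "common" should mean:

```python
import re

# Assumption: a "common" address uses only letters, digits and . _ % + -
# in the local part, and a dotted hostname as the domain.
COMMON_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# Local-part length limit mentioned above.
MAX_LOCAL_BYTES = 64


def is_common_email(address: str) -> bool:
    """Return True if the whole string looks like an everyday email address."""
    if COMMON_EMAIL.fullmatch(address) is None:
        # Rejects slashes, ampersands, equals signs and other unusual symbols.
        return False
    local, _, _ = address.rpartition("@")
    return len(local.encode("utf-8")) <= MAX_LOCAL_BYTES


def filter_common_emails(addresses):
    """Keep only the addresses that pass is_common_email, preserving order."""
    return [a for a in addresses if is_common_email(a)]
```

Calling `filter_common_emails` on the scraper's output would drop results that start with `//` or carry query-string fragments, at the cost of also rejecting some unusual but valid addresses.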