bigbang
more data cleaning for 'full archive' email domain study
The email domain study has given us a comprehensive view of organizational participation in IETF working groups, but the underlying data is still messy.
Some steps to take:
- [ ] remove admin domains: ietf.org, iana.org, etc.
- [ ] isolate top contributors from generic email domains like gmx.de, gmail.com, hotmail
- [ ] make sure emails are normalized with respect to case before analysis
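A minimal sketch of how the three steps above could fit together, in plain Python. The names `ADMIN_DOMAINS`, `GENERIC_DOMAINS`, `normalize_domain`, and `count_domains` are hypothetical and not part of BigBang; the domain sets contain only the examples named in this issue (with `hotmail` assumed to mean `hotmail.com`) and would need to be extended.

```python
from collections import Counter

# Hypothetical lists for illustration; only the examples from this issue.
ADMIN_DOMAINS = {"ietf.org", "iana.org"}
# Assumption: "hotmail" in the issue refers to hotmail.com.
GENERIC_DOMAINS = {"gmx.de", "gmail.com", "hotmail.com"}


def normalize_domain(address):
    """Return the lower-cased domain of an address, or None if it has none."""
    address = address.strip().lower()
    if "@" not in address:
        return None
    domain = address.rsplit("@", 1)[1]
    return domain or None


def count_domains(addresses):
    """Count messages per organizational domain and per generic-domain sender.

    Admin domains are dropped. Senders on generic domains are counted by
    full (lower-cased) address, since the domain says nothing about their
    organization.
    """
    org_counts = Counter()
    generic_counts = Counter()
    for address in addresses:
        domain = normalize_domain(address)
        if domain is None or domain in ADMIN_DOMAINS:
            continue
        if domain in GENERIC_DOMAINS:
            generic_counts[address.strip().lower()] += 1
        else:
            org_counts[domain] += 1
    return org_counts, generic_counts
```

Keeping generic-domain senders in a separate counter, keyed by full address, leaves room to inspect the top individual contributors later without mixing them into the per-organization totals.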
See #509: there should be a supported dataset of domain metadata in the repository. This metadata is currently embedded in a couple of notebooks but can be pulled out.