massivetext topic
massivetext repositories
c4-dataset-script
Stars: 115
Forks: 13
Inspired by Google's C4 (Colossal Clean Crawled Corpus), this is a series of data cleaning scripts for processing CommonCrawl data. It also includes the Chinese data processing and cleaning methods described in MassiveText.