massivetext topic

List massivetext repositories

c4-dataset-script

115 Stars, 13 Forks

Inspired by Google's C4 (Colossal Clean Crawled Corpus), this is a series of data cleaning scripts for processing CommonCrawl data. It also includes the Chinese data processing and cleaning methods from MassiveText.
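
To give a feel for what C4-style cleaning involves, here is a minimal sketch of page-level heuristics of the kind described for C4: keep only lines that end in terminal punctuation, drop very short lines and lines mentioning JavaScript, and discard pages containing placeholder text or curly braces. This is not the code from c4-dataset-script; the function name `clean_page` and the thresholds `min_words_per_line` and `min_lines` are hypothetical and chosen for illustration only. Chinese text would need different punctuation marks and word segmentation, which this sketch does not attempt.

```python
# Illustrative sketch only: not taken from c4-dataset-script.
# Names and thresholds below are assumptions made for this example.

TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def clean_page(text, min_words_per_line=3, min_lines=2):
    """Return the cleaned page text, or None if the page is discarded."""
    # Discard whole pages with placeholder text or code-like content.
    if "lorem ipsum" in text.lower() or "{" in text:
        return None

    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only lines that look like complete sentences.
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue
        # Drop lines with too few whitespace-separated words.
        if len(line.split()) < min_words_per_line:
            continue
        # Drop "please enable JavaScript" style warnings.
        if "javascript" in line.lower():
            continue
        kept.append(line)

    # Discard pages with too little surviving text.
    if len(kept) < min_lines:
        return None
    return "\n".join(kept)


if __name__ == "__main__":
    page = (
        "Home | About | Contact\n"
        "This is a complete sentence about data.\n"
        "Please enable JavaScript to continue.\n"
        "Another full sentence ends here.\n"
        "Short."
    )
    print(clean_page(page))
```

On the sample page, the navigation line, the JavaScript warning, and the one-word line are dropped, and the two full sentences are kept.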