CC-news reproduction
Dear authors,
I want to use the CC-news dataset to train my model.
I am currently using https://github.com/fhamborg/news-please to build the CC-news corpus from Common Crawl.
However, I am not sure whether directly running the following command is the right way to obtain CC-news:
python3 -m newsplease.examples.commoncrawl
Questions:
- Did you use the same tool? Did you add any extra filtering rules?
- Since you can't release the CC-news corpus, could you tell me how I can collect the CC-news corpus myself?