
CC-news reproduction

Open sjy1203 opened this issue 4 years ago • 0 comments

Dear authors,

I want to use the CC-news dataset to train my model.

I am currently using https://github.com/fhamborg/news-please to build the CC-news corpus from Common Crawl (CC).

However, I am not sure whether directly running the following command is the right way to obtain CC-news:

python3 -m newsplease.examples.commoncrawl

Questions:

  • Did you use the same tool? Did you add any extra filtering rules?
  • Since you can't release the CC-news corpus, could you tell me how to collect it myself?
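
To make the first question concrete, here is a minimal sketch of the kind of extra filtering I have in mind. It is only an illustration: the function names, the thresholds, and the assumption that each article is a dict with "maintext" and "language" keys are all hypothetical, and none of this is taken from your pipeline.

```python
from typing import Iterable, Iterator, Optional


def keep_article(article: dict, min_chars: int = 200,
                 language: Optional[str] = "en") -> bool:
    """Return True if an article passes the (hypothetical) filters."""
    # Assumed field name: "maintext" holds the extracted article body.
    text = (article.get("maintext") or "").strip()
    if len(text) < min_chars:
        return False
    # Assumed field name: "language" holds a language code such as "en".
    if language is not None and article.get("language") != language:
        return False
    return True


def filter_articles(articles: Iterable[dict], **kwargs) -> Iterator[dict]:
    """Yield articles that pass keep_article, dropping exact duplicate bodies."""
    seen = set()
    for article in articles:
        if not keep_article(article, **kwargs):
            continue
        body = (article.get("maintext") or "").strip()
        if body in seen:
            continue
        seen.add(body)
        yield article
```

If you applied rules like these (minimum length, language, deduplication) or anything else, knowing them would help me get closer to your corpus.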

sjy1203, Aug 26 '20 07:08