Are IndicCorpus and the OSCAR corpus the same?

Open StephennFernandes opened this issue 3 years ago • 3 comments

Hey there, do IndicCorpus and the OSCAR corpus come from the same source, i.e. CommonCrawl? I have been thinking of combining OSCAR + IndicCorpus (with deduplication) to get a better and bigger corpus. I just wanted to confirm whether IndicCorpus and OSCAR are the same corpus at the source or not.
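
For illustration, here is a minimal sketch of what exact line-level deduplication could look like when merging the two corpora. It assumes each corpus is available as an iterable of sentences, one per line; the function names (`normalize`, `merge_dedup`) and the toy data are hypothetical and not part of either corpus's tooling. It only catches exact matches after whitespace and Unicode normalization, not near-duplicates.

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator


def normalize(line: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", line).split())


def merge_dedup(*sources: Iterable[str]) -> Iterator[str]:
    """Yield each distinct normalized line once, in first-seen order."""
    seen = set()  # stores 20-byte SHA-1 digests instead of full lines
    for source in sources:
        for line in source:
            text = normalize(line)
            if not text:
                continue  # skip empty lines
            digest = hashlib.sha1(text.encode("utf-8")).digest()
            if digest not in seen:
                seen.add(digest)
                yield text


# Toy stand-ins for the two corpora (hypothetical data).
indic_lines = ["नमस्ते दुनिया", "यह एक वाक्य है"]
oscar_lines = ["यह  एक वाक्य है ", "एक नया वाक्य"]

merged = list(merge_dedup(indic_lines, oscar_lines))
```

Storing digests rather than the lines themselves keeps memory use lower on large corpora, although the set still grows with the number of distinct lines.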

StephennFernandes, Apr 29 '22 11:04

The version of IndicCorpus does not contain OSCAR. However, the newer version, available at https://indicnlp.ai4bharat.org/corpora/, contains OSCAR as a subset.

anoopkunchukuttan, Apr 29 '22 12:04

@anoopkunchukuttan How do I find the previous version of the corpus (the one that doesn't contain OSCAR)? By the way, the OSCAR corpus is generated from the Common Crawl corpus. When you say that IndicCorpus does not contain OSCAR, does that mean IndicCorpus does not contain any content from Common Crawl?

StephennFernandes, Apr 29 '22 12:04

@anoopkunchukuttan Hello Sir, just a follow-up on the previous question:

  • Does the corpus contain Wikipedia content?
  • Is there a way I could get the previous version of IndicCorpus that doesn't contain the OSCAR corpus?
  • Is there a way I could get the corpus in unshuffled format?

I ask because I would be adding content from the OSCAR corpus separately.

StephennFernandes, May 04 '22 09:05