indicnlp_corpus
Are IndicCorpus and the OSCAR corpus the same?
Hey there, do IndicCorpus and the OSCAR corpus come from the same source, i.e. CommonCrawl? I have been thinking of combining OSCAR + IndicCorpus (with deduplication) to get a better and bigger corpus. I just wanted to confirm whether IndicCorpus and OSCAR are the same corpus at source or not.
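To make the plan concrete, here is a minimal sketch of the kind of exact, line-level deduplication I have in mind. It assumes both corpora are plain text with one sentence per line. `normalize` and `dedup_lines` are hypothetical helper names, not part of any IndicCorpus or OSCAR tooling, and this only catches exact matches after normalization (near-duplicates would need something like MinHash instead).

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator


def normalize(line: str) -> str:
    """NFC-normalize and collapse whitespace so trivially different copies match."""
    return " ".join(unicodedata.normalize("NFC", line).split())


def dedup_lines(*sources: Iterable[str]) -> Iterator[str]:
    """Yield each distinct normalized line once, in first-seen order."""
    seen = set()  # stores fixed-size digests rather than full lines to save memory
    for source in sources:
        for line in source:
            text = normalize(line)
            if not text:
                continue  # skip blank lines
            digest = hashlib.sha1(text.encode("utf-8")).digest()
            if digest not in seen:
                seen.add(digest)
                yield text


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the two corpora (assumed: one sentence per line).
    oscar = ["नमस्ते दुनिया\n", "यह एक वाक्य है\n"]
    indic = ["यह  एक वाक्य है\n", "एक नया वाक्य\n"]
    for sentence in dedup_lines(oscar, indic):
        print(sentence)
```

With real data, the in-memory lists would be replaced by open file handles, since any iterable of lines works. If IndicCorpus already includes OSCAR as a subset, a pass like this would mostly just drop the overlap.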
This version of IndicCorpus does not contain OSCAR. However, the newer version, which you can find here, contains OSCAR as a subset: https://indicnlp.ai4bharat.org/corpora/
@anoopkunchukuttan How do I find the previous version of the corpus (the one that doesn't contain OSCAR)? By the way, the OSCAR corpus is generated from the Common Crawl corpus. When you said that IndicCorpus does not contain OSCAR, does that mean IndicCorpus does not contain any content from Common Crawl?
@anoopkunchukuttan Hello Sir, just a follow-up on the previous question.
- Does the corpus contain Wikipedia content?
- Is there a way I could get the previous version of IndicCorpus that doesn't contain the OSCAR corpus? I would be adding content from the OSCAR corpus separately.
- Is there a way I could get the corpus in unshuffled format?