yoruba-text
Add OSCAR corpus
Add the OSCAR corpus (https://oscar-corpus.com), Common Crawl text from the BBC, to the working corpus for ADR (automatic diacritic restoration) and other monolingual tasks.
| Language | Words (original) | Size (original) | File (original) | Words (deduplicated) | Size (deduplicated) | File (deduplicated) |
|---|---|---|---|---|---|---|
| Yoruba | 8,906 | 55K | yo.txt.gz | 3,518 | 27K | yo_dedup.txt.gz |
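
As a rough starting point for folding the file into the working corpus, the sketch below reads a gzip-compressed text file such as `yo_dedup.txt.gz`, drops exact duplicate lines, and counts whitespace-separated words. The helper names (`read_lines`, `dedup_lines`, `count_words`) and the local file path are hypothetical, and both exact line-level deduplication and whitespace tokenization are assumptions, so the counts may not match the table above.

```python
import gzip
from pathlib import Path


def read_lines(path):
    """Yield non-empty, stripped lines from a gzip-compressed UTF-8 text file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line


def dedup_lines(lines):
    """Drop exact duplicate lines, keeping the first occurrence in order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


def count_words(lines):
    """Count whitespace-separated tokens (an assumed, simple tokenization)."""
    return sum(len(line.split()) for line in lines)


if __name__ == "__main__":
    # Assumed location of the downloaded file; adjust as needed.
    path = Path("yo_dedup.txt.gz")
    if path.exists():
        unique = list(dedup_lines(read_lines(path)))
        print(f"{len(unique)} unique lines, {count_words(unique)} words")
```

Opening the file in text mode with an explicit `utf-8` encoding keeps Yoruba diacritics intact, which matters if the text is later used for ADR.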