yoruba-text
Yorùbá language training text for NLP, ASR and TTS tasks
Yorùbá text
This repository contains fully diacritized Yorùbá text normalized to Unicode Normalization Form C (NFC), in which each base letter and its combining diacritics are composed into a single precomposed character wherever Unicode defines one. The conversion uses the following code:
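import unicodedata

def convert_to_NFC(filename, outfilename):
    # Read the input as UTF-8 and compose base letters with their combining marks
    with open(filename, encoding='utf-8') as f:
        text = unicodedata.normalize('NFC', f.read())
    with open(outfilename, 'w', encoding='utf-8') as f:
        f.write(text)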
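To illustrate what NFC normalization does to diacritized text, here is a minimal sketch that uses only Python's standard `unicodedata` module. The helper `is_nfc` and the sample strings are illustrative assumptions and are not part of this repository.

```python
import unicodedata

def is_nfc(text):
    """Return True if text is already in NFC (normalizing leaves it unchanged)."""
    return unicodedata.normalize("NFC", text) == text

# "Yorùbá" typed with combining marks: u + U+0300 (grave), a + U+0301 (acute)
decomposed = "Yoru\u0300ba\u0301"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 8 6
print(is_nfc(decomposed), is_nfc(composed))  # False True

# o + dot below + acute: Unicode has no single code point for this combination,
# so NFC yields U+1ECD (ọ) followed by the combining acute U+0301.
o_dot_acute = unicodedata.normalize("NFC", "o\u0323\u0301")
print([hex(ord(c)) for c in o_dot_acute])  # ['0x1ecd', '0x301']
```

A check like `is_nfc` may be useful for confirming that new or corrected text is in the same form as the rest of the corpus before submitting it.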
Sources:
- Lagos-NWU conversational corpus
- Bíbélì Mímọ́ ní Èdè Yorùbá Òde-Òní
- The Yorùbá blog
- Asubiaro, T., Adegbola, T. et al. (2018). A Word-Level Language Identification Strategy for Resource-Scarce Languages
- Òwe Yorùbá
- Ìwé Ti Mọ́mọ́nì
- Kùránì (Qur'an) Mímọ́
Sources yet to be scraped and cleaned:
- BBC Yorùbá
- Yorùbá for Academic Purpose
- Yobá mọ oduá
- Àwa Ẹlẹ́rìí Jèhófà
- Orí Kìíní
- Iwé ti Nicé
- Alákọ̀wé
- Èdè Yorùbá Rẹwà
- Ìmọ̀_Ẹ̀rọ
- ọ̀rọ̀yorùbá
- Wikipedia
- Poetry of Ọláńrewájú Adépọ̀jù
Social Media sources:
- https://twitter.com/yobamoodua
- https://twitter.com/yoruba_proverbs
- https://www.facebook.com/oweyoruba
The text was gathered with permission from online sources and lightly preprocessed for use in NLP, TTS, and ASR applications. Some sentences may contain errors; please submit a pull request if you have corrections.
Resources
- https://clas.uiowa.edu/dwllc/allnet/yoruba-language-and-culture-resources
- https://glosbe.com/yo/en
BibTeX
If you want to cite this repo in your work, please use:
@misc{Orife_yoruba-text_2018,
  author = {Orife, Iroro and Fasubaa, Timilehin and Wahab, Olamilekan},
  month = {1},
  title = {{yoruba-text}},
  url = {https://github.com/Niger-Volta-LTI/yoruba-text},
  year = {2018}
}