Yorùbá language training text for NLP, ASR and TTS tasks

Yorùbá text

This repository contains fully diacritized Yorùbá text, converted to Unicode Normalization Form C (NFC, canonical composition), in which a base letter and its diacritics are composed into a single code point wherever a precomposed form exists. The conversion uses the following code:

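import unicodedata

def convert_to_NFC(filename, outfilename):
    # Read as UTF-8, compose diacritics (NFC), and write the result back out
    with open(filename, encoding='utf-8') as f:
        text = unicodedata.normalize('NFC', f.read())
    with open(outfilename, 'w', encoding='utf-8') as f:
        f.write(text)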
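
To illustrate what NFC normalization does to diacritized Yorùbá characters, here is a minimal sketch using only the standard library. The helper `is_nfc` and the sample strings are hypothetical and not part of this repository; they are only meant to show how a contribution could be checked before submitting it.

```python
import unicodedata

def is_nfc(text):
    # Hypothetical helper: True if the text is already in NFC form
    return unicodedata.normalize('NFC', text) == text

# "ù" typed as 'u' followed by U+0300 COMBINING GRAVE ACCENT (decomposed)
decomposed = 'u\u0300'
composed = unicodedata.normalize('NFC', decomposed)

print(len(decomposed), len(composed))          # 2 1
print(is_nfc(decomposed), is_nfc(composed))    # False True

# "ẹ́" has no single precomposed code point, so NFC yields
# U+1EB9 (e with dot below) followed by U+0301 COMBINING ACUTE ACCENT
e_dot_acute = unicodedata.normalize('NFC', 'e\u0323\u0301')
print([hex(ord(c)) for c in e_dot_acute])      # ['0x1eb9', '0x301']
```

The last case matters when counting or comparing characters: even in NFC, some Yorùbá letters that carry both an underdot and a tone mark still occupy two code points.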

Sources:

Sources yet to be scraped and cleaned

Social Media sources:

  • https://twitter.com/yobamoodua
  • https://twitter.com/yoruba_proverbs
  • https://www.facebook.com/oweyoruba

Text has been gathered with permission from online sources and lightly preprocessed for use in NLP, TTS, and ASR applications. Note that some of the sentences may contain errors; please submit a pull request if you have corrections!

Resources

  • https://clas.uiowa.edu/dwllc/allnet/yoruba-language-and-culture-resources
  • https://glosbe.com/yo/en

BibTeX

If you want to cite this repo in your work, please use:

@misc{Orife_yoruba-text_2018,
author = {Orife, Iroro and Fasubaa, Timilehin and Wahab, Olamilekan},
month = {1},
title = {{yoruba-text}},
url = {https://github.com/Niger-Volta-LTI/yoruba-text},
year = {2018}
}