iranlowo
Ìrànlọ́wọ́ is a utility library for analysis & (pre)processing of Yorùbá text → https://pypi.org/project/iranlowo
Ìrànlọ́wọ́
Ìrànlọ́wọ́ is a set of utilities to analyze & process Yorùbá text for NLP tasks. The focus is on helping software developers build large, clean text datasets for (further) diacritic restoration and machine translation tasks.
Features
ADR tools
- [X] Strip all diacritics from word-types
- [X] Verify that text is NFC or NFD
- [X] Normalize a corpus (from MS Word or elsewhere) → NFC
- [X] Split long sentences on certain characters like ";", ",", ":", etc.
- [X] Automatically restore correct diacritics using a pre-trained model
- [X] Find all variants of each word-type in a given corpus
- [ ] Partially strip diacritics from word-types
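To make the ADR items above concrete, here is a minimal sketch of what stripping diacritics, checking for NFC, and grouping word-type variants can look like. It uses only Python's standard `unicodedata` module and does not show Ìrànlọ́wọ́'s own API; the function names `strip_diacritics`, `is_nfc`, and `find_variants` are hypothetical and chosen for illustration only.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text):
    """Remove every combining mark (tone marks and under-dots)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def is_nfc(text):
    """Return True when the text is already in NFC form."""
    return unicodedata.normalize("NFC", text) == text


def find_variants(tokens):
    """Group diacritized word-types under their fully stripped form."""
    variants = defaultdict(set)
    for token in tokens:
        token = unicodedata.normalize("NFC", token)
        variants[strip_diacritics(token)].add(token)
    return dict(variants)


if __name__ == "__main__":
    print(strip_diacritics("lóòtóọ́ ni pé ọjọ́ gbogbo ni ti olè"))
    print(find_variants(["ọjọ́", "òjò", "ojo"]))
```

Normalizing to NFD first matters because a letter such as ọ́ may arrive either precomposed or as a base letter plus combining marks; decomposing makes both cases look the same before the marks are dropped.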
Ready to use webpage scrapers
- [X] Bíbélì Mímọ́ (Biblica, Bible Society of Nigeria)
- [ ] Yorùbá Blog
- [ ] BBC Yorùbá
Corpus analysis tools
- [X] Dataset character distribution
- [X] Dataset ambiguity statistics → Lexdif, etc for a given corpus
- [ ] Dataset scoring (proximity to correctly diacritized text, LM perplexity, KL divergence)
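As a rough illustration of the corpus statistics listed above, the sketch below computes a character distribution and one simple ambiguity measure: the number of distinct diacritized word-types per distinct stripped word-type. This is a standard-library sketch, not the library's implementation; `character_distribution` and `variant_ratio` are hypothetical names, and the ratio is an assumed stand-in that may differ from how Ìrànlọ́wọ́ computes Lexdif.

```python
import unicodedata
from collections import Counter


def character_distribution(text):
    """Relative frequency of each non-whitespace character, after NFC."""
    normalized = unicodedata.normalize("NFC", text)
    counts = Counter(ch for ch in normalized if not ch.isspace())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def _strip(token):
    """Drop all combining marks from a token."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def variant_ratio(tokens):
    """Distinct diacritized word-types per distinct stripped word-type."""
    marked = {unicodedata.normalize("NFC", t) for t in tokens}
    plain = {_strip(t) for t in marked}
    return len(marked) / len(plain) if plain else 0.0


if __name__ == "__main__":
    tokens = "òjò ọjọ́ ojo ni".split()
    print(variant_ratio(tokens))
    print(character_distribution("ọjọ́ gbogbo"))
```

A ratio of 1.0 means no stripped form maps to more than one diacritized form in the sample; higher values indicate more ambiguity for a restoration model to resolve. Note that in NFC some Yorùbá letters (for example ọ́) remain a precomposed letter plus a combining tone mark, so they are counted as two characters here.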
Installation
Obtainable from the Python Package Index (PyPI) → pip install iranlowo
Example
- Show computing environment and installation process
- Diacritize a phrase
$ python
Python 3.7.3 (default, Mar 27 2019, 16:54:48)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import iranlowo.adr as ránlọ
>>> ránlọ.diacritize_text("lootoo ni pe ojo gbogbo ni ti ole")
PRED AVG SCORE: -0.0037, PRED PPL: 1.0037
'lóòtóọ́ ni pé ọjọ́ gbogbo ni ti olè'
- Diacritize phrases. Note that we use ipython only because it renders nicer, easier-to-read text colours in the terminal!
Disclaimer
This is beta software. If you pass the diacritizer out-of-domain text (English, Pidgin, or any other non-Yorùbá text), you will get very marvelous, unpredictable black-box results.
Since this is a work in progress and we are steadily improving it, if you encounter any problems with correctness or performance, please submit a pull request with corrections or file an issue.
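One cheap way to catch obviously out-of-domain input before it reaches the diacritizer is a letter-based pre-check. The sketch below relies on the fact that the standard Yorùbá alphabet does not use c, q, v, x, or z; the helper `looks_out_of_domain` and its `threshold` value are hypothetical and not part of Ìrànlọ́wọ́. It is a weak heuristic that will miss plenty of non-Yorùbá text, so treat it as a first filter only.

```python
import unicodedata

# Letters that do not occur in the standard Yorùbá alphabet.
NON_YORUBA_LETTERS = set("cqvxz")


def looks_out_of_domain(text, threshold=0.02):
    """Flag text whose share of non-Yorùbá letters exceeds the threshold.

    The threshold is an illustrative assumption, not a tuned value.
    """
    base = unicodedata.normalize("NFD", text.lower())
    letters = [ch for ch in base if ch.isalpha()]
    if not letters:
        # Nothing to diacritize, so treat empty input as out of domain.
        return True
    foreign = sum(ch in NON_YORUBA_LETTERS for ch in letters)
    return foreign / len(letters) > threshold


if __name__ == "__main__":
    print(looks_out_of_domain("lootoo ni pe ojo gbogbo ni ti ole"))
    print(looks_out_of_domain("the quick brown fox jumps over the lazy dog"))
```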
License
This project is licensed under the MIT License.