awesome-kyrgyz-nlp
Kyrgyz language processing software, models and datasets.
Awesome Kyrgyz NLP
A curated list of awesome Kyrgyz language processing software, models and datasets. Inspired by awesome-ML.
The main focus is on open source tools, downloadable data and research papers with code.
If you want to contribute to this list (please do), send me a pull request. Also, a listed repository should be tagged as deprecated if:
- The repository's owners explicitly state that "this library is not maintained".
- It has had no commits for a long time (2-3 years).
Table of Contents
- Awesome Kyrgyz NLP
  - Table of Contents
  - Datasets
  - Pretrained models
  - Methods/Software
    - Morphology
  - Online Demos
  - Miscellaneous
Datasets
Corpora
- Manas-UdS: 1.2M words, 84 literary texts, 5 genres: novel, novelette, epic, minor epic, and fairy tale; lemmata, PoS tags, rich per-text metadata.
- kkWaC: Kyrgyz corpus from the web, 19M words, Jan 2012
- Kyrgyz in Leipzig Corpora Collection: Community data / Newscrawl (1M sentences) / Wikipedia (300K sentences)
- TilCorpusu: Kyrgyz corpus, 100M words, news+fiction, made public in July 2023
Character recognition
- Kyrgyz language hand-written letters (Kyrgyz MNIST): hand-written Kyrgyz alphabet letters collection for machine learning applications; original images (a total of 80213) have been transformed to 50x50 images, then to CSV format
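
A minimal sketch of turning such CSV rows back into 50x50 pixel grids, using only the standard library. The row layout here (an optional label in the first column followed by 2500 pixel values) is an assumption for illustration, not the dataset's documented format; check the actual files and adjust `label_first` and `has_header` accordingly.

```python
import csv
import io

SIDE = 50  # images were resized to 50x50, per the dataset description


def row_to_image(pixels, side=SIDE):
    """Reshape a flat sequence of side*side pixel values into a list of rows."""
    if len(pixels) != side * side:
        raise ValueError(f"expected {side * side} values, got {len(pixels)}")
    return [list(pixels[r * side:(r + 1) * side]) for r in range(side)]


def read_images(csv_file, label_first=True, has_header=False):
    """Yield (label, image) pairs from an open CSV file.

    ASSUMPTION: one image per row, optionally preceded by a label column.
    """
    reader = csv.reader(csv_file)
    if has_header:
        next(reader, None)  # skip the header row
    for row in reader:
        if not row:
            continue  # ignore blank lines
        label = row[0] if label_first else None
        values = row[1:] if label_first else row
        yield label, row_to_image([int(float(v)) for v in values])


# Synthetic one-row "file": label 7, all-black image with one white pixel at the end.
demo = io.StringIO("7," + ",".join(["0"] * 2499 + ["255"]) + "\n")
label, image = next(read_images(demo))
print(label, len(image), len(image[0]), image[-1][-1])
```

For the real data, pass an open file (for example `open(path, newline="")`) instead of the synthetic `demo` buffer.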
Raw text
- kloop corpus: 16'826 articles (sqlite3 DB file) + crawler code
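
Since the corpus ships as a sqlite3 DB file, a reasonable first step is to inspect its schema before querying it. The sketch below lists every table and its columns with Python's built-in `sqlite3` module. The `articles` table in the demo is hypothetical and only stands in for whatever the real file contains.

```python
import sqlite3


def list_tables(conn):
    """Return {table_name: [column names]} for every table in the database."""
    tables = {}
    names = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (name,) in names:
        # Quote the identifier so unusual table names do not break the PRAGMA.
        quoted = '"' + name.replace('"', '""') + '"'
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        cols = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
        tables[name] = [c[1] for c in cols]
    return tables


# Demo on an in-memory database with a HYPOTHETICAL schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
schema = list_tables(conn)
print(schema)
```

To inspect the downloaded corpus, replace `":memory:"` with the path to the DB file and skip the `CREATE TABLE` line.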
Morphology & Syntax
- UD project comments on difficulties in Turkish language processing; these may shed light on why parsing Kyrgyz is hard as well
- KTMU's UD Treebank, 781 sentences
- Verbal paradigms for Kyrgyz (100 Kyrgyz verbs fully conjugated in all tenses) by Aytnatova Alima, annotation for Unimorph by E. Chodroff
Named Entity Recognition
Text Classification
- Kyrgyz Multi-Label News Classification: [not published yet]
Word Similarity Data
- Kyrgyz Word Embedding Evaluation: [not published yet]
Instructions
- Machine-Translated Alpaca: Stanford Alpaca instructions translated into Kyrgyz using ChatGPT and Google Translate
Machine-readable dictionaries
- Country names table: Kyrgyz-Russian-English
- Thesaurus KyrSpell (however, unpacking it seems to break the license)
- Tatu Ylonen's enwiktionary-based dictionary (also please see the derived En-Ky Anki deck for language learners)
Pretrained models
- Polyglot morfessor — pretrained morfessor model, number 6
- fastText — 300-dimensional fastText vectors provided by the authors: bin, txt.
- compressed fastText — fasttext-ky-mini prepared by Liebl Bernhard in 2021.
- BERT-based NER — `bert-base-multilingual-cased` fine-tuned on Wikiann for NER on Kyrgyz. The author warns that this model is not usable and is built just as a proof of concept. Will be updated later.
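
The txt variant of the fastText vectors can be read without any extra dependencies. The sketch below assumes the usual fastText text layout: a header line with the vocabulary size and dimensionality, then one word per line followed by its vector components. The two 3-dimensional vectors in the demo are made-up toy values, not real embeddings.

```python
import io
import math


def load_vectors(lines, limit=None):
    """Parse fastText text-format vectors into {word: [floats]}."""
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])  # header is "<vocab size> <dimension>"
    vectors = {}
    for i, line in enumerate(it):
        if limit is not None and i >= limit:
            break  # read only the first `limit` (most frequent) words
        tokens = line.rstrip().split(" ")
        if len(tokens) <= dim:
            continue  # skip malformed lines
        word = " ".join(tokens[:-dim])
        vectors[word] = [float(t) for t in tokens[-dim:]]
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy data in the same layout (values are illustrative only).
demo = io.StringIO("2 3\nкитеп 1.0 0.0 0.0\nдептер 0.6 0.8 0.0\n")
vectors = load_vectors(demo)
similarity = cosine(vectors["китеп"], vectors["дептер"])
print(round(similarity, 3))
```

With the real file, pass `open(path, encoding="utf-8")` and a `limit` such as 50000 to keep memory use modest.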
Methods/Software
- spaCy basic support: tokenization, stopwords, `like_num`
Morphology
- Kyrgyz for Apertium: morphological analysis and generation, PoS-tagging; installation script: install_apertium_kir.sh. A much, much easier way: `import apertium; apertium.installer.install_module("kir")`.
- [DEPRECATED] kymopl: Kyrgyz morphology in Prolog
Mentioned in papers:
- TODO
Hate Speech detection
Other
- Tilchi electronic Russian-Kyrgyz dictionary, open source desktop application
- ӨҮҢизатор: demo code for a proof-of-concept letter replacement Telegram bot; fixes incorrect use of 'О', 'У', 'Н' in place of 'Ө', 'Ү', 'Ң'
- Number-to-words conversion (JavaScript) by @AzamatSooldaev
- Number-to-words conversion (TypeScript) by @timursaurus
- Telegram bot for Kyrgyz morphological analysis by @sasha-kir based on Apertium data for Kyrgyz
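
For readers who work in Python rather than JavaScript or TypeScript, the idea behind the number-to-words converters listed above can be sketched as follows. This is an independent illustration, not a port of either library. It covers 0 to 999,999 only and, as a convention chosen here, writes 100 and 1000 as "жүз" and "миң" without a leading "бир".

```python
ONES = ["", "бир", "эки", "үч", "төрт", "беш", "алты", "жети", "сегиз", "тогуз"]
TENS = ["", "он", "жыйырма", "отуз", "кырк", "элүү",
        "алтымыш", "жетимиш", "сексен", "токсон"]


def _below_thousand(n):
    """Return the list of words for 1..999 (empty list for 0)."""
    words = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        if hundreds > 1:
            words.append(ONES[hundreds])
        words.append("жүз")
    tens, ones = divmod(rest, 10)
    if tens:
        words.append(TENS[tens])
    if ones:
        words.append(ONES[ones])
    return words


def number_to_words(n):
    """Spell out an integer in Kyrgyz; supports 0..999999 in this sketch."""
    if not 0 <= n <= 999_999:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n == 0:
        return "нөл"
    thousands, rest = divmod(n, 1000)
    words = []
    if thousands:
        if thousands > 1:
            words.extend(_below_thousand(thousands))
        words.append("миң")
    words.extend(_below_thousand(rest))
    return " ".join(words)


print(number_to_words(2023))
```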
Online Demos
Miscellaneous
- Turkic Interlingua community and SIGTURK (ACL Turkic languages special interest group)
- Apertium's useful list of tools and other resources
- Online dictionaries and other useful resources on el-sozduk.kg
- Turkic languages-related resources compiled by Dr. Gülşen Eryiğit and her team at Istanbul Technical University