Awesome Kurdish
(last updated on 04/03/2024)
A curated list of awesome resources, tools and scientific papers for Kurdish language technology
Although I do my best to keep this page as comprehensive as possible, the list may still miss some of the fantastic small and big projects on Kurdish language processing. If you know of a project that is missing, please notify me by email or through our community on Gitter.
Are you interested in contributing to Kurdish language processing? Check out this post to see how you can do so.
News 🎉
March 2023
- A few datasets were added for automatic speech recognition and for Central Kurdish dialect identification and translation.
April 2023
- A few datasets were added for emotion analysis, summarization and news headline classification.
- Two projects were released for language identification of Zaza-Gorani and Kurdish languages.
- A benchmark was released for sentiment analysis of Central Kurdish.
Development
Resources
Language Models
- Kurdish Llama (Fine-tuned Llama model for Sorani)
Corpora
- CORDI (Central Kurdish varieties of Sulaymaniyah, Sanandaj, Mahabad, Erbil, Sardasht and Kalar)
- Open Super-large Crawled ALMAnaCH coRpus (OSCAR) (Sorani and Kurmanji)
- Pewan (Sorani and Kurmanji)
- Kurdish folkloric lyrics corpus (Sorani)
- AsoSoft corpus (Sorani)
- Kurdish Textbooks Corpus (Sorani)
- Zaza-Gorani corpus (Zazaki and Gorani)
- Southern Kurdish and Laki corpora (Southern Kurdish and Laki)
- Kurdish resources on Clarin
- University of Bamberg's corpora (Kurmanji and Laki)
Parallel corpora
- CORDI (Parallel corpus of Central Kurdish varieties of Sulaymaniyah, Sanandaj, Mahabad and Erbil along with Standard Central Kurdish and English)
- Ataman's Bianet corpus containing Turkish-English-Kurmanji aligned texts
- Ahmadi et al's corpus containing English-Kurmanji-Sorani aligned texts
- Tanzil: one Quran translation that can be aligned with translations in many other languages, including 11 in English (see this project)
- Bible translations in Kurmanji-Latin and Kurmanji-Cyrillic
- TED Talks subtitles
- HLP Colloquial Corpus #1 (Sorani and Kurmanji (Latin and Arabic)) (not free)
- A parallel corpus of Sorani-English text
- FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation (Sorani)
- AsoSoft Speech Corpus for Central-Kurdish Text-To-Speech (Sorani)
Dictionaries, terminologies and ontologies
Check out a comprehensive list of Kurdish dictionaries and beware of copyright issues in the following projects:
- Kurdî Wikibase (Sorani, Kurmanji, Gorani and Southern Kurdish)
- Kurdish lexicographical resources in Ontolex-Lemon (Sorani, Kurmanji, Gorani and Southern Kurdish)
- Check Dolan Hêriş's repositories for a list of Kurdish dictionaries and tools to extract words
- KurdNet, the Kurdish WordNet (Sorani)
- Kurdish annotated lexicon (Sorani)
- Freedict word lists (Sorani and Kurmanji)
- Translation Initiative for COVID-19 including Sorani and Kurmanji
- MyMemory dictionaries with an open-access API (Sorani)
Datasets
- Manchester Database of Kurdish Dialects
- Dataset of Kurdish poems with meter and form tags
- A Twitter dataset (Sorani and Kurmanji)
- Datasets for text to Kurdish Sign Language (Sorani)
- A dataset for speech recognition (Sorani)
- Universal dependency (Kurmanji)
- Web Inventory of Transcribed and Translated Talks (WIT3) (Sorani)
- Sorani and Kurmanji morphological datasets in UniMorph
- FakeKurdNews, an annotated dataset for Sorani Kurdish fake news detection
- Profane language dataset (Sorani)
- Cyberbullying dataset (Sorani)
- Summarization dataset (Sorani)
- Sentiment analysis (Sorani)
- Emotion analysis (Sorani)
- News headline classification (Sorani)
Automatic speech recognition
- CORDI (Central Kurdish varieties)
- KASET - Kurmanji and Sorani Kurdish Speech and Transcripts
- Whisper model on Central Kurdish
- Kurdish spoken dialect recognition using x-vector speaker embedding (Northern, Central, Southern Kurdish, Hawrami & Zazaki)
Benchmarks
- Morphological analysis:
  - KurdishHunspell evaluation datasets (Sorani)
- Tokenization:
  - KurdishTokenization (Sorani, Kurmanji)
  - A sentence-segmented dataset (Sorani)
- Transliteration
- Spelling error correction
- Sentiment analysis:
  - Sentiment Analysis (Sorani)
- Unconventional writing normalization
Other resources
Word Embeddings:
- fastText word vectors (Sorani and Kurmanji)
- Polyglot's word embeddings
Tools
Fundamental processing
- Kurd-Spell
- Wergor for transliteration (Sorani and Kurmanji)
- Kurdish Tokenization
- Jedar stemmer
- Apertium project for Kurmanji and Sorani morphological analysis
- Kurdish Hunspell for Sorani morphological analysis, spell checking, stemming and lemmatization
- A finite-state morphological analyzer for Central Kurdish (Sorani)
- Part-of-speech tagger (Sorani)
- Alexina Framework: morphological analysis and POS-tagger for Sorani (soralex) and Kurmanji (kurlex)
- Kurdspell for Sorani spell checking
- Apertium rule-based Sorani spell-checker
- Gende Stemmer (Sorani)
- Conversion of numbers into words (Sorani and Kurmanji)
- Conversion of words into IPA (Kurmanji)
Machine translation
- Apertium (Sorani and Kurmanji)
- Kurdish MT (Sorani)
Named-entity recognition
- Autoregressive Entity Retrieval (Kurmanji)
Optical character recognition
- Kurdish Handwritten Words (Sorani)
Libraries
- Kurdish Language Processing Toolkit: a natural language processing toolkit in Python
- Kurdînûs: pure JavaScript tools for transliteration, text conversion and normalization
- Kurdish Language Library: converting characters and digits in Persian, English and Arabic to Kurdish and vice versa
- AsoSoft's Library for Kurdish: normalizer, numeral converter, grapheme-to-phoneme converter in C#
Language identification
- CORDI (Central Kurdish varieties of Sulaymaniyah, Sanandaj, Mahabad, Erbil, Sardasht and Kalar)
- Language identification of Kurdish and Zaza-Gorani languages
- Perso-Arabic and KurdishLID projects covering many languages, including Kurmanji, Sorani, Southern Kurdish, Gorani and Zazaki
- Language identifier (Sorani and Kurmanji)
Other
In addition to these, you can find further information in other repositories and pages as follows:
Research
These references are provided based on the data collected in the paper entitled KLPT – Kurdish Language Processing Toolkit. Note that references are provided in the bibliography file.
| Reference | Year | Field | Dialects |
|---|---|---|---|
| esmaili2013sorani | 2013 | Dialectology | Sorani, Kurmanji |
| hassani2016automatic | 2016 | Dialectology | Sorani, Kurmanji |
| malmasi2016subdialectal | 2016 | Dialectology | Sorani |
| al2017kurdish | 2017 | Dialectology | Sorani, Kurmanji, Gorani |
| amani:hal-03262435 | 2021 | Dialectology | Kurdish, Zazaki & Gorani |
| ahmadi2024cordi | 2024 | Dialectology | Sorani varieties |
| mohammed2012automatic | 2012 | Information retrieval and Text mining | Sorani |
| esmaili2012challenges | 2012 | Information retrieval and Text mining | Sorani |
| littell2016named | 2016 | Information retrieval and Text mining | Sorani |
| hassani2017method | 2017 | Information retrieval and Text mining | Sorani, Kurmanji |
| esmaAl-Talabaniili2014towards | 2014 | Information retrieval and Text mining | Sorani, Kurmanji |
| jaf2016simple | 2016 | Information retrieval and Text mining | Sorani |
| rashid2017robust | 2017 | Information retrieval and Text mining | Sorani |
| rashid2017automatic | 2017 | Information retrieval and Text mining | Sorani |
| saeed2018improving | 2018 | Information retrieval and Text mining | Sorani |
| mustafa2018kurdish | 2018 | Information retrieval and Text mining | Sorani |
| saeed2018evaluation | 2018 | Information retrieval and Text mining | Sorani |
| ahmadi2019wergor | 2019 | Information retrieval and Text mining | Sorani |
| mahmudi2021automated | 2021 | Information retrieval and Text mining | Sorani |
| abdulrahman2022lmspell | 2022 | Information retrieval and Text mining | Sorani |
| esmaili2013building | 2013 | Lexical resources | Sorani |
| aliabadi2014towards | 2014 | Lexical resources | Sorani |
| aliabadi2014semi | 2014 | Lexical resources | Sorani |
| ataman2018bianet | 2018 | Lexical resources | Kurmanji |
| ahmadi2019towards | 2019 | Lexical resources | Sorani, Kurmanji, Gorani |
| abdulrahman2019developing | 2019 | Lexical resources | Sorani |
| abdulrahman2020using | 2020 | Lexical resources | Sorani |
| veisi2020toward | 2020 | Lexical resources | Sorani |
| ahmadi2020corpus | 2020 | Lexical resources | Sorani |
| ahmadi-2020-building | 2020 | Lexical resources | Zaza, Gorani |
| veisi2021jira | 2021 | Lexical resources | Sorani |
| azin2021sk | 2021 | Lexical resources | Southern Kurdish |
| hassani2017kurdish | 2017 | Machine Translation | Sorani, Kurmanji |
| kaka2018english | 2018 | Machine Translation | Sorani |
| ahmadi2020machine | 2020 | Machine Translation | Sorani |
| goyal2021flores | 2021 | Machine Translation | 101 languages incl. Sorani |
| amini2021central | 2021 | Machine Translation | Sorani |
| ahmadi2022leveraging | 2022 | Machine Translation | Sorani |
| ahmadi2024cordi | 2024 | Machine Translation | Sorani |
| baban1995programmable | 1995 | Morphological and syntactic analysis | Sorani |
| walther2010developing | 2010 | Morphological and syntactic analysis | Sorani |
| walther2010fast | 2010 | Morphological and syntactic analysis | Kurmanji |
| salavati2013stemming | 2013 | Morphological and syntactic analysis | Sorani |
| jaf2014stemmer | 2014 | Morphological and syntactic analysis | Sorani |
| jaf2016chapter | 2016 | Morphological and syntactic analysis | Sorani |
| gokirmak2017dependency | 2017 | Morphological and syntactic analysis | Kurmanji |
| salavati2018building | 2018 | Morphological and syntactic analysis | Sorani |
| mustafa2018kurdish | 2018 | Morphological and syntactic analysis | Sorani |
| ahmadi2020towards | 2020 | Morphological and syntactic analysis | Sorani |
| ahmadi-2020-tokenization | 2020 | Morphological and syntactic analysis | Sorani, Kurmanji |
| ahmadi2021modelling | 2021 | Morphological and syntactic analysis | Sorani |
| ahmadi2020Hunspell | 2021 | Morphological and syntactic analysis | Sorani |
| naserzade2021ckmorph | 2021 | Morphological and syntactic analysis | Sorani |
| ahmadi2023revisiting | 2023 | Morphological and syntactic analysis | Sorani |
| mohammed2012uniqueness | 2012 | Optical character recognition | Sorani |
| mohammed2013handwritten | 2013 | Optical character recognition | Sorani |
| shaltookisentiment | 2016 | Optical character recognition | Sorani |
| zarro2017recognition | 2017 | Optical character recognition | Sorani |
| yaseen2018kurdish | 2018 | Optical character recognition | Sorani |
| dinler2018kurdish | 2018 | Optical character recognition | Sorani |
| app11209752 | 2021 | Optical character recognition | Sorani |
| kaka2017building | 2017 | Other | Sorani |
| mahmudi2021automatic | 2021 | Other | Sorani |
| ahmadi2021ickl | 2021 | Other | Sorani |
| ahmadi2023script | 2023 | Other | Sorani, Kurmanji, Gorani |
| hashim2018kurdish | 2018 | Sign language recognition | Sorani |
| kamal-hassani-2020-towards | 2020 | Sign language recognition | Sorani |
| daneshfar2009implementation | 2009 | Speech recognition | Sorani |
| barkhoda2009comparison | 2009 | Speech recognition | Sorani |
| bahrampour2009implementation | 2009 | Speech recognition | Sorani |
| hassani2011kurdish | 2011 | Speech recognition | Sorani |
| dinler2017formant | 2017 | Speech recognition | Kurmanji |
| dinler2018extraction | 2018 | Speech recognition | Sorani, Kurmanji |
| qader2019kurdish | 2019 | Speech recognition | Sorani |
| delgado2024kaset | 2024 | Speech recognition | Sorani, Kurmanji |
| ahmadi2024cordi | 2024 | Speech recognition | Sorani varieties |
| ahmadi-2020-klpt | 2020 | Toolkits | Sorani, Kurmanji |
| de2021multilingual | 2021 | Named-entity recognition | Kurmanji |
| abdullah2022 | 2022 | Sentiment analysis | Sorani |
| awlla2022 | 2022 | Sentiment analysis | Sorani |
| amin2022kurdish | 2022 | Sentiment analysis | Sorani |
| hameed2023sentiment | 2023 | Sentiment analysis | Sorani |
| zuhair2021 | 2021 | Other | Sorani |
| kamala2022kurdish | 2022 | Other | Sorani |
| ahmadi2023fieldmatters | 2023 | Language identification | Sorani, Kurmanji, Southern Kurdish, Zazaki, Gorani |
| ahmadi2023pali | 2023 | Language identification | Sorani, Kurmanji, Southern Kurdish, Gorani |
Cite this repository
If you find the provided data useful for your project, feel free to use it, and please cite the following paper:
@inproceedings{ahmadi-2020-klpt,
title = "{KLPT} {--} {K}urdish Language Processing Toolkit",
author = "Ahmadi, Sina",
booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.nlposs-1.11",
doi = "10.18653/v1/2020.nlposs-1.11",
pages = "72--84"
}