Thai NLP Resource
A collection of Thai Natural Language Processing (NLP) software libraries, dictionaries, and corpora.
Pull requests are always welcome.
Libraries/Services
Thai Character Cluster
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| JTCC | Thai Character Cluster | Java |  | GPL-3.0 | Wittawat |
| TCC | Thai Character Cluster | Python |  | Apache 2.0 | Wannaphong |
Sentiment Analysis
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| sentiment_analysis_thai |  |  |  |  | JagerV3 |
Soundex
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| PyThaiNLP |  | Python 3 | LK82 + Udom83 | Apache 2.0 | Korakot, GitHub |
Word Segmentation
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Chamkho | Lao/Thai word segmentation | Rust |  | LGPL | GitHub |
| CutKum | Thai word segmentation with deep learning in TensorFlow. RNN. | Python | 93% F-measure. | MIT | Pucktada, GitHub |
| CutThai | Thai word segmentation written in CoffeeScript | CoffeeScript |  | MIT | Pureexe/cutthai GitHub |
| DeepCut | A Thai word tokenization library using a deep neural network. CNN. | Python | 98.8% F-measure. | MIT | rkcosmos, GitHub |
| Lexto: Thai Lexeme Tokenizer |  | Java |  | LGPL | NECTEC |
| Lexto |  | Python 2 |  | LGPL | GitHub |
| Lexto |  | Python 3 |  | LGPL | GitHub |
| Multi-Candidate-Word-Segmentation | Multi-candidate word segmentation for the Thai language | Python, RNN, LSTM | 97.0% F-measure (word level), 98.95% F-measure (boundary level) | MIT | paper, GitHub |
| PyThaiNLP |  | Python 3 | Maximal matching and various other engines | Apache 2.0 | GitHub |
| Swath | SWATH (Smart Word Analysis for THai) is a word segmentation tool for Thai | C | Longest Matching, Maximal Matching and Part-of-Speech Bigram. | GPL | Paisarn Charoenpornsawat, CMU |
| SynThai | Thai word segmentation and part-of-speech tagging with deep learning. RNN. LSTM. | Python | 99.2% F-measure | MIT | KenjiroAI, GitHub |
| Thai Language Toolkit (tltk) | Based on a 2002 paper by Wirote Aroonmanakun. Word segmentation is based on a maximum collocation approach. Syllable segmentation is based on trigram statistics. (Dataset is included.) | Python | 97.86% F-measure. (Tested on a different test set, so it is not directly comparable with the other models.) | GPLv3 | PyPI |
| Wordcut | Thai word breaker for Node.js | JavaScript, Node.js |  | LGPL-3.0 | veer66, GitHub |
| wordcutpy | A simple Thai word tokenizer written in 1 Python file | Python 3 |  | LGPL-3.0 | veer66, GitHub |
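
Several entries in this section mention dictionary-based matching (Swath, for example, lists Longest Matching and Maximal Matching). As a rough illustration only, the sketch below shows greedy longest matching in Python. It is not the code of any library listed here; the function name `longest_matching` and the toy dictionary `TOY_DICT` are hypothetical and exist just for this example.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right segmentation.

    At each position, take the longest dictionary word that matches.
    If nothing matches, emit a single character and move on.
    """
    max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink the window.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # unknown character: emit it on its own
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy dictionary, for illustration only.
TOY_DICT = {"ตา", "ตาก", "กลม", "ลม"}

print(longest_matching("ตากลม", TOY_DICT))  # ['ตาก', 'ลม']
```

The input "ตากลม" can also be read as "ตา" + "กลม". A greedy matcher always commits to the longest word at the current position, so it cannot weigh that alternative. Handling this kind of ambiguity is one reason the libraries above use additional techniques such as part-of-speech bigrams or neural models.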
Part of Speech Tagging (POS Tagging)
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Chart-POS | Thai POS Tagger | C |  | All rights reserved | AIAT, KINDML, Thanaruk T., tchayintr, Demo at iApp |
| Jitar+NAiST | A simple Trigram HMM part-of-speech tagger | Java |  |  | Ver66, Jitar + NAiST, 1 + NAiST, 2 |
| SynThai | Thai word segmentation and part-of-speech tagging with deep learning. RNN. LSTM. | Python | 0.9163 F-measure. RNN. LSTM | MIT | KenjiroAI, GitHub |
Named Entity Recognition
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Named Entity Tagging (Thai NEST) | Thai Named Entity tagging Specification and Tools |  |  | GPL | KINDML, SIIT, AIAT |
| ThaiNER | Thai Named Entity Recognition for PyThaiNLP | Python |  | Apache 2.0 (code) & CC BY 3.0 (Dataset) | ThaiNER |
News Structure Tagging
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| News Structure Tagging Program | Thai News Structure Tagging Program |  | Metadata tagging, Structure tagging, Automatic News Title Generation | GPL | AIAT |
Syntactic Parsing & Tools
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Chart-parser | Extracts syntactic structure from a POS-tagged sentence. | C |  | All rights reserved | AIAT, KINDML, Thanaruk T., tchayintr, Demo at iApp |
| Grammar Processing | Labelled Brackets -> Context Free Grammars (CFGs) | Python | Transform and compute probability |  | tchayintr |
Word Embedding
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| kobkrit-word-embedding | TensorFlow implementation of Thai word embedding | Python | Source code, Example, Word distance graph | LGPL | Kobkrit V. |
Question Answering (Machine Comprehension)
| Service | Description | License | Author & Link |
|---|---|---|---|
| Thai Machine Comprehension (ThaiMC) | Bidirectional Attention Flow | Copyright (As the service) | iApp-AI |
Emojification
Corpus and Dataset
Dictionaries / Translation Pairs
| Library | Description | Size | Features | License | Link |
|---|---|---|---|---|---|
| LEXiTRON | Thai<->English Dictionary |  | TH->EN, EN->TH | LEXiTRON License | NECTEC |
| Transliteration Corpus |  | 31K pairs | Thai-Eng Translation Pair | CC BY-NC-SA 3.0 TH | NECTEC |
| Yaitron | LEXiTRON in machine readable format (XML) |  | TH->EN, EN->TH | LEXiTRON License | Veer66 Schema, Data & Conversion Code |
Downloadable Text Corpus
| Library | Description | Size | Features | License | Link |
|---|---|---|---|---|---|
| Click Bait Sentences | Thai click bait sentences | 330 sent. (90.7KB) |  | MIT | Wannaphongcom |
| InterBEST 2009/2010 |  | 5M words | Word Seg. | CC BY-NC-SA 3.0 TH | NECTEC |
| ORCHID |  | 30K sent. | Word Seg., POS Tagged. | CC BY-NC-SA 3.0 TH | NECTEC |
| Prime Minister 29 | Prime Minister 29's Speech Sentences | 338KB | Word segmented, named entity tagged | MIT | Wannaphongcom |
| thai-jokes-corpus | Cleaned Thai Jokes Corpus | 457 jokes |  | GPLv3 | iApp Technology |
| Thai named entity corpora | Named entity corpora by Wirote Aroonmanakun's students | 266KB-1.5MB | Syllable segmented, word segmented, named entity tagged | GPLv3 (unconfirmed; tltk uses this license) | นัชชา ถิระสาโรช Data, ศศิวิมล กาลันสีมา Data, ณัฐดาพร เลิศชีวะ Data |
| THAI-NEST | Thai-NEST: Thai Named Entity tagging Specification and Tools | 45K+ named entity tokens | Named entity tagged | LGPL | KINDML |
| Thai Sentimental Word List | Thai Sentimental Words List | 52KB | Separated words as Adj, V | MIT | Wannaphongcom |
| Thai Wikipedia | Formal Articles | 1.49GB (~213.1 MB compressed) | XML | GFDL | WIKIPEDIA |
| Thai WordNet | THE CONSTRUCTION OF THAI WORDNET OF 1ST ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD AND WITH DICTIONARIES OF DIFFERENT COMPILATIONAL APPROACHES (ธนนท์ หลีน้อย); THE CONSTRUCTION OF THAI WORDNET OF 2ND ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD : A STUDY OF THE DIVERSITY OF MEANINGS AFFECTING TRANSLATIONAL ACCURACY (ปริศนา อัครพุทธิพร) |  | WordNet | N/A | ธนนท์ หลีน้อย 2008, ปริศนา อัครพุทธิพร Data 2008 |
| TNC Top-5000 Words | Word frequency | 5,000 words | Frequency of Thai words in various genres, EXCEL | All rights reserved | CHULA |
| Toxicity in Thai Tweet Corpus | Tokyo Metropolitan University Natural Language Processing Group |  | Each tweet is labeled as toxic or non-toxic | CC BY-NC 4.0 | tmu-nlp |
| Wisesight Sentiment Corpus | Social media messages with sentiment labels (positive, neutral, negative, question). | ~26,700 messages | Sentiment label, Question label | Public domain | PyThaiNLP |
Web Query Text Corpus
| Library | Description | Size | Features | License | Link |
|---|---|---|---|---|---|
| Thai National Corpus 2 |  | 32M words | Query text by genre, domain | All rights reserved | CHULA |
| Thai Medical Document |  | 3,594 docs | Document and dynamic keyword map | All rights reserved | KINDML, SIIT |
| Southeast Asian Languages Library | Thai News, Web Text, Pop Music, Literature, Toponyms | 20M chars | Phrase around a search text |  | SEALang |
| HSE Thai Corpus | Modern texts written in the Thai language (mostly news websites) | 50M tokens | Query by word form, lexeme, translation, grammatical attributes, lexical attributes |  | HSE School of Linguistics |
Parallel Corpus
| Library | Description | Size | Features | License | Link |
|---|---|---|---|---|---|
| TALPCo | TUFS Asian Language Parallel Corpus | 1,327 sent. | Open parallel corpus consisting of Japanese sentences and their translations into Burmese (Myanmar; the official language of the Republic of the Union of Myanmar), Malay (the national language of Malaysia, Singapore and Brunei), Indonesian, Thai, Vietnamese and English | CC BY 4.0 | TALPCo |
Pre-trained Language Models
| Pre-trained Model | Description | Size | Dimensions | License | Link |
|---|---|---|---|---|---|
| fastText | Skip-Gram model trained on Wikipedia using fastText |  | 300 | CC BY-SA 3.0 | Facebook + Bin & Text + Text Only |
| thai2fit | ULMFiT on Wikipedia. Perplexity of 46.80959 with 60,002 embeddings. | 70MB | 300 | MIT | thai2vec / PyThaiNLP |
| thbert | Yet another pre-trained BERT model, specifically for Thai |  |  | Apache 2.0 | tchayintr |
Benchmarks
Thai Text Classification Benchmarks
Tools
Corpus extractors
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| BEST2010 cooker | A tool for extracting segmented words from the Thai segmented BEST2010 corpus | Python 3 | Extracting segmented words, features, and data divisions | Apache 2.0 | tchayintr |
Not found? Try another Thai NLP awesome list or resource (like this one):
https://resources.aiat.or.th/
Acknowledgements