nlpO3
Thai Natural Language Processing library in Rust, with Python and Node bindings. Formerly oxidized-thainlp.
Features
- Thai word tokenizer
- use maximal-matching dictionary-based tokenization algorithm and honor Thai Character Cluster boundaries
- 2.5x faster than a similar pure-Python implementation (PyThaiNLP's newmm)
- load a dictionary from a plain text file (one word per line) or from a Vec<String>
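To give a feel for dictionary-based matching, here is a minimal sketch of greedy longest-match segmentation using only the Rust standard library. It is a deliberate simplification and not nlpO3's newmm implementation: it does not consider Thai Character Cluster boundaries, and it emits any character that starts no dictionary word as a one-character token. The function name longest_match_segment is hypothetical.

```rust
use std::collections::HashSet;

/// Greedy longest-match segmentation over Unicode scalar values.
/// Illustrative only: this is NOT nlpO3's newmm and it ignores
/// Thai Character Cluster boundaries.
fn longest_match_segment(text: &str, dict: &HashSet<String>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    // The longest dictionary entry (in chars) bounds how far we look ahead.
    let max_len = dict.iter().map(|w| w.chars().count()).max().unwrap_or(0);
    let mut tokens = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let limit = max_len.min(chars.len() - start);
        // Try the longest candidate first and shrink until a word matches.
        let mut matched = 0;
        for len in (1..=limit).rev() {
            let candidate: String = chars[start..start + len].iter().collect();
            if dict.contains(&candidate) {
                matched = len;
                break;
            }
        }
        // No dictionary word starts here: emit one char as an unknown token.
        let len = if matched == 0 { 1 } else { matched };
        tokens.push(chars[start..start + len].iter().collect());
        start += len;
    }
    tokens
}
```

A greedy scan like this can pick a long first word that leaves the rest of the text unmatchable, which is one reason real tokenizers consider several candidate segmentations rather than committing to the first match.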
Dictionary file
- In the interest of library size, nlpO3 makes no assumption about which dictionary the developer would like to use, so it does not ship with one. A dictionary is required for the dictionary-based word tokenizer.
- For a tokenization dictionary, try:
- words_th.txt from PyThaiNLP - around 62,000 words (CC0)
- word break dictionary from libthai - consists of dictionaries in different categories, with a make script (LGPL-2.1)
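If you want to inspect or filter a dictionary before handing it to the tokenizer, the one-word-per-line format is easy to read yourself. The sketch below uses only the Rust standard library; the helper names parse_word_list and load_word_list are hypothetical, and trimming whitespace and skipping blank lines are assumptions of this sketch, not documented nlpO3 behavior.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Parse dictionary text (one word per line) into a word list.
/// Assumption of this sketch: surrounding whitespace is trimmed
/// and blank lines are skipped.
fn parse_word_list(contents: &str) -> Vec<String> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect()
}

/// Read a dictionary file from disk and parse it into a word list.
fn load_word_list(path: &Path) -> io::Result<Vec<String>> {
    let contents = fs::read_to_string(path)?;
    Ok(parse_word_list(&contents))
}
```

The resulting Vec<String> has the shape that NewmmTokenizer::from_word_list takes in the Rust usage section.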
Usage
Command-line interface
echo "ฉันกินข้าว" | nlpo3 segment
Bindings
from nlpo3 import load_dict, segment
load_dict("path/to/dict.file", "dict_name")
segment("สวัสดีครับ", "dict_name")
As Rust library
In Cargo.toml:
[dependencies]
# ...
nlpo3 = "1.3.2"
Create a tokenizer using a dictionary from file, then use it to tokenize a string (safe mode = true, and parallel mode = false):
use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;
let tokenizer = NewmmTokenizer::new("path/to/dict.file");
let tokens = tokenizer.segment("ห้องสมุดประชาชน", true, false).unwrap();
Create a tokenizer using a dictionary from a vector of Strings:
let words = vec!["ปาลิเมนต์".to_string(), "คอนสติติวชั่น".to_string()];
let tokenizer = NewmmTokenizer::from_word_list(words);
Add words to an existing tokenizer:
tokenizer.add_word(&["มิวเซียม"]);
Remove words from an existing tokenizer:
tokenizer.remove_word(&["กระเพรา", "ชานชลา"]);
Build
Requirements
Steps
Generic test:
cargo test
Build API document and open it to check:
cargo doc --open
Build (remove --release to keep debug information):
cargo build --release
Check target/ for build artifacts.
Development documents
- Notes on custom string
Issues
Please report issues at https://github.com/PyThaiNLP/nlpo3/issues