nlpO3

Thai Natural Language Processing library in Rust, with Python and Node bindings. Formerly oxidized-thainlp.

Features

  • Thai word tokenizer
    • uses a maximal-matching, dictionary-based tokenization algorithm and honors Thai Character Cluster boundaries
      • 2.5x faster than a similar pure Python implementation (PyThaiNLP's newmm)
    • loads a dictionary from a plain text file (one word per line) or from a Vec<String>
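
To illustrate what "maximal matching" means, here is a minimal, self-contained sketch of the textbook formulation: among all ways to split the text, prefer the one with the fewest tokens. This is not nlpO3's actual implementation. The function name `maximal_match` is hypothetical, the sketch works on Unicode scalar values, and it ignores Thai Character Cluster boundaries, which the real tokenizer honors.

```rust
use std::collections::HashSet;

/// Illustrative only (not nlpO3's code): split `text` into the fewest tokens.
/// A token is either a dictionary word or a single character that no chosen
/// word covers. Thai Character Cluster boundaries are NOT considered here.
fn maximal_match(text: &str, dict: &HashSet<String>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    // Longest dictionary entry, measured in characters, bounds the inner loop.
    let max_len = dict.iter().map(|w| w.chars().count()).max().unwrap_or(0);

    // best[i] = (token count of the best split of chars[i..], end of its first token)
    let mut best: Vec<(usize, usize)> = vec![(0, n); n + 1];
    for i in (0..n).rev() {
        // Fallback: emit chars[i] alone as one token.
        let mut choice = (best[i + 1].0 + 1, i + 1);
        let limit = n.min(i + max_len);
        for j in (i + 1)..=limit {
            let candidate: String = chars[i..j].iter().collect();
            if dict.contains(&candidate) {
                let cost = best[j].0 + 1;
                // `<=` prefers a dictionary word over the fallback on ties.
                if cost <= choice.0 {
                    choice = (cost, j);
                }
            }
        }
        best[i] = choice;
    }

    // Follow the recorded choices from the start of the text.
    let mut tokens: Vec<String> = Vec::new();
    let mut i = 0;
    while i < n {
        let j = best[i].1;
        tokens.push(chars[i..j].iter().collect::<String>());
        i = j;
    }
    tokens
}
```

With the toy dictionary {"ab", "abc", "cde"} and the input "abcde", a greedy longest-first matcher would take "abc" and then be left with two unmatched characters, while this sketch returns "ab" and "cde". That difference is the reason to search for the shortest segmentation rather than matching greedily.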

Dictionary file

  • In the interest of library size, nlpO3 does not bundle a dictionary and makes no assumption about which dictionary the developer would like to use. The dictionary-based word tokenizer requires one, so you must supply your own.
  • For tokenization dictionary, try

Usage

Command-line interface

  • nlpo3-cli (available on crates.io)
echo "ฉันกินข้าว" | nlpo3 segment

Bindings

  • Node.js
  • Python (available on PyPI)
from nlpo3 import load_dict, segment

load_dict("path/to/dict.file", "dict_name")
segment("สวัสดีครับ", "dict_name")

As a Rust library

In Cargo.toml:

[dependencies]
# ...
nlpo3 = "1.3.2"

Create a tokenizer from a dictionary file, then use it to tokenize a string (safe mode = true, parallel mode = false):

use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;

let tokenizer = NewmmTokenizer::new("path/to/dict.file");
let tokens = tokenizer.segment("ห้องสมุดประชาชน", true, false).unwrap();

Create a tokenizer using a dictionary from a vector of Strings:

let words = vec!["ปาลิเมนต์".to_string(), "คอนสติติวชั่น".to_string()];
let tokenizer = NewmmTokenizer::from_word_list(words);
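
A word list like this can also be assembled from a plain text file with one word per line, for example to filter or merge entries before building the tokenizer. The following is a minimal sketch using only the standard library. `read_word_list` is a hypothetical helper, not part of nlpO3, and trimming whitespace and skipping blank lines are assumptions about how such a file should be cleaned up.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

/// Hypothetical helper (not part of nlpO3): read a one-word-per-line file
/// into a Vec<String>, trimming whitespace and skipping blank lines.
fn read_word_list<P: AsRef<Path>>(path: P) -> io::Result<Vec<String>> {
    let reader = BufReader::new(File::open(path)?);
    let mut words = Vec::new();
    for line in reader.lines() {
        let line = line?; // propagate I/O and UTF-8 decoding errors
        let word = line.trim();
        if !word.is_empty() {
            words.push(word.to_string());
        }
    }
    Ok(words)
}
```

The resulting vector has the same type as `words` in the example above, so it could be passed to `NewmmTokenizer::from_word_list` in the same way.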

Add words to an existing tokenizer:

tokenizer.add_word(&["มิวเซียม"]);

Remove words from an existing tokenizer:

tokenizer.remove_word(&["กระเพรา", "ชานชลา"]);

Build

Requirements

Steps

Generic test:

cargo test

Build the API documentation and open it in a browser:

cargo doc --open

Build (remove --release to keep debug information):

cargo build --release

Check target/ for build artifacts.

Development documents

  • Notes on custom string

Issues

Please report issues at https://github.com/PyThaiNLP/nlpo3/issues