uax29
A tokenizer based on Unicode text segmentation (UAX #29), for Go. Split words, sentences and graphemes.
This package tokenizes (splits) words, sentences and graphemes, based on Unicode text segmentation (UAX #29), for Unicode version 13.0.0. Details and usage are in the respective packages.
Why tokenize?
Any time our code operates on individual words, we are tokenizing. Often we do it ad hoc, for example by splitting on spaces, which gives inconsistent results. The Unicode standard is better: it is multi-lingual, and it handles punctuation, special characters, and so on.
Conformance
We test against the official Unicode test suites.
See also
jargon, a text-pipeline package for the command line and Go, which consumes this package.