tokenizer topic
tokenizer
[DISCONTINUED] Source code tokenizer
html5gum
A WHATWG-compliant HTML5 tokenizer and tag soup parser
JapaneseTokenizers
Aims to make JapaneseTokenizer as easy to use as possible
SoMaJo
A tokenizer and sentence splitter for German and English web and social media texts.
vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer
peast
JavaScript parser written in PHP that generates an AST from your code according to the ECMAScript specification
rust-tokenizers
Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models
works-for-me
Collection of developer toolkits
syntok
Text tokenization and sentence segmentation (segtok v2)
sentence-splitter
Text-to-sentence splitter using the heuristic algorithm by Philipp Koehn and Josh Schroeder.