tokenizer topic

List tokenizer repositories

tokenizer

140
Stars
23
Forks

[DISCONTINUED] Source code tokenizer

html5gum

147
Stars
11
Forks

A WHATWG-compliant HTML5 tokenizer and tag soup parser

JapaneseTokenizers

136
Stars
20
Forks

Aims to make Japanese tokenizers as easy to use as possible

SoMaJo

134
Stars
20
Forks

A tokenizer and sentence splitter for German and English web and social media texts.

vaporetto

218
Stars
10
Forks

🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer

peast

169
Stars
20
Forks

JavaScript parser written in PHP that generates an AST from your code according to the ECMAScript specification

rust-tokenizers

275
Stars
26
Forks

rust-tokenizers offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE), and Unigram (SentencePiece) models

syntok

197
Stars
33
Forks

Text tokenization and sentence segmentation (segtok v2)

sentence-splitter

218
Stars
29
Forks

Text-to-sentence splitter using the heuristic algorithm by Philipp Koehn and Josh Schroeder.