tokenization topic

List of tokenization repositories

trankit

716 Stars, 97 Forks

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

ekphrasis

660 Stars, 92 Forks

Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashta...

Tokenizer

268 Stars, 66 Forks

Fast and customizable text tokenization library with BPE and SentencePiece support

YouTokenToMe

945 Stars, 95 Forks

Unsupervised text tokenizer focused on computational efficiency

TokenScript

239 Stars, 69 Forks

TokenScript schema, specs and paper

Python_Natural_Language_Processing

190 Stars, 173 Forks

This repository consists of a complete guide on natural language processing (NLP) in Python where we'll learn various techniques for implementing NLP including parsing & text processing and understand...

vtext

147 Stars, 11 Forks

Simple NLP in Rust with Python bindings

datacamp-python-data-science-track

758 Stars, 526 Forks

All the slides, accompanying code, and exercises are stored in this repo. 🎈

NLP-Cube

552 Stars, 93 Forks

Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing

vaporetto

218 Stars, 10 Forks

🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer