bpe topic
SubwordEncoding-CWS
Subword Encoding in Lattice LSTM for Chinese Word Segmentation
subword-nmt
Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
YouTokenToMe
Unsupervised text tokenizer focused on computational efficiency
nlp_made_easy
Explains NLP building blocks in a simple manner.
phishytics-machine-learning-for-phishing
Machine Learning for Phishing Website Detection
piecelearn
Learning BPE embeddings by first learning a segmentation model and then training word2vec
gpt-tokenizer
The fastest JavaScript BPE tokenizer (encoder/decoder) for OpenAI's GPT-2 / GPT-3 / GPT-4 / GPT-4o. Port of OpenAI's tiktoken with additional features.
tiktoken-rs
Ready-made tokenizer library for working with GPT and tiktoken