tokenizers

Feature request: Characters delimiter argument

Open · VasLem opened this issue 2 months ago · 1 comment

I wish to develop a k-mer-character-based BPE tokenizer using your beautiful Rust package, for genomic applications. Unfortunately, it doesn't seem to support defining a character delimiter. As I see it, this is a pretty straightforward change: instead of iterating over a word character by character, first split it on the delimiter and then iterate. Also, when merges are computed, the character delimiter should be taken into account in the string representation. That way, multi-character word splitting would become feasible.

Right now I am using a modified Python version of the BPE tokenizer made by the genius Yikai-Liao, but it would be nice to see this in Rust as well, natively supported by huggingface. Unfortunately, I am still a novice at Rust, otherwise I would open a pull request with the suggested changes. Is this something that could be worked on in the future? Or is there a way to do this with the current implementation? Thank you!
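To illustrate what I mean, here is a rough sketch in plain Python (the delimiter, k-mer size, and sequence are all made up for the example):

# A "word" here is a sequence of k-mers joined by a delimiter.
word = "ATG|CGT|TAA"
delimiter = "|"

# Current BPE behavior: the word is iterated character by character,
# so 'A', 'T', 'G', '|', ... become the base symbols.
chars = list(word)

# Requested behavior: split on the delimiter first, so that each k-mer
# is a single base symbol that merges can operate on.
symbols = word.split(delimiter)
print(symbols)  # ['ATG', 'CGT', 'TAA']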

VasLem · Nov 13, 2025

Unless I misunderstand, I think this is already supported. You can split on a Regex or on a literal string using the Split pre-tokenizer:

from tokenizers import Regex
from tokenizers.pre_tokenizers import Split

pretokenizer = Split(Regex(r"\W"), behavior="isolated")

print(pretokenizer.pre_tokenize_str("hello,,,,,"))
# Output:
# [('hello', (0, 5)),
#  (',', (5, 6)),
#  (',', (6, 7)),
#  (',', (7, 8)),
#  (',', (8, 9)),
#  (',', (9, 10))]
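If you want to use this when training a BPE model, something along these lines should work (the vocab size, delimiter, and toy corpus below are placeholders, not a recommendation):

from tokenizers import Tokenizer, Regex
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Split
from tokenizers.trainers import BpeTrainer

# Toy corpus of k-mers separated by a space delimiter (placeholder data).
corpus = ["ATG CGT TAA", "GGC ATG CGT"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Split on the delimiter and drop it, so each k-mer becomes its own pre-token.
tokenizer.pre_tokenizer = Split(Regex(r"\s+"), behavior="removed")

trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=1000)
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("ATG CGT").tokens)

Note that BPE merges are still computed within each pre-token, so whether this matches the k-mer-as-symbol behavior you describe depends on how you set things up.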

stephantul · Nov 28, 2025