Proposal: Faster `Whitespace` PreTokenizer Implementation (10–30% Speedup)
Hi Hugging Face team,
I'd like to propose replacing the current `Whitespace` PreTokenizer in `tokenizers` with a faster implementation I developed. It achieves consistent 10–30% performance improvements across short, medium, and long inputs while preserving identical output behavior.
Why This Matters
`Whitespace` is a foundational component used in many pipelines, especially in LLM pretraining, tokenization benchmarks, and inference preprocessing. Any improvement here brings a compounding benefit at scale, especially in multi-threaded, batched workflows.
Benchmarks (Criterion)
I benchmarked both implementations with Criterion across multiple runs; the timings below were consistent from run to run:
Inputs
- Short: e.g., "Hello world!" (~10–20 characters)
- Medium: typical sentences (~100–150 characters)
- Long: paragraphs or documents (~5,000+ characters)
Optimized Version (mine)
| Input Type | Avg. Time | Speedup vs. current |
|---|---|---|
| Short | 549–559 ns | 10–15% faster |
| Medium | 3.86–4.01 µs | 5–30% faster |
| Long | 50.8–71 µs | 5–15% faster, more stable |
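For context, here is a minimal sketch of the kind of Criterion harness behind these numbers. The inputs and benchmark names are illustrative, and it assumes the crate's `Whitespace`, `PreTokenizedString`, and `PreTokenizer` items at their usual paths:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use tokenizers::pre_tokenizers::whitespace::Whitespace;
use tokenizers::tokenizer::{PreTokenizedString, PreTokenizer};

fn bench_whitespace(c: &mut Criterion) {
    let pretok = Whitespace {};
    // Illustrative inputs; the real runs used short/medium/long corpora.
    let inputs = [
        ("short", "Hello world!"),
        ("medium", "The quick brown fox jumps over the lazy dog, twice."),
    ];
    for (name, text) in inputs {
        c.bench_function(&format!("whitespace_{name}"), |b| {
            b.iter(|| {
                // Rebuild the PreTokenizedString each iteration so every
                // measurement starts from the same raw input.
                let mut pretokenized = PreTokenizedString::from(black_box(text));
                pretok.pre_tokenize(&mut pretokenized).unwrap();
            })
        });
    }
}

criterion_group!(benches, bench_whitespace);
criterion_main!(benches);
```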
Output Compatibility
- Produces the same pre-tokenization splits as the original
- Word boundaries, punctuation, and whitespace are handled identically
- Includes unit tests that confirm offset and string correctness (an example in that style is sketched after this list)
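As an illustration, the tests look roughly like the sketch below. The input string and expected offsets are illustrative, and it assumes `PreTokenizedString::get_splits` with the crate's `OffsetReferential`/`OffsetType` enums:

```rust
use tokenizers::pre_tokenizers::whitespace::Whitespace;
use tokenizers::tokenizer::{
    OffsetReferential, OffsetType, PreTokenizedString, PreTokenizer,
};

#[test]
fn matches_original_splits_and_offsets() {
    let pretok = Whitespace {};
    let mut pretokenized = PreTokenizedString::from("Hey friend!     How are you?!?");
    pretok.pre_tokenize(&mut pretokenized).unwrap();

    // Compare (substring, byte offsets) pairs against what the current
    // regex-based implementation produces for the same input.
    let splits: Vec<(&str, (usize, usize))> = pretokenized
        .get_splits(OffsetReferential::Original, OffsetType::Byte)
        .into_iter()
        .map(|(s, o, _)| (s, o))
        .collect();

    assert_eq!(
        splits,
        vec![
            ("Hey", (0, 3)),
            ("friend", (4, 10)),
            ("!", (10, 11)),
            ("How", (16, 19)),
            ("are", (20, 23)),
            ("you", (24, 27)),
            ("?!?", (27, 30)),
        ]
    );
}
```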
Technical Summary
- Replaces regex-based character matching with a manual `char_indices()` loop (sketched after this list)
- Classifies spans as word, whitespace, or punctuation without allocations
- No external dependencies
- Cleaner and more cache-friendly structure
- Fully backward compatible, including `impl_serde_type!`
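To make the approach concrete, here is a minimal, self-contained sketch of the span-classification loop. This is not the PR code itself, and `is_alphanumeric() || c == '_'` stands in for the `\w` class of the current implementation's `\w+|[^\w\s]+` pattern:

```rust
/// Character classes the loop distinguishes.
#[derive(PartialEq, Clone, Copy)]
enum Class {
    Word,
    Space,
    Punct,
}

fn classify(c: char) -> Class {
    // `is_alphanumeric() || c == '_'` approximates the regex `\w` class.
    if c.is_whitespace() {
        Class::Space
    } else if c.is_alphanumeric() || c == '_' {
        Class::Word
    } else {
        Class::Punct
    }
}

/// Single pass over the string: whenever the character class changes, close
/// the previous run. Word and punctuation runs become (substring, byte span)
/// tokens; whitespace runs are skipped. No regex engine, and no allocations
/// beyond the output vector.
fn whitespace_spans(s: &str) -> Vec<(&str, (usize, usize))> {
    let mut spans = Vec::new();
    let mut start = 0;
    let mut current = Class::Space;
    for (i, c) in s.char_indices() {
        let class = classify(c);
        if class != current {
            if current != Class::Space {
                spans.push((&s[start..i], (start, i)));
            }
            start = i;
            current = class;
        }
    }
    if current != Class::Space {
        spans.push((&s[start..], (start, s.len())));
    }
    spans
}

fn main() {
    // Prints: "Hello" @ 0..5, "," @ 5..6, "world" @ 7..12, "!" @ 12..13
    for (tok, (start, end)) in whitespace_spans("Hello, world!") {
        println!("{tok:?} @ {start}..{end}");
    }
}
```

On `"Hello, world!"` this yields the same splits as the regex `\w+|[^\w\s]+`, while touching each character exactly once.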
Integration Options
I'd be happy to:
- Submit a PR replacing the current implementation
- Or submit it alongside as `WhitespaceFast` for side-by-side evaluation
Thanks again for maintaining this fantastic library. Let me know your preference and I'll submit the PR accordingly!
Best,
AndriaK
Yeah, super happy to review your PR!!
Hi @ArthurZucker,
Thanks for the encouragement! I've just submitted the PR here: https://github.com/huggingface/tokenizers/pull/1822, titled "Faster Whitespace PreTokenizer (Drop-in Replacement)".
It includes:
- Identical behavior to the original
- 10–30% performance gains across input lengths
- Full unit test coverage
Would love your feedback when you have a moment.
Best, AndriaK