Proposal: Faster `Whitespace` PreTokenizer Implementation (10–30% Speedup)
Hi Hugging Face team,
I'd like to propose replacing the current `Whitespace` PreTokenizer in `tokenizers` with a faster implementation I developed. It achieves consistent 10–30% performance improvements across short, medium, and long inputs while preserving identical output behavior.
Why This Matters
`Whitespace` is a foundational component used in many pipelines, especially in LLM pretraining, tokenization benchmarks, and inference preprocessing. Any improvement here brings a compounding benefit at scale, especially in multi-threaded, batched workflows.
Benchmarks (Criterion)
I benchmarked both implementations with Criterion across multiple runs; the timings below were consistent from run to run:
Inputs
- Short: e.g., "Hello world!" (~10–20 characters)
- Medium: typical sentences (~100–150 characters)
- Long: paragraphs or documents (~5,000+ characters)
Optimized Version (mine)
| Input Type | Avg. Time | Speedup vs. current |
|---|---|---|
| Short | 549–559 ns | 10–15% faster |
| Medium | 3.86–4.01 µs | 5–30% faster |
| Long | 50.8–71 µs | 5–15% faster, more stable |
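For context, here is a minimal sketch of the kind of Criterion harness behind these numbers. The inputs and benchmark names are illustrative, and it assumes the crate's `Whitespace`, `PreTokenizedString`, and `PreTokenizer` items at their usual paths:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use tokenizers::pre_tokenizers::whitespace::Whitespace;
use tokenizers::tokenizer::{PreTokenizedString, PreTokenizer};

fn bench_whitespace(c: &mut Criterion) {
    let pretok = Whitespace {};
    // Illustrative inputs; the real runs used short/medium/long corpora.
    let inputs = [
        ("short", "Hello world!"),
        ("medium", "The quick brown fox jumps over the lazy dog, twice."),
    ];
    for (name, text) in inputs {
        c.bench_function(&format!("whitespace_{name}"), |b| {
            b.iter(|| {
                // Rebuild the PreTokenizedString each iteration so every
                // measurement starts from the same raw input.
                let mut pretokenized = PreTokenizedString::from(black_box(text));
                pretok.pre_tokenize(&mut pretokenized).unwrap();
            })
        });
    }
}

criterion_group!(benches, bench_whitespace);
criterion_main!(benches);
```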
Output Compatibility
- Produces the same pre-tokenization splits as the original
- Word boundaries, punctuation, and whitespace are handled identically
- Includes unit tests that confirm offset and string correctness (an example in that style is sketched after this list)
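As an illustration, the tests look roughly like the sketch below. The input string and expected offsets are illustrative, and it assumes `PreTokenizedString::get_splits` with the crate's `OffsetReferential`/`OffsetType` enums:

```rust
use tokenizers::pre_tokenizers::whitespace::Whitespace;
use tokenizers::tokenizer::{
    OffsetReferential, OffsetType, PreTokenizedString, PreTokenizer,
};

#[test]
fn matches_original_splits_and_offsets() {
    let pretok = Whitespace {};
    let mut pretokenized = PreTokenizedString::from("Hey friend!     How are you?!?");
    pretok.pre_tokenize(&mut pretokenized).unwrap();

    // Compare (substring, byte offsets) pairs against what the current
    // regex-based implementation produces for the same input.
    let splits: Vec<(&str, (usize, usize))> = pretokenized
        .get_splits(OffsetReferential::Original, OffsetType::Byte)
        .into_iter()
        .map(|(s, o, _)| (s, o))
        .collect();

    assert_eq!(
        splits,
        vec![
            ("Hey", (0, 3)),
            ("friend", (4, 10)),
            ("!", (10, 11)),
            ("How", (16, 19)),
            ("are", (20, 23)),
            ("you", (24, 27)),
            ("?!?", (27, 30)),
        ]
    );
}
```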
Technical Summary
- Replaces regex-based character matching with a manual `char_indices()` loop (sketched after this list)
- Classifies spans as word, whitespace, or punctuation without allocations
- No external dependencies
- Cleaner and more cache-friendly structure
- Fully backward compatible, including `impl_serde_type!`
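To make the approach concrete, here is a minimal, self-contained sketch of the span-classification loop. This is not the PR code itself, and `is_alphanumeric() || c == '_'` stands in for the `\w` class of the current implementation's `\w+|[^\w\s]+` pattern:

```rust
/// Character classes the loop distinguishes.
#[derive(PartialEq, Clone, Copy)]
enum Class {
    Word,
    Space,
    Punct,
}

fn classify(c: char) -> Class {
    // `is_alphanumeric() || c == '_'` approximates the regex `\w` class.
    if c.is_whitespace() {
        Class::Space
    } else if c.is_alphanumeric() || c == '_' {
        Class::Word
    } else {
        Class::Punct
    }
}

/// Single pass over the string: whenever the character class changes, close
/// the previous run. Word and punctuation runs become (substring, byte span)
/// tokens; whitespace runs are skipped. No regex engine, and no allocations
/// beyond the output vector.
fn whitespace_spans(s: &str) -> Vec<(&str, (usize, usize))> {
    let mut spans = Vec::new();
    let mut start = 0;
    let mut current = Class::Space;
    for (i, c) in s.char_indices() {
        let class = classify(c);
        if class != current {
            if current != Class::Space {
                spans.push((&s[start..i], (start, i)));
            }
            start = i;
            current = class;
        }
    }
    if current != Class::Space {
        spans.push((&s[start..], (start, s.len())));
    }
    spans
}

fn main() {
    // Prints: "Hello" @ 0..5, "," @ 5..6, "world" @ 7..12, "!" @ 12..13
    for (tok, (start, end)) in whitespace_spans("Hello, world!") {
        println!("{tok:?} @ {start}..{end}");
    }
}
```

On `"Hello, world!"` this yields the same splits as the regex `\w+|[^\w\s]+`, while touching each character exactly once.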
Integration Options
I'd be happy to:
- Submit a PR replacing the current implementation
- Or submit it alongside as `WhitespaceFast` for side-by-side evaluation
Thanks again for maintaining this fantastic library. Let me know your preference and I'll submit the PR accordingly!
Best,
AndriaK
Yeah, super happy to review your PR!!
Hi @ArthurZucker,
Thanks for the encouragement! I've just submitted the PR here: https://github.com/huggingface/tokenizers/pull/1822, titled "Faster Whitespace PreTokenizer (Drop-in Replacement)".
It includes:
- Identical behavior to the original
- 10–30% performance gains across input lengths
- Full unit test coverage
Would love your feedback when you have a moment.
Best, AndriaK