
Proposal: Faster `Whitespace` PreTokenizer Implementation (10–30% Speedup)

Open • 8ria opened this issue 5 months ago • 2 comments

Hi Hugging Face team 👋,

I'd like to propose replacing the current `Whitespace` PreTokenizer in tokenizers with a faster implementation I developed. It achieves consistent 10–30% performance improvements across short, medium, and long inputs, while preserving identical output behavior.


🚀 Why This Matters

`Whitespace` is a foundational component used in many pipelines: LLM pretraining, tokenization benchmarks, and inference preprocessing. Any improvement here compounds at scale, especially in multi-threaded, batched workflows.


⚡ Benchmarks (Criterion)

I benchmarked both implementations across multiple runs with consistent patterns:

🧪 Inputs

  • Short: e.g., "Hello world!" (~10–20 characters)

  • Medium: typical sentences (~100–150 characters)

  • Long: paragraphs or documents (~5,000+ characters)


✅ Optimized Version (mine)

| Input Type | Time (avg)   | Change                    |
|------------|--------------|---------------------------|
| Short      | 549–559 ns   | 10–15% faster             |
| Medium     | 3.86–4.01 µs | 5–30% faster              |
| Long       | 50.8–71 µs   | 5–15% faster, more stable |

🧬 Output Compatibility

  • Produces the same pre-tokenization splits as the original

  • Word boundaries, punctuation, and whitespace are handled identically

  • Includes unit tests that confirm offset and string correctness


🔧 Technical Summary

  • Replaces regex-based character matching with a manual char_indices() loop

  • Classifies spans as word, whitespace, or punctuation without allocations

  • No external dependencies

  • Cleaner and more cache-friendly structure

  • Fully backward compatible, including impl_serde_type!
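
To make the approach above concrete, here is a minimal sketch of a single-pass `char_indices()` splitter. This is NOT the code from the PR: the names `whitespace_split`, `Class`, and `classify` are invented here for illustration, and `char::is_alphanumeric` only approximates the regex `\w` class (some Unicode characters may classify differently than under the original `\w+|[^\w\s]+` pattern):

```rust
// Sketch of the described technique: walk the string once, classify each
// character as word / whitespace / punctuation, and emit (substring,
// byte-offset) spans, skipping whitespace, with no regex engine and no
// allocations beyond the output vector.

#[derive(PartialEq)]
enum Class {
    Word,  // roughly regex `\w`: alphanumeric or underscore (approximation)
    Space, // regex `\s`: whitespace, never emitted
    Punct, // regex `[^\w\s]`: everything else
}

fn classify(c: char) -> Class {
    if c.is_alphanumeric() || c == '_' {
        Class::Word
    } else if c.is_whitespace() {
        Class::Space
    } else {
        Class::Punct
    }
}

/// Hypothetical helper returning (substring, (start, end)) byte-offset spans.
fn whitespace_split(s: &str) -> Vec<(&str, (usize, usize))> {
    let mut out = Vec::new();
    let mut open: Option<(usize, Class)> = None; // (start offset, span class)
    for (i, c) in s.char_indices() {
        let class = classify(c);
        let same = matches!(&open, Some((_, k)) if *k == class);
        if !same {
            // Character class changed: close the previous span, if any.
            if let Some((begin, k)) = open.take() {
                if k != Class::Space {
                    out.push((&s[begin..i], (begin, i)));
                }
            }
            open = Some((i, class));
        }
    }
    // Close the final span at end of input.
    if let Some((begin, k)) = open {
        if k != Class::Space {
            out.push((&s[begin..], (begin, s.len())));
        }
    }
    out
}

fn main() {
    for (tok, (start, end)) in whitespace_split("Hello, world!") {
        println!("{:?} @ {}..{}", tok, start, end);
    }
}
```

Because consecutive same-class characters simply extend the open span, a run of word characters or punctuation comes out as one token, mirroring the greedy `+` quantifiers in the original regex.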


📦 Integration Options

I'd be happy to:

  • Submit a PR replacing the current implementation

  • Or submit it alongside the original as `WhitespaceFast` for side-by-side evaluation


Thanks again for maintaining this fantastic library. Let me know your preference and I'll submit the PR accordingly! 🤗

Best,
AndriaK

8ria • Jul 07 '25 07:07

Yeah, super happy to review your PR!! 🤗

ArthurZucker • Jul 07 '25 09:07

Hi @ArthurZucker 👋

Thanks for the encouragement! I've just submitted the PR here: https://github.com/huggingface/tokenizers/pull/1822 ("Faster Whitespace PreTokenizer (Drop-in Replacement)").

It includes:

  • Identical behavior to the original
  • 10–30% performance gains across input lengths
  • Full unit test coverage

Would love your feedback when you have a moment 🙏

Best, AndriaK

8ria • Jul 07 '25 11:07