AndriaK issues

Results: 4 issues of AndriaK

Faster Whitespace PreTokenizer (Drop-in Replacement)

🚀 Faster Whitespace PreTokenizer (Drop-in Replacement) This PR replaces the current Whitespace pre-tokenizer implementation with an optimized version that achieves consistent 10–30% performance improvements across short, medium, and long inputs...

Cannot download test data: 'make test' and direct links fail with "Repository not found" / 404

**Describe the bug** I am unable to download the necessary test data for `added_tokens.rs` and other integration tests. Running `cargo test --test added_tokens` results in "Files not found" errors, specifically:...

Proposal: Replace regex in `whitespace.rs` with manual code for speed improvements

Hi Hugging Face Tokenizers Team, I’ve been exploring the `whitespace.rs` pre-tokenizer code and noticed it relies on regex for splitting tokens by whitespace. I wanted to propose replacing this regex-based...

Proposal: Faster `Whitespace` PreTokenizer Implementation (10–30% Speedup)

Hi Hugging Face team 👋, I’d like to propose replacing the current Whitespace PreTokenizer in tokenizers with a faster implementation I developed. It achieves consistent 10–30% performance improvements across short,...