AndriaK

4 issues by AndriaK

🚀 Faster Whitespace PreTokenizer (Drop-in Replacement): This PR replaces the current Whitespace pre-tokenizer implementation with an optimized version that achieves consistent 10–30% performance improvements across short, medium, and long inputs...

**Describe the bug** I am unable to download the necessary test data for `added_tokens.rs` and other integration tests. Running `cargo test --test added_tokens` results in "Files not found" errors, specifically:...

Hi Hugging Face Tokenizers Team, I’ve been exploring the `whitespace.rs` pre-tokenizer code and noticed it relies on regex for splitting tokens by whitespace. I wanted to propose replacing this regex-based...

Hi Hugging Face team 👋, I’d like to propose replacing the current Whitespace PreTokenizer in tokenizers with a faster implementation I developed. It achieves consistent 10–30% performance improvements across short,...