
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Results: 407 tokenizers issues

Hi! I see that the following file declares `BertWordPieceTokenizer`, which cannot be found in the documentation: maybe it's me, or a PR was accepted without the documentation being checked? I...

If a sentence is tokenized with the XLM-RoBERTa fast tokenizer, the offset mapping is off by one when one of the subwords is only a space. Example: > from transformers import...
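Independent of the XLM-RoBERTa bug itself, the invariant a correct offset mapping must satisfy can be sketched in plain Python. This uses a whitespace tokenizer as a hypothetical stand-in for the fast tokenizer (no transformers dependency); an off-by-one like the one reported breaks the final assertion.

```python
# Sketch of the offset-mapping invariant: each (start, end) entry must
# slice the original sentence back to exactly that token's text.
def whitespace_tokenize_with_offsets(sentence):
    tokens, offsets, start = [], [], 0
    for token in sentence.split(" "):
        end = start + len(token)
        tokens.append(token)
        offsets.append((start, end))
        start = end + 1  # skip the separating space
    return tokens, offsets

sentence = "offsets must round trip"
tokens, offsets = whitespace_tokenize_with_offsets(sentence)
for token, (start, end) in zip(tokens, offsets):
    # If any offset were shifted by one, this round trip would fail.
    assert sentence[start:end] == token
```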

Update derive_builder to 0.10 in Cargo.toml. Related to issue #1029

Tokenizers currently uses derive_builder 0.9 in Rust. A 0.10 release is out that should not introduce any breaking changes. It would be nice to get it updated.
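The bump described above would be a one-line change in the crate's `Cargo.toml`. A minimal sketch (the surrounding dependency entries are omitted; the exact section layout in the tokenizers crate is an assumption):

```toml
[dependencies]
# Bumped from "0.9"; per the issue, 0.10 should be a drop-in upgrade.
derive_builder = "0.10"
```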

Hi @n1t0 , I want to represent numbers with the token `[NUM]`. A specific regex normalizer which keeps only alphanumeric characters causes the tokenizer not to identify this token. This is...
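The interaction described above can be shown with plain `re` rather than the tokenizers normalizer API. This is a minimal sketch; `keep_alphanumeric` is a hypothetical stand-in for the issue's regex normalizer.

```python
import re

# A normalizer that keeps only alphanumeric characters and spaces,
# similar to the regex normalizer described in the issue.
def keep_alphanumeric(text: str) -> str:
    return re.sub(r"[^0-9A-Za-z ]", "", text)

special = "[NUM]"
sentence = "the price is [NUM] dollars"

normalized = keep_alphanumeric(sentence)
print(normalized)  # "the price is NUM dollars"

# The brackets are stripped, so the literal special token no longer
# appears in the normalized text and cannot be matched as one unit.
print(special in sentence)    # True
print(special in normalized)  # False
```

Because normalization runs before special-token matching here, the bracket characters the token relies on are gone by the time matching happens.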

Tokenizers supports `return_overflowing_tokens=True`, which yields multiple token sequences per input string. When used under `Dataset.map`, this requires dropping the original columns, as documented at https://huggingface.co/docs/datasets/about_map_batch . This means that the...
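The column-dropping requirement follows from a row-count mismatch, which can be sketched without the datasets library at all. `tokenize_with_overflow` below is a hypothetical stand-in for a tokenizer call with `return_overflowing_tokens=True`.

```python
# Sketch of why overflowing tokenization forces dropping the original
# columns under a map: one input row can produce several output rows,
# so the old columns no longer line up one-to-one with the new ones.
rows = [{"text": "short"}, {"text": "a very long input"}]

def tokenize_with_overflow(row):
    # Pretend inputs longer than 10 characters overflow into 2 chunks.
    n_chunks = 2 if len(row["text"]) > 10 else 1
    return [{"input_ids": [i]} for i in range(n_chunks)]

out = [chunk for row in rows for chunk in tokenize_with_overflow(row)]
print(len(rows), len(out))  # 2 rows in, 3 rows out: lengths differ
```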

Hi, I find that the tokenizers for OPT models may have a wrong "special_tokens_map":

```python
>>> from transformers import GPT2Tokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained("facebook/opt-350m")
>>> tokenizer.special_tokens_map
{'bos_token': '', 'eos_token': '', 'unk_token':...
```

I am trying to tokenize text by loading a local vocab with Hugging Face:

```python
vocab_path = '....'  ## have a local vocab path
tokenizer = BertWordPieceTokenizer(os.path.join(vocab_path, "vocab.txt"), lowercase=False)
text = 'The...
```

Wasm support would be a nice feature.

- [ ] Add a feature flag `wasm`.
- [ ] Use `esaxx_rs::suffix_rs` instead of `esaxx_rs::suffix` (maybe with a change in the crate...

Hi there, I tried to use the function `.get_new_tokens()` on a very simple example but got this error: `TypeError: Can't convert to Sequence`. The code is: from transformers_domain_adaptation import VocabAugmentor...