
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Results: 407 tokenizers issues

Hi! I see that the following file declares `BertWordPieceTokenizer`, which cannot be found in the documentation: maybe it's me, or a PR was accepted without the documentation being checked? I...

If a sentence is tokenized with the XLM-RoBERTa fast tokenizer, the offset mapping is off by one when one of the subwords is only a space. Example: > from transformers import...
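Independent of the XLM-RoBERTa bug itself, the invariant a correct offset mapping must satisfy can be sketched in plain Python. This uses a whitespace tokenizer as a hypothetical stand-in for the fast tokenizer (no transformers dependency); an off-by-one like the one reported breaks the final assertion.

```python
# Sketch of the offset-mapping invariant: each (start, end) entry must
# slice the original sentence back to exactly that token's text.
def whitespace_tokenize_with_offsets(sentence):
    tokens, offsets, start = [], [], 0
    for token in sentence.split(" "):
        end = start + len(token)
        tokens.append(token)
        offsets.append((start, end))
        start = end + 1  # skip the separating space
    return tokens, offsets

sentence = "offsets must round trip"
tokens, offsets = whitespace_tokenize_with_offsets(sentence)
for token, (start, end) in zip(tokens, offsets):
    # If any offset were shifted by one, this round trip would fail.
    assert sentence[start:end] == token
```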

Update derive_builder to 0.10 in Cargo.toml. Related to issue #1029

Tokenizers currently uses derive_builder 0.9 in Rust. A 0.10 release is out that should not introduce any breaking changes. It would be nice to get it updated.
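The bump described above would be a one-line change in the crate's `Cargo.toml`. A minimal sketch (the surrounding dependency entries are omitted; the exact section layout in the tokenizers crate is an assumption):

```toml
[dependencies]
# Bumped from "0.9"; per the issue, 0.10 should be a drop-in upgrade.
derive_builder = "0.10"
```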

Hi @n1t0 , I want to represent numbers with the token `[NUM]`. A specific regex normalizer which keeps only alphanumeric characters causes the tokenizer not to identify this token. This is...
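The interaction described above can be shown with plain `re` rather than the tokenizers normalizer API. This is a minimal sketch; `keep_alphanumeric` is a hypothetical stand-in for the issue's regex normalizer.

```python
import re

# A normalizer that keeps only alphanumeric characters and spaces,
# similar to the regex normalizer described in the issue.
def keep_alphanumeric(text: str) -> str:
    return re.sub(r"[^0-9A-Za-z ]", "", text)

special = "[NUM]"
sentence = "the price is [NUM] dollars"

normalized = keep_alphanumeric(sentence)
print(normalized)  # "the price is NUM dollars"

# The brackets are stripped, so the literal special token no longer
# appears in the normalized text and cannot be matched as one unit.
print(special in sentence)    # True
print(special in normalized)  # False
```

Because normalization runs before special-token matching here, the bracket characters the token relies on are gone by the time matching happens.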

Tokenizers supports `return_overflowing_tokens=True`, which yields multiple token sequences per input string. When used under `Dataset.map`, this requires dropping the original columns, as documented at https://huggingface.co/docs/datasets/about_map_batch . This means that the...
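The column-dropping requirement follows from a row-count mismatch, which can be sketched without the datasets library at all. `tokenize_with_overflow` below is a hypothetical stand-in for a tokenizer call with `return_overflowing_tokens=True`.

```python
# Sketch of why overflowing tokenization forces dropping the original
# columns under a map: one input row can produce several output rows,
# so the old columns no longer line up one-to-one with the new ones.
rows = [{"text": "short"}, {"text": "a very long input"}]

def tokenize_with_overflow(row):
    # Pretend inputs longer than 10 characters overflow into 2 chunks.
    n_chunks = 2 if len(row["text"]) > 10 else 1
    return [{"input_ids": [i]} for i in range(n_chunks)]

out = [chunk for row in rows for chunk in tokenize_with_overflow(row)]
print(len(rows), len(out))  # 2 rows in, 3 rows out: lengths differ
```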

Hi, I find that the tokenizers for OPT models may have a wrong "special_tokens_map":

```python
>>> from transformers import GPT2Tokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained("facebook/opt-350m")
>>> tokenizer.special_tokens_map
{'bos_token': '', 'eos_token': '', 'unk_token':...
```

I am trying to tokenize text by loading a local vocab with Hugging Face:

```python
vocab_path = '....'  ## have a local vocab path
tokenizer = BertWordPieceTokenizer(os.path.join(vocab_path, "vocab.txt"), lowercase=False)
text = 'The...
```

Wasm support would be a nice feature.

- [ ] Add a feature flag `wasm`.
- [ ] Use `esaxx_rs::suffix_rs` instead of `esaxx_rs::suffix` (maybe with a change in the crate...

Hi there, I tried to use the function `.get_new_tokens()` on a very simple example but got this error: `TypeError: Can't convert to Sequence`. The code is: from transformers_domain_adaptation import VocabAugmentor...