Nicolas Patry
Ok, I see what you mean, and indeed, if the tokenizer already knows about the padding value, it's definitely something to consider in terms of internal information not leaking as...
Hi @tanmaylaud, can you provide a script that triggered the error, or some more context? Without it, it's a bit hard to help. Cheers
The error seems to be located in the esaxx (suffix array) call, which most likely means a C++ error (the code was taken directly from sentencepiece for this one). Could you add...
Ok, it's what I said: "Internal error" means the error occurs within the C++ code. Can you build from source?
```
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
pip install -e...
```
I expect this error to be linked to overflowing `i32`; unfortunately, we don't support `u64` in `tokenizers`. If you are able to rebuild, you could look into making a...
The first one is to make sure that's the case; the second would be to solve it. My first priority would be to confirm the intuition is correct, and only...
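If you want to check the intuition before rebuilding, here is a rough sketch (not part of the library; the byte-count heuristic and the `corpus.txt` path are assumptions) that compares the corpus size against the `i32` range:

```python
# Minimal sketch: estimate whether the training corpus is large enough to
# overflow an i32 index inside the suffix-array (esaxx) step. Whether the
# overflow is driven by bytes or characters is an assumption here.
I32_MAX = 2**31 - 1

total_bytes = 0
with open("corpus.txt", "rb") as f:  # hypothetical corpus file
    for line in f:
        total_bytes += len(line)

print(f"corpus size: {total_bytes} bytes")
if total_bytes > I32_MAX:
    print("corpus exceeds the i32 range -- an overflow in esaxx is plausible")
```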
Do you mind also sharing your tokenizer config? (pre_tokenizers, normalizers, etc.) They have a big impact on the numbers, and therefore on the probability of overflow.
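For context on why the config matters: a `pre_tokenizer` like `Whitespace` splits the input into short fragments before the suffix-array step ever sees it, which keeps the internal counts small. A minimal sketch (the `Unigram` model and `NFKC` normalizer are illustrative choices, not a recommendation):

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers

tokenizer = Tokenizer(models.Unigram())
tokenizer.normalizer = normalizers.NFKC()
# Whitespace pre-tokenization caps each string fed to the suffix-array
# step at one "word", so the indices it builds stay small.
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
```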
Seems it's not as innocuous as it appears. :(
Hi @ulyanaisaeva @codemurt, this is one of the shadier parts of this library, unfortunately. `tokenizer.train` actually uses another object called the `trainer`, which might not see some of...
This example seems to work correctly:
```python
from tokenizers import Tokenizer, models, pre_tokenizers, normalizers, trainers
import datasets

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i +...
```
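Since the snippet above is cut off, here is a hedged sketch of how the rest of such a script typically looks; the dataset, vocab size, and special tokens are illustrative assumptions. The point from the comment above is that training options such as `special_tokens` belong on the `trainer` object, which is then handed to `train_from_iterator`:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
import datasets

# Illustrative dataset; any datasets.Dataset with a "text" column works.
dataset = datasets.load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Special tokens must be declared on the trainer, not on the tokenizer,
# since tokenizer.train* delegates the actual training to this object.
trainer = trainers.BpeTrainer(vocab_size=25_000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))
```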