Nicolas Patry
Ok, I see what you mean, and indeed, if the tokenizer already knows about the padding value, it's definitely something to consider in terms of internal information not leaking as...
Hi @tanmaylaud, can you provide a script that triggered the error, or some more context? Without it, it's a bit hard to help. Cheers
The error seems to be located in the esaxx (suffix array) call, which most likely means a C++ error (the code was taken directly from sentencepiece for this one). Could you add...
Ok, it's what I said: "Internal error" means the error occurs within the C++ code. Can you build from source?
```
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
pip install -e...
```
I expect this error to be linked to overflowing `i32`; unfortunately, we don't support `u64` in `tokenizers`. If you are able to rebuild, you could look into making a...
The first one is to make sure that's the case; the second would be to solve it. My first priority would be to confirm the intuition is correct, and only...
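If you want to check the intuition before rebuilding, here is a rough sketch (not part of the library; the byte-count heuristic and the `corpus.txt` path are assumptions) that compares the corpus size against the `i32` range:

```python
# Minimal sketch: estimate whether the training corpus is large enough to
# overflow an i32 index inside the suffix-array (esaxx) step. Whether the
# overflow is driven by bytes or characters is an assumption here.
I32_MAX = 2**31 - 1

total_bytes = 0
with open("corpus.txt", "rb") as f:  # hypothetical corpus file
    for line in f:
        total_bytes += len(line)

print(f"corpus size: {total_bytes} bytes")
if total_bytes > I32_MAX:
    print("corpus exceeds the i32 range -- an overflow in esaxx is plausible")
```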
Do you mind also sharing your tokenizer config? (pre_tokenizers, normalizers, etc.) They have a big impact on the numbers, and therefore on the probability of overflow.
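For context on why the config matters: a `pre_tokenizer` like `Whitespace` splits the input into short fragments before the suffix-array step ever sees it, which keeps the internal counts small. A minimal sketch (the `Unigram` model and `NFKC` normalizer are illustrative choices, not a recommendation):

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers

tokenizer = Tokenizer(models.Unigram())
tokenizer.normalizer = normalizers.NFKC()
# Whitespace pre-tokenization caps each string fed to the suffix-array
# step at one "word", so the indices it builds stay small.
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
```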
Seems it's not as innocuous as it appears. :(
Hi @ulyanaisaeva @codemurt, this is one of the shadier parts of this library, unfortunately. `tokenizer.train` actually uses another object called the `trainer`, which might not see some of...
This example seems to work correctly:
```python
from tokenizers import Tokenizer, models, pre_tokenizers, normalizers, trainers
import datasets

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i +...
```
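Since the snippet above is cut off, here is a hedged sketch of how the rest of such a script typically looks; the dataset, vocab size, and special tokens are illustrative assumptions. The point from the comment above is that training options such as `special_tokens` belong on the `trainer` object, which is then handed to `train_from_iterator`:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
import datasets

# Illustrative dataset; any datasets.Dataset with a "text" column works.
dataset = datasets.load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Special tokens must be declared on the trainer, not on the tokenizer,
# since tokenizer.train* delegates the actual training to this object.
trainer = trainers.BpeTrainer(vocab_size=25_000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))
```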