Nicolas Patry
Yes, it's your postprocessing that is most likely causing the issue.
Yes, that's something like a log probability.
Unfortunately, activity on this issue seems to suggest this is a very low-demand feature, with a very low probability of seeing the light of day anytime soon. I'm going to close it....
Most users are using `tokenizers` at inference, not so much at training. There is dropout for BPE too, which I don't see many issues about (not sure if...
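For reference, a minimal sketch of enabling BPE dropout through the Python bindings (the corpus path, dropout value, and vocab size are placeholders):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# BPE dropout: merges are randomly skipped at encode time, so the same
# string can tokenize differently across calls (a regularization trick).
tokenizer = Tokenizer(BPE(unk_token="[UNK]", dropout=0.1))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder corpus

# With dropout > 0, repeated encodes of the same text may differ.
print(tokenizer.encode("tokenization can be stochastic").tokens)
print(tokenizer.encode("tokenization can be stochastic").tokens)
```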
Do you have a reproducible script? It sounds like an integer overflow. This library only uses `u32` for most of the counting, meaning that large datasets (especially without careful...
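To give a sense of scale, a back-of-the-envelope check of where a `u32` counter would wrap (purely illustrative, not the library's actual code path):

```python
# Purely illustrative: where an unsigned 32-bit counter tops out.
U32_MAX = 2**32 - 1              # 4_294_967_295
pair_count = 10_000_000_000      # e.g. a very frequent pair in a ~10B-token corpus

print(pair_count > U32_MAX)      # True: this count no longer fits in a u32
print(pair_count % 2**32)        # 1410065408: what a wrapping u32 would end up holding
```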
Good first issues, not really. The biggest long-standing request was rewriting the Node bindings using the latest neon to get support for the latest Node versions. I have zero idea how...
Have you checked out the PR that fixes it? https://github.com/huggingface/tokenizers/pull/909 It is not going to be merged anytime soon since it changes the on-disk format of the tokenizer, so we need...
Can you check your `tokenizers` versions? I think they are not on the same major version (probably). `tokenizers` is designed to be backward compatible, but what you're describing here is forward compatibility...
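A quick way to compare, assuming the Python bindings, is to print the installed version in each environment:

```python
import tokenizers

# Run this in both environments (the one that saved the tokenizer and the
# one that loads it) and compare at least the major versions.
print(tokenizers.__version__)
```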
> should this forward compatibility changes across tokenizer versions be more specifically documented somewhere, so it's accessible easily?

There's a changelog + releases: https://github.com/huggingface/tokenizers/releases?page=2 That should be enough (but not...
I mean that the encodings are exactly the same on a large enough subset of text (`tokenizer.encode(mystring)`).
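A rough sketch of that kind of check (file names are placeholders, assuming both tokenizers are saved as `tokenizer.json` files):

```python
from tokenizers import Tokenizer

# Hypothetical check: encode the same lines with both tokenizer files and
# count how many produce different ids.
old = Tokenizer.from_file("tokenizer_old.json")
new = Tokenizer.from_file("tokenizer_new.json")

with open("sample_corpus.txt", encoding="utf-8") as f:
    texts = f.read().splitlines()

mismatches = [t for t in texts if old.encode(t).ids != new.encode(t).ids]
print(f"{len(mismatches)} mismatching lines out of {len(texts)}")
```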