Nicolas Patry
Yes, it's your postprocessing that is most likely causing the issue.
Yes, that's something like a log probability.
Unfortunately, activity on this issue seems to suggest this is a very low-demand feature, with a very low probability of seeing the light of day anytime soon. I'm going to close it....
Most users are using `tokenizers` at inference, not so much at training. There is dropout for BPE too, which I don't see many issues about (not sure if...
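For reference, a minimal sketch of enabling BPE dropout through the Python bindings (the corpus path, dropout value, and vocab size are placeholders):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# BPE dropout: merges are randomly skipped at encode time, so the same
# string can tokenize differently across calls (a regularization trick).
tokenizer = Tokenizer(BPE(unk_token="[UNK]", dropout=0.1))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder corpus

# With dropout > 0, repeated encodes of the same text may differ.
print(tokenizer.encode("tokenization can be stochastic").tokens)
print(tokenizer.encode("tokenization can be stochastic").tokens)
```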
Do you have a reproducible script? It sounds like an integer overflow. This library only uses `u32` for most of the counting, meaning that large datasets (especially without careful...
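To give a sense of scale, a back-of-the-envelope check of where a `u32` counter would wrap (purely illustrative, not the library's actual code path):

```python
# Purely illustrative: where an unsigned 32-bit counter tops out.
U32_MAX = 2**32 - 1              # 4_294_967_295
pair_count = 10_000_000_000      # e.g. a very frequent pair in a ~10B-token corpus

print(pair_count > U32_MAX)      # True: this count no longer fits in a u32
print(pair_count % 2**32)        # 1410065408: what a wrapping u32 would end up holding
```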
Good first issues, not really. The biggest long-standing request was rewriting the Node bindings using the latest neon to get support for the latest Node versions. I have zero idea how...
Have you checked out the PR that fixes it? https://github.com/huggingface/tokenizers/pull/909 It is not going to be merged anytime soon since it changes the on-disk format of the tokenizer, so we need...
Can you check your `tokenizers` versions? I think they are not on the same major version (probably). `tokenizers` is designed to be backward compatible, but what you're describing here is forward compatibility...
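A quick way to compare, assuming the Python bindings, is to print the installed version in each environment:

```python
import tokenizers

# Run this in both environments (the one that saved the tokenizer and the
# one that loads it) and compare at least the major versions.
print(tokenizers.__version__)
```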
> should this forward compatibility changes across tokenizer versions be more specifically documented somewhere, so it's accessible easily?

There's a changelog + releases: https://github.com/huggingface/tokenizers/releases?page=2 That should be enough (but not...
I mean that the encodings are exactly the same on a large enough subset of text (`tokenizer.encode(mystring)`).
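A rough sketch of that kind of check (file names are placeholders, assuming both tokenizers are saved as `tokenizer.json` files):

```python
from tokenizers import Tokenizer

# Hypothetical check: encode the same lines with both tokenizer files and
# count how many produce different ids.
old = Tokenizer.from_file("tokenizer_old.json")
new = Tokenizer.from_file("tokenizer_new.json")

with open("sample_corpus.txt", encoding="utf-8") as f:
    texts = f.read().splitlines()

mismatches = [t for t in texts if old.encode(t).ids != new.encode(t).ids]
print(f"{len(mismatches)} mismatching lines out of {len(texts)}")
```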