Nicolas Patry
I will reopen this issue if you don't mind, since the easy fix works but is not the end of it. IMHO, the code you submitted should work out of...
Hi @catqaq, do you mind sharing the exact script you created with the doc? Also, are you using the exact data from the script? Do you mind...
> it's a little inconvenient that we can't get expected vocab size easily

As mentioned in the linked issue, if you trigger that behavior based on the number of chars alone,...
Yes, I see what you mean. There is some other work that, combined with a byte hack, might enable stricter vocab_size enforcement without...
Hmm, using pure bytes as a source vocabulary is definitely better, as 256 would be the min vocab and nothing else would be necessary. The main drawback with this approach...
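For context on the byte-based idea, here is a minimal sketch of training a byte-level BPE with `tokenizers`, where the 256 byte symbols form the fixed base alphabet; the file name and `vocab_size` value below are placeholders, not values from this thread:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE: the base alphabet is the 256 byte symbols, so the
# minimum possible vocabulary is 256 regardless of the training data.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=1000,  # illustrative target, not a value from the thread
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# "corpus.txt" is a hypothetical training file, just for illustration.
tokenizer.train(["corpus.txt"], trainer)
print(tokenizer.get_vocab_size())
```

With a byte-level base, every input character decomposes into known byte symbols, which is why nothing beyond the 256-symbol base would be necessary.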
@duskybomb Does the problem still exist on the latest `0.12.1`? I can't seem to reproduce.
Do you have a simple reproducible script? Here is the script I tried to use to reproduce, but it seems to be working properly: ````python from tokenizers import trainers,...
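The script itself is truncated in the preview above. Purely as a sketch of the general shape such a minimal reproduction takes (this is not the original script; every file name and parameter here is an assumption):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Minimal end-to-end training run (illustrative only).
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=5000, special_tokens=["[UNK]"])
tokenizer.train(["data.txt"], trainer)  # "data.txt" is a placeholder file

print(tokenizer.encode("hello world").tokens)
```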
Hi @yechong316, it seems your file contains merges which are not acceptable in the currently deployed version of `tokenizers`. Those merges contain multiple spaces: `"e s "` for instance...
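To make the multiple-spaces point concrete, here is a small hypothetical check (the `merges.txt` path is a placeholder) that flags merge lines which do not split into exactly two parts, such as `"e s "`:

```python
# Flag merge entries that don't consist of exactly two space-separated parts.
# "merges.txt" is a placeholder path, not a file from the issue.
with open("merges.txt", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#version"):
            continue
        parts = line.split(" ")
        if len(parts) != 2:
            print(f"line {lineno}: invalid merge {line!r} ({len(parts)} parts)")
```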
Hi @LoicGrobol

> Store them in List[str] form, which is not very satisfying because it requires encoding before batching (potential bottleneck and duplication of work)

Do you have an example...
> This is quite fast, because everything is already encoded when we get to 2. because we just have to manipulate tensors and these are easy to use in a...
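As a rough illustration of the encode-up-front pattern under discussion, the sketch below batch-encodes a `List[str]` once with `encode_batch` and then only manipulates tensors afterwards; the tokenizer file and texts are placeholders:

```python
import torch
from tokenizers import Tokenizer

# "tokenizer.json" and the texts are placeholders for illustration.
tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_padding()  # pad so the batch can be stacked into one tensor

texts = ["first example", "second, slightly longer example"]
encodings = tokenizer.encode_batch(texts)  # encode once, up front

ids = torch.tensor([e.ids for e in encodings])
mask = torch.tensor([e.attention_mask for e in encodings])
print(ids.shape, mask.shape)
```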