Nicolas Patry
I will reopen this issue if you don't mind, since the easy fix works but is not the end of it. IMHO, the code you submitted should work out of...
Hi @catqaq, do you mind sharing the exact script you created with the doc? Also, are you using the exact data from the script? Do you mind...
> it's a little inconvenient that we can't get expected vocab size easily

As mentioned in the linked issue, if you trigger that behavior based on the number of chars alone,...
Yes, I see what you mean. There is some other work that, combined with a byte hack, might enable stricter vocab_size enforcement without...
Hmm, using pure bytes as a source vocabulary is definitely better, as 256 would be the min vocab and nothing else would be necessary. The main drawback with this approach...
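For context on the byte-based idea, here is a minimal sketch of training a byte-level BPE with `tokenizers`, where the 256 byte symbols form the fixed base alphabet; the file name and `vocab_size` value below are placeholders, not values from this thread:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE: the base alphabet is the 256 byte symbols, so the
# minimum possible vocabulary is 256 regardless of the training data.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=1000,  # illustrative target, not a value from the thread
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# "corpus.txt" is a hypothetical training file, just for illustration.
tokenizer.train(["corpus.txt"], trainer)
print(tokenizer.get_vocab_size())
```

With a byte-level base, every input character decomposes into known byte symbols, which is why nothing beyond the 256-symbol base would be necessary.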
@duskybomb Does the problem still exist on the latest `0.12.1`? I can't seem to reproduce.
Do you have a simple reproducible script? Here is the script I tried to use to reproduce, but it seems to be working properly: ````python from tokenizers import trainers,...
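The script itself is truncated in the preview above. Purely as a sketch of the general shape such a minimal reproduction takes (this is not the original script; every file name and parameter here is an assumption):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Minimal end-to-end training run (illustrative only).
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=5000, special_tokens=["[UNK]"])
tokenizer.train(["data.txt"], trainer)  # "data.txt" is a placeholder file

print(tokenizer.encode("hello world").tokens)
```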
Hi @yechong316, it seems your file contains merges which are not acceptable in the currently deployed version of `tokenizers`. Those merges contain multiple spaces: `"e s "` for instance...
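To make the multiple-spaces point concrete, here is a small hypothetical check (the `merges.txt` path is a placeholder) that flags merge lines which do not split into exactly two parts, such as `"e s "`:

```python
# Flag merge entries that don't consist of exactly two space-separated parts.
# "merges.txt" is a placeholder path, not a file from the issue.
with open("merges.txt", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#version"):
            continue
        parts = line.split(" ")
        if len(parts) != 2:
            print(f"line {lineno}: invalid merge {line!r} ({len(parts)} parts)")
```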
Hi @LoicGrobol

> Store them in List[str] form, which is not very satisfying because it requires encoding before batching (potential bottleneck and duplication of work)

Do you have an example...
> This is quite fast, because everything is already encoded when we get to 2. because we just have to manipulate tensors and these are easy to use in a...
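As a rough illustration of the encode-up-front pattern under discussion, the sketch below batch-encodes a `List[str]` once with `encode_batch` and then only manipulates tensors afterwards; the tokenizer file and texts are placeholders:

```python
import torch
from tokenizers import Tokenizer

# "tokenizer.json" and the texts are placeholders for illustration.
tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_padding()  # pad so the batch can be stacked into one tensor

texts = ["first example", "second, slightly longer example"]
encodings = tokenizer.encode_batch(texts)  # encode once, up front

ids = torch.tensor([e.ids for e in encodings])
mask = torch.tensor([e.attention_mask for e in encodings])
print(ids.shape, mask.shape)
```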