Arthur

Results: 795 comments of Arthur

What should be made clear is that only the code blocks (and not the entire file) should be skipped. This might be why longt5 is not skipped! I’ll be off for a...

Not sure it does, no! The added tokens were the issue, if I remember correctly.

My current priority is #24629, then it will be the tokenizer PR, which seems to be the last blocking factor. In the meantime I think that it should be...

Ok! Let me have a second look at the tokenizer then! Quite a few issues with `spm` and `AddedToken` are currently being taken care of!

You have to manually add the tokens, and that can't be done in the init with the current API, but this allows us to remove the crazy regex in encoding.
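
For illustration, a minimal sketch of adding tokens after instantiation, assuming the standard `transformers` API; the model name and token string are placeholders, not the ones from the PR:

```python
from transformers import AddedToken, AutoTokenizer

# Tokens can't be passed to __init__ with the current API, so they are
# registered afterwards; AddedToken controls stripping behaviour, which is
# what lets us drop the regex in encoding.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
tokenizer.add_tokens([AddedToken("<extra_id_100>", lstrip=True, rstrip=False)])
```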

Regarding the priority, not really sure. I won't really have time to dive deep into this for a few weeks. If a contributor wants to work on this, feel free...

Will have a look and try to re-upload a working tokenizer!

How I added the tokenizer (removed the convert-token-to-id regex logic):

```python
>>> from transformers import UdopTokenizer
>>> tokenizer = UdopTokenizer("ArthurZ/udop/spiece.model")
>>> tokenizer.add_tokens(tokenizer.additional_special_tokens)
```

this currently gives...

The default `eos_token` and `bos_token` are there because the `sentencepiece` model has these set, which means we are following the `llama` implementation. Having `add_eos` and `add_bos` gives the flexibility...
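
A minimal sketch of what that flexibility looks like with the llama-style flags (`add_bos_token` / `add_eos_token` are the `LlamaTokenizer` argument names; the `spiece.model` path is a placeholder for a local sentencepiece file):

```python
from transformers import LlamaTokenizer

# The flags toggle whether the bos/eos ids are added at encode time,
# independently of the defaults baked into the sentencepiece model.
tokenizer = LlamaTokenizer("spiece.model", add_bos_token=True, add_eos_token=False)
ids = tokenizer.encode("hello")  # starts with bos_token_id, no eos appended
```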