Arthur

Results: 795 comments of Arthur

What should be made clear is that only the code blocks (and not the entire file) should be skipped. This might be why longt5 is not skipped! I’ll be off for a...

Not sure it does, no! The added tokens were the issue, if I remember correctly.

My current priority is #24629, then it will be the tokenizer PR, which seems to be the last blocking factor. In the meantime I think that it should be...

Ok! Let me have a second look at the tokenizer then! Quite a few issues with `spm` and `AddedToken` are currently being taken care of!

You have to manually add the tokens, and that can't be done in the init with the current API, but this allows us to remove the crazy regex in encoding.
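
For illustration, a minimal sketch of adding tokens after instantiation, assuming the standard `transformers` API; the model name and token string are placeholders, not the ones from the PR:

```python
from transformers import AddedToken, AutoTokenizer

# Tokens can't be passed to __init__ with the current API, so they are
# registered afterwards; AddedToken controls stripping behaviour, which is
# what lets us drop the regex in encoding.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
tokenizer.add_tokens([AddedToken("<extra_id_100>", lstrip=True, rstrip=False)])
```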

Regarding the priority, not really sure. I won't really have time to dive deep into this for a few weeks. If a contributor wants to work on this, feel free...

Will have a look and try to re-upload a working tokenizer!

How I added the tokenizer (removed the convert-token-to-id regex logic):

```python
>>> from transformers import UdopTokenizer
>>> tokenizer = UdopTokenizer("ArthurZ/udop/spiece.model")
>>> tokenizer.add_tokens(tokenizer.additional_special_tokens)
```

this currently gives...

The default `eos_token` and `bos_token` are there because the `sentencepiece` model has these set, which means we are following the `llama` implementation. Having `add_eos` and `add_bos` gives the flexibility...
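
A minimal sketch of what that flexibility looks like with the llama-style flags (`add_bos_token` / `add_eos_token` are the `LlamaTokenizer` argument names; the `spiece.model` path is a placeholder for a local sentencepiece file):

```python
from transformers import LlamaTokenizer

# The flags toggle whether the bos/eos ids are added at encode time,
# independently of the defaults baked into the sentencepiece model.
tokenizer = LlamaTokenizer("spiece.model", add_bos_token=True, add_eos_token=False)
ids = tokenizer.encode("hello")  # starts with bos_token_id, no eos appended
```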