Dat Quoc Nguyen
Following [https://github.com/huggingface/transformers/pull/13788](https://github.com/huggingface/transformers/pull/13788), I have now added a "fast" version of the BartphoTokenizer. @sgugger, @LysandreJik, @patil-suraj, @SaulLu and @patrickvonplaten, could you please have a look and provide your feedback? Thanks.
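For context, here is a minimal sketch of how the fast variant would be loaded once this PR is merged. The checkpoint name `vinai/bartpho-syllable` is from the BARTpho release; `use_fast=True` assumes the `BartphoTokenizerFast` class introduced in this PR is available in your install:

```python
from transformers import AutoTokenizer

# Assumes the BartphoTokenizerFast class from this PR is installed;
# without it, AutoTokenizer falls back to the slow BartphoTokenizer.
fast_tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable", use_fast=True)

encoding = fast_tokenizer("Chúng tôi là những nghiên cứu viên.")
print(encoding.input_ids)
```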
Hi @patil-suraj and @sgugger, I have revised the slow and fast BartphoTokenizer variants to address your requests. Please have a look and give feedback. Thanks. cc: @SaulLu @LysandreJik
Please note that the failing checks are caused by `test_modeling_wav2vec2_conformer.py` and are unrelated to our BartphoTokenizer. @SaulLu
@sgugger Ah, I now see your point. I initially thought the code would be much nicer if I also pushed a new version of the slow tokenizer. But then it...
Hi @SaulLu, @sgugger, @patil-suraj, @LysandreJik and @patrickvonplaten, in addition to a fast BARTpho tokenizer, I have also revised my code to add fast tokenizers for BERTweet and PhoBERT. Here,...
@SaulLu Thank you very much for your detailed feedback and suggestions. Before moving forward with revising the code w.r.t. the `add_tokens` feature, it would be great if you could provide...
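To make the discussion concrete, this is the standard `add_tokens` behavior the revision needs to support; the token strings below are made up for illustration:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")

# add_tokens returns the number of tokens that were actually new.
num_added = tokenizer.add_tokens(["<tkn_1>", "<tkn_2>"])
print(num_added)       # 2 if neither token already existed in the vocabulary
print(len(tokenizer))  # original vocab size + num_added
```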
@SaulLu Similarly, for the monolingual models PhoBERT (for Vietnamese) and BERTweet (for English), vocabularies of 64K subword types should be more than enough, so we might not need to use...
@SaulLu Thank you very much for your feedback. I have improved the workaround strategy to handle the issue with newly added tokens. Assume that the sizes of the multilingual and monolingual...
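A minimal sketch of the ID-remapping idea behind that strategy, with illustrative sizes only (the numbers and the helper name are hypothetical, not the PR's actual code): the underlying multilingual SentencePiece model has M entries, while the monolingual vocabulary keeps only N of them (N < M), so tokens appended via `add_tokens` must receive IDs N, N + 1, ... rather than M, M + 1, ...

```python
# Hypothetical sizes for illustration only.
M = 250_000  # multilingual SentencePiece vocab size (e.g., XLM-R scale)
N = 64_000   # monolingual vocab size kept for the model

def remap_added_token_id(token_id: int) -> int:
    """Shift IDs of newly added tokens so they follow the monolingual vocab."""
    if token_id >= M:  # token was appended after the multilingual vocab
        return N + (token_id - M)
    return token_id    # token already lives in the monolingual range
```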
> Hi @datquocnguyen. It's amazing that you added those two new fast tokenizers. However, we need PRs to be focused on one thing. Would you terribly mind splitting it in...
@SaulLu Could you please review [the improved strategy](https://github.com/huggingface/transformers/pull/17254#issuecomment-1139492485) and give feedback? Thank you very much. Please note that the failed checks are not related to my BartphoTokenizer, except for one...