Dat Quoc Nguyen

Results: 22 comments by Dat Quoc Nguyen

Following [https://github.com/huggingface/transformers/pull/13788](https://github.com/huggingface/transformers/pull/13788), I have now added a "fast" version of the BartphoTokenizer. @sgugger, @LysandreJik, @patil-suraj, @SaulLu and @patrickvonplaten, could you please have a look and provide your feedback? Thanks.

Hi @patil-suraj and @sgugger, I have revised the slow and fast BartphoTokenizer variants to satisfy your requirements. Please have a look and give feedback. Thanks. cc: @SaulLu @LysandreJik

Please note that the unsuccessful checks are due to the failing `test_modeling_wav2vec2_conformer.py` and are not related to our BartphoTokenizer. @SaulLu

@sgugger Ah, I now see your point. I initially thought the code would be much nicer if I also pushed a new version of the slow tokenizer. But then it...

Hi @SaulLu, @sgugger, @patil-suraj, @LysandreJik and @patrickvonplaten, in addition to a fast BARTpho tokenizer, I have also revised my code to add fast tokenizers for BERTweet and PhoBERT. Here,...

@SaulLu Thank you very much for your detailed feedback and suggestions. Before moving forward to revise the code w.r.t. the `add_tokens` feature, it would be great if you could provide...

@SaulLu Similarly, for the monolingual models PhoBERT (Vietnamese) and BERTweet (English), vocabularies of 64K subword types should be more than enough, so we might not need to use...

@SaulLu Thank you very much for your feedback. I have improved the workaround strategy to handle the issue with newly added tokens. Assume that the sizes of the multilingual and monolingual...
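The idea behind handling newly added tokens here can be illustrated with a minimal toy sketch (all token strings, vocabulary sizes, and function names below are hypothetical for illustration, not the actual BartphoTokenizer internals): the monolingual vocabulary is a subset kept from a larger multilingual one, and newly added tokens receive IDs appended after the monolingual vocabulary size, while unknown tokens fall back to `<unk>`.

```python
# Toy sketch of the vocab-remapping idea: a small monolingual vocab is kept
# from a larger multilingual vocab, and newly added tokens get IDs appended
# after the monolingual vocab. All tokens and sizes are made up.

MULTILINGUAL = ["<s>", "<pad>", "</s>", "<unk>", "xin", "chao", "hello", "world"]
MONOLINGUAL = ["<s>", "<pad>", "</s>", "<unk>", "xin", "chao"]  # subset kept

mono_id = {tok: i for i, tok in enumerate(MONOLINGUAL)}
added = {}  # newly added tokens -> IDs after the monolingual vocab


def add_token(tok):
    """Register a new token with an ID appended after the monolingual vocab."""
    if tok not in mono_id and tok not in added:
        added[tok] = len(MONOLINGUAL) + len(added)


def convert_token_to_id(tok):
    """Look up the monolingual vocab first, then added tokens, else <unk>."""
    if tok in mono_id:
        return mono_id[tok]
    if tok in added:
        return added[tok]
    return mono_id["<unk>"]


add_token("moi")  # a newly added token
print(convert_token_to_id("xin"))    # 4: in the monolingual vocab
print(convert_token_to_id("moi"))    # 6: appended after the 6-token vocab
print(convert_token_to_id("hello"))  # 3: not kept monolingually -> <unk>
```

The key property is that added-token IDs never collide with monolingual IDs, because they are allocated strictly after `len(MONOLINGUAL)`.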

> Hi @datquocnguyen. It's amazing that you added those two new fast tokenizers. However we need PRs to be focused on one thing. Would you terribly mind splitting it in...

@SaulLu Please help review [the improved strategy](https://github.com/huggingface/transformers/pull/17254#issuecomment-1139492485) and give feedback. Thank you very much. Please note that the failed checks are not related to my BartphoTokenizer, except for one...