Dat Quoc Nguyen
Following [https://github.com/huggingface/transformers/pull/13788](https://github.com/huggingface/transformers/pull/13788), I have now added a "fast" version of the BartphoTokenizer. @sgugger, @LysandreJik, @patil-suraj, @SaulLu and @patrickvonplaten, could you please have a look and provide your feedback? Thanks.
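For context, here is a minimal sketch of how the fast variant would be loaded once this PR is merged. The checkpoint name `vinai/bartpho-syllable` is from the BARTpho release; `use_fast=True` assumes the `BartphoTokenizerFast` class introduced in this PR is available in your install:

```python
from transformers import AutoTokenizer

# Assumes the BartphoTokenizerFast class from this PR is installed;
# without it, AutoTokenizer falls back to the slow BartphoTokenizer.
fast_tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable", use_fast=True)

encoding = fast_tokenizer("Chúng tôi là những nghiên cứu viên.")
print(encoding.input_ids)
```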
Hi @patil-suraj and @sgugger, I have revised the slow and fast BartphoTokenizer variants to address your requests. Please have a look and give feedback. Thanks. cc: @SaulLu @LysandreJik
Please note that the failing checks are caused by `test_modeling_wav2vec2_conformer.py` and are unrelated to our BartphoTokenizer. @SaulLu
@sgugger Ah, I now see your point. I initially thought the code would be much nicer if I also pushed a new version of the slow tokenizer. But then it...
Hi @SaulLu, @sgugger, @patil-suraj, @LysandreJik and @patrickvonplaten, in addition to a fast BARTpho tokenizer, I have also revised my code to add fast tokenizers for BERTweet and PhoBERT. Here,...
@SaulLu Thank you very much for your detailed feedback and suggestions. Before moving forward with revising the code w.r.t. the `add_tokens` feature, it would be great if you could provide...
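To make the discussion concrete, this is the standard `add_tokens` behavior the revision needs to support; the token strings below are made up for illustration:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")

# add_tokens returns the number of tokens that were actually new.
num_added = tokenizer.add_tokens(["<tkn_1>", "<tkn_2>"])
print(num_added)       # 2 if neither token already existed in the vocabulary
print(len(tokenizer))  # original vocab size + num_added
```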
@SaulLu Similarly, for the monolingual models PhoBERT (for Vietnamese) and BERTweet (for English), vocabularies of 64K subword types should be more than enough, so we might not need to use...
@SaulLu Thank you very much for your feedback. I have improved the workaround strategy to handle the issue with newly added tokens. Assume that the sizes of the multilingual and monolingual...
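A minimal sketch of the ID-remapping idea behind that strategy, with illustrative sizes only (the numbers and the helper name are hypothetical, not the PR's actual code): the underlying multilingual SentencePiece model has M entries, while the monolingual vocabulary keeps only N of them (N < M), so tokens appended via `add_tokens` must receive IDs N, N + 1, ... rather than M, M + 1, ...

```python
# Hypothetical sizes for illustration only.
M = 250_000  # multilingual SentencePiece vocab size (e.g., XLM-R scale)
N = 64_000   # monolingual vocab size kept for the model

def remap_added_token_id(token_id: int) -> int:
    """Shift IDs of newly added tokens so they follow the monolingual vocab."""
    if token_id >= M:  # token was appended after the multilingual vocab
        return N + (token_id - M)
    return token_id    # token already lives in the monolingual range
```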
> Hi @datquocnguyen. It's amazing that you added those two new fast tokenizers. However, we need PRs to be focused on one thing. Would you terribly mind splitting it in...
@SaulLu Could you please review [the improved strategy](https://github.com/huggingface/transformers/pull/17254#issuecomment-1139492485) and give feedback? Thank you very much. Please note that the failed checks are not related to my BartphoTokenizer, except for one...