Pretraining not generating added tokens file
Hi,

I am trying to further pre-train the pretrained BERT model on my own corpus using the learner_lm model. After training, the saved model does not contain the added tokens file, and the vocab size remains at 30522, i.e. BERT's default vocabulary size. I have also taken a look at the lm_train and lm_test files and couldn't make out the format they use.
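To make that concrete, here is roughly what I expected to happen, sketched with the Hugging Face `transformers` API (I am assuming something like this happens under the hood; the token names and output directory below are just placeholders):

```python
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

print(len(tokenizer))  # 30522, the default BERT vocabulary size

# Hypothetical domain-specific tokens collected from my corpus
new_tokens = ["new_domain_token_1", "new_domain_token_2"]
num_added = tokenizer.add_tokens(new_tokens)   # returns how many tokens were actually new
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to match

print(len(tokenizer))  # 30522 + num_added

# Saving the tokenizer after add_tokens() is what writes out the added tokens file
tokenizer.save_pretrained("further-pretrained-bert")
model.save_pretrained("further-pretrained-bert")
```

In my runs, nothing equivalent to the last step seems to happen, so the vocab stays at 30522.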
There is also a test % split parameter for the learner_lm model. What does this percentage refer to? My understanding is that it should be the percentage of masked tokens in the corpus (as in the original Google BERT script), or am I missing something here?
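For clarity, here are the two different percentages I am trying to tell apart, as a plain-Python sketch (the function and parameter names are my own, not the library's):

```python
import random

def split_corpus(lines, test_pct=0.1):
    """Hold out test_pct of the corpus lines for evaluation (lm_test);
    the rest becomes the training set (lm_train)."""
    shuffled = lines[:]
    random.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_pct))
    return shuffled[:cut], shuffled[cut:]

def mask_tokens(tokens, mask_prob=0.15):
    """Randomly replace mask_prob of the tokens with [MASK],
    as in the original BERT masked-LM objective (15% by default)."""
    return ["[MASK]" if random.random() < mask_prob else tok for tok in tokens]
```

Is the test % split parameter the first kind of percentage (a train/test split of the corpus) or the second (the masking probability)?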
Would it also be possible to add support for whole word masking and skip-thoughts training in future versions?
Thanks in advance!
It might be best to submit one issue at a time and label each appropriately as bug/enhancement/help wanted, etc.