Pretraining not generating added tokens file
Hi,

I am trying to further pre-train the pretrained BERT model on my own corpus using the learner_lm model. After training, the saved model does not contain the added tokens file, and the vocab size remains at 30522, i.e. BERT's default vocabulary size. I have also taken a look at the lm_train and lm_test files and couldn't make out the format they use.
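To make that concrete, here is roughly what I expected to happen, sketched with the Hugging Face `transformers` API (I am assuming something like this happens under the hood; the token names and output directory below are just placeholders):

```python
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

print(len(tokenizer))  # 30522, the default BERT vocabulary size

# Hypothetical domain-specific tokens collected from my corpus
new_tokens = ["new_domain_token_1", "new_domain_token_2"]
num_added = tokenizer.add_tokens(new_tokens)   # returns how many tokens were actually new
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to match

print(len(tokenizer))  # 30522 + num_added

# Saving the tokenizer after add_tokens() is what writes out the added tokens file
tokenizer.save_pretrained("further-pretrained-bert")
model.save_pretrained("further-pretrained-bert")
```

In my runs, nothing equivalent to the last step seems to happen, so the vocab stays at 30522.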
There is also a test % split parameter for the learner_lm model. What does this percentage refer to? My understanding is that it should be the percentage of masked tokens in the corpus (as in the original Google BERT script), or am I missing something here?
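For clarity, here are the two different percentages I am trying to tell apart, as a plain-Python sketch (the function and parameter names are my own, not the library's):

```python
import random

def split_corpus(lines, test_pct=0.1):
    """Hold out test_pct of the corpus lines for evaluation (lm_test);
    the rest becomes the training set (lm_train)."""
    shuffled = lines[:]
    random.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_pct))
    return shuffled[:cut], shuffled[cut:]

def mask_tokens(tokens, mask_prob=0.15):
    """Randomly replace mask_prob of the tokens with [MASK],
    as in the original BERT masked-LM objective (15% by default)."""
    return ["[MASK]" if random.random() < mask_prob else tok for tok in tokens]
```

Is the test % split parameter the first kind of percentage (a train/test split of the corpus) or the second (the masking probability)?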
Would it also be possible to add support for whole word masking and skip-thoughts training in future versions?
Thanks in advance!
It might be best to submit one issue at a time and label each appropriately as bug/enhancement/help wanted, etc.