pre-training corpus

Open Humorloos opened this issue 2 years ago • 0 comments

Hello @autoliuweijie, thank you for your amazing and inspiring work!

I would like to pre-train a K-BERT model on an English-language corpus. To make this work, I am currently trying to get train_and_validate() to run with args.target set to "bert". I notice that with this setting, BertDataLoader is used to load the data, but I am not sure what exact format the dataset file at dataset_path needs to have. From the code, I can see that it has to be a pickle file, but I am having trouble constructing one that works with the data loader.

It would be very helpful to have access to the data file originally used for pre-training. Could you provide a link or instructions on how to construct it myself?

Humorloos, Jun 02 '22 10:06