K-BERT
pre-training corpus
Hello @autoliuweijie, thank you for your amazing and inspiring work!
I would like to pre-train a K-BERT model on an English-language corpus. To make this work, I am currently trying to get `train_and_validate()` to run with `args.target` set to `"bert"`. I notice that with this setting, `BertDataLoader` is used to load the data, but I am not sure what exact format the dataset file at `dataset_path` has to be in. From the code, I can see that it has to be a pickle file, but I am having trouble constructing one that works with the data loader.
It would be very helpful to have access to the data file originally used for pre-training. Could you provide a link or instructions on how to construct it myself?