
Preprocessed data are in the wrong path 512/wikipedia_pretrain.

Open kaiidams opened this issue 6 years ago • 2 comments

BERT_pretrain.ipynb instructs users to download https://bertonazuremlwestus2.blob.core.windows.net/public/bert_data.tar.gz for the preprocessed data. The tar file contains the data in 512/wikipedia_pretrain, but it should be in 512/wiki_pretrain.
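
Until the archive is fixed, one possible local workaround is to rename the extracted directory. The following is only a sketch: the helper name `fix_pretrain_dir` and the `data_root` argument are hypothetical, and it assumes the tarball was extracted so that a `512` directory sits directly under `data_root`.

```python
from pathlib import Path


def fix_pretrain_dir(data_root, wrong="wikipedia_pretrain", expected="wiki_pretrain"):
    """Rename <data_root>/512/<wrong> to <data_root>/512/<expected> if needed.

    Hypothetical helper: `data_root` is wherever bert_data.tar.gz was extracted.
    """
    base = Path(data_root) / "512"
    src, dst = base / wrong, base / expected
    if dst.exists():
        # The data is already where it should be; nothing to do.
        return dst
    if not src.is_dir():
        raise FileNotFoundError(f"expected extracted data at {src}")
    src.rename(dst)
    return dst
```

Calling `fix_pretrain_dir("path/to/extracted/data")` a second time is harmless, because the function returns early once the expected directory exists.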

kaiidams avatar Sep 10 '19 01:09 kaiidams

The serialized data files wikipedia_segmented_part_NN.bin refer to WikiNBookCorpusPretrainingDataCreator, which has been deleted in the latest code. Adding the following avoids the issue.

class WikiNBookCorpusPretrainingDataCreator(PretrainingDataCreator):
    pass
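
Why an empty subclass is enough can be illustrated with a small standalone sketch. It assumes the .bin files are Python pickles, which the thread does not state explicitly; the module name `demo_dataset`, the `define` helper, and the `instances` attribute are hypothetical stand-ins, not the repository's actual layout. A pickle stores only the class's module and name, so loading fails when that name no longer exists and works again once any class with the same name is defined there.

```python
import pickle
import sys
import types

MODULE = "demo_dataset"  # hypothetical module name, for illustration only
demo = types.ModuleType(MODULE)
sys.modules[MODULE] = demo


def define(name, *bases):
    """Create an empty class and expose it as demo.<name>."""
    cls = type(name, bases or (object,), {"__module__": MODULE})
    setattr(demo, name, cls)
    return cls


base = define("PretrainingDataCreator")
old = define("WikiNBookCorpusPretrainingDataCreator", base)

# Serialize an instance of the subclass, as older preprocessing code might have.
obj = old()
obj.instances = ["a", "b"]
payload = pickle.dumps(obj)

# Simulate the newer code, where the subclass has been deleted.
del demo.WikiNBookCorpusPretrainingDataCreator
try:
    pickle.loads(payload)
    load_failed = False
except AttributeError:
    load_failed = True

# Restore an empty subclass under the same name; loading works again.
define("WikiNBookCorpusPretrainingDataCreator", base)
restored = pickle.loads(payload)
```

The restored object keeps its saved attributes because pickle rebuilds the instance from the stored state; the stub class only has to be findable under the original module and name.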

kaiidams avatar Sep 10 '19 03:09 kaiidams

@kaiidams thanks for reporting this issue. We will update the tar file soon. In the meantime, download and use the data referenced in https://github.com/microsoft/AzureML-BERT/blob/master/docs/artifacts.md#preprocessed-data and you will not need the deleted file for loading the data.

skaarthik avatar Sep 25 '19 05:09 skaarthik