BERT-pytorch
Default model sizes are much smaller than BERT base
The base BERT model in https://arxiv.org/pdf/1810.04805.pdf uses 768 hidden features, 12 layers, and 12 attention heads (which are also the defaults in bert.py), while the default configuration in the argparser of __main__.py uses 256 hidden features, 8 layers, and 8 heads. Would it make sense to align the example script with the paper? I spent quite a while puzzling over my low GPU utilization with the default configuration. Thanks!
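One way to make the alignment concrete would be to change the argparse defaults to the BERT-base values from the paper. A minimal sketch, assuming hypothetical flag names modeled on typical BERT-pytorch options (the actual names in __main__.py may differ):

```python
import argparse

# Sketch: argparse defaults set to BERT-base (768 hidden, 12 layers,
# 12 attention heads) rather than the smaller 256/8/8 configuration.
# Flag names here are illustrative, not taken from the repo.
parser = argparse.ArgumentParser()
parser.add_argument("--hidden", type=int, default=768,
                    help="hidden size of the transformer model")
parser.add_argument("--layers", type=int, default=12,
                    help="number of transformer layers")
parser.add_argument("--attn_heads", type=int, default=12,
                    help="number of attention heads")

# Parse an empty argument list to show the defaults that would apply.
args = parser.parse_args([])
print(args.hidden, args.layers, args.attn_heads)  # → 768 12 12
```

Keeping the smaller 256/8/8 defaults can still be useful for quick smoke tests, so another option would be documenting them as a deliberately tiny demo configuration instead of changing them.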