SciBERT Corpus Availability?

Open CyndxAI opened this issue 4 years ago • 2 comments

Hi, do you plan to make the pretraining corpus available, or provide a way to reproduce / approximate it using Semantic Scholar?

CyndxAI avatar May 21 '20 00:05 CyndxAI

Hey @CyndxAI, unfortunately the SciBERT pretraining corpus is not publicly available. If you're interested in a large corpus for pretraining language models, I can point you to another project from our team: https://github.com/allenai/s2orc, which provides 70M+ paper abstracts and 8M+ full-text papers. That should be plenty of text to train on. If you check out the preprint (https://arxiv.org/pdf/1911.02782.pdf), you can see that we reproduced SciBERT results using this corpus.

kyleclo avatar May 21 '20 01:05 kyleclo

Perfect, that works. Thank you!

CyndxAI avatar May 21 '20 01:05 CyndxAI
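
For anyone landing here who wants to approximate a pretraining corpus from a dump such as S2ORC, here is a minimal sketch of turning metadata shards into a one-document-per-line text file. It is not the SciBERT authors' pipeline. It assumes the shards are gzip-compressed JSON lines and that each record carries its text under a field called `abstract`. Both the shard layout and that field name are assumptions for illustration, so check the S2ORC README for the actual schema. The helper names `iter_texts` and `write_corpus` are hypothetical.

```python
import gzip
import json
from pathlib import Path
from typing import Iterable, Iterator


def iter_texts(shard_paths: Iterable[Path], text_field: str = "abstract") -> Iterator[str]:
    """Yield non-empty text from each JSON-lines record in the given shards."""
    for path in shard_paths:
        # Assumption: each shard is gzip-compressed, one JSON record per line.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                record = json.loads(line)
                text = record.get(text_field)  # hypothetical field name
                if isinstance(text, str) and text.strip():
                    # Collapse whitespace so each document fits on one line.
                    yield " ".join(text.split())


def write_corpus(shard_paths: Iterable[Path], out_path: Path, text_field: str = "abstract") -> int:
    """Write one document per line and return the number of documents written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for text in iter_texts(shard_paths, text_field):
            out.write(text + "\n")
            count += 1
    return count
```

Calling `write_corpus(sorted(Path("shards").glob("*.jsonl.gz")), Path("corpus.txt"))` (with `shards` as a placeholder directory) would produce a plain-text file that most BERT-style pretraining scripts can consume after their own sentence splitting and tokenization. Records with a missing or empty text field are skipped rather than treated as errors.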