SciBERT Corpus Availability?
Hi, do you plan to make the pretraining corpus available, or provide a way to reproduce / approximate it using Semantic Scholar?
Hey @CyndxAI, unfortunately the SciBERT pretraining corpus is not publicly available. If you're interested in a large pretraining corpus for training these large language models, I can point you to another project from our team: https://github.com/allenai/s2orc, which provides 70M+ paper abstracts and 8M+ full-text papers. That should be plenty of text to train on. If you check out the preprint https://arxiv.org/pdf/1911.02782.pdf, you can see that we reproduced the SciBERT results using this corpus.
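For reference, here's a rough sketch of how you could flatten an S2ORC-style release into a plain-text file for pretraining. The directory layout and the field names (`abstract`, `body_text`, `text`) are assumptions based on the S2ORC JSONL schema, not something specified in this thread, so double-check them against the actual release you download.

```python
# Sketch: building a plain-text pretraining corpus from S2ORC-style JSONL shards.
# Paths and field names ("abstract", "body_text", "text") are assumptions; verify
# them against the schema of the S2ORC release you are working with.
import gzip
import json
from pathlib import Path


def iter_texts(metadata_dir: Path, pdf_parses_dir: Path):
    """Yield abstract and body-text strings from S2ORC-style JSONL shards."""
    # Abstracts are assumed to live in the metadata shards under "abstract".
    for shard in sorted(metadata_dir.glob("*.jsonl.gz")):
        with gzip.open(shard, "rt", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                if record.get("abstract"):
                    yield record["abstract"]
    # Full text is assumed to live in the pdf_parses shards under "body_text",
    # a list of paragraph dicts each carrying a "text" key.
    for shard in sorted(pdf_parses_dir.glob("*.jsonl.gz")):
        with gzip.open(shard, "rt", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                for para in record.get("body_text", []):
                    if para.get("text"):
                        yield para["text"]


if __name__ == "__main__":
    # Hypothetical local paths to downloaded shards.
    out = Path("pretraining_corpus.txt")
    with out.open("w", encoding="utf-8") as sink:
        for text in iter_texts(Path("s2orc/metadata"), Path("s2orc/pdf_parses")):
            sink.write(text.strip() + "\n")
```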
Perfect, that works. Thank you!