BERT-pytorch

Making Book Corpus

Open codertimo opened this issue 7 years ago • 5 comments

The goal is to build the same corpus that the original paper used. Please share your tips for downloading and preprocessing the files. It would be great if someone could share preprocessed data via Dropbox, Google Drive, etc.

codertimo avatar Oct 30 '18 05:10 codertimo

#32

codertimo avatar Oct 30 '18 05:10 codertimo

The original paper (BERT) uses "the concatenation of BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words)." What do you mean by "Movie Corpus"?

mapingshuo avatar Oct 30 '18 07:10 mapingshuo

@mapingshuo Sorry, it's my fault. Haha, I just made that title in 5 seconds :) Thank you!! 👍

codertimo avatar Oct 30 '18 07:10 codertimo

That's okay, I am looking for a valid Book Corpus too.

mapingshuo avatar Oct 30 '18 08:10 mapingshuo

Both GPT and BERT were trained on BooksCorpus. Presumably there's a private copy that people are passing around. There are some web scrapers out there designed for recreating BooksCorpus, but this duplication of work seems unnecessary. If anyone finds a copy, do let me know!

Henry-E avatar Jan 11 '19 12:01 Henry-E