TUPE
What's the format of the raw data?
Nice work!
I wonder what the format of the raw data (wiki and bc) is. Is it one sentence per line, with an empty line between different articles?
It would be great if you could share the two raw files you mentioned in ./preprocess/pretrain/process.sh.
@Howal It is the raw text format. The wiki data is the output of wikiextractor.
We don't specially handle the newline token; we just keep it as it is.
The first lines of wiki data.
Unfortunately, we cannot distribute the data because of the license issue with the Book Corpus data. The wiki data is easy to download yourself.
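For anyone checking their own dump against this description, here is a minimal Python sketch of reading wikiextractor-style output. It assumes the extractor's default layout, where each article is wrapped in a `<doc id="..." url="..." title="...">` line and a closing `</doc>` line. The function name `iter_wiki_docs` and the sample text are made up for illustration; this is not the repository's preprocessing code.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumption: each article starts with a <doc ... title="..."> line,
# as in wikiextractor's default (non-JSON) output.
DOC_OPEN = re.compile(r'<doc\b[^>]*\btitle="([^"]*)"[^>]*>')


def iter_wiki_docs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (title, text) for each <doc ...> ... </doc> block."""
    title, body = None, []
    for line in lines:
        stripped = line.strip()
        match = DOC_OPEN.match(stripped)
        if match:
            # A new article begins; start collecting its lines.
            title, body = match.group(1), []
        elif stripped == "</doc>":
            if title is not None:
                yield title, "".join(body)
            title, body = None, []
        elif title is not None:
            # Newlines are not handled specially; lines are kept as they are.
            body.append(line)


# Hypothetical sample, only to show the shape of the input.
sample = '''<doc id="1" url="https://example.org/a" title="First article">
First article

Some text about the first article.
</doc>
<doc id="2" url="https://example.org/b" title="Second article">
Second article

Some text about the second article.
</doc>
'''

docs = list(iter_wiki_docs(sample.splitlines(keepends=True)))
for doc_title, doc_text in docs:
    print(doc_title, len(doc_text))
```

If your file has no `<doc>` wrappers at all, it was probably post-processed after extraction, and this sketch yields nothing.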
Thank you!
Hi @guolinke, in ./preprocess/pretrain/process.sh I saw that the bookcorpus data is stored in two files: BOOK_RAW="$DATA_DIR/book_corpus_epub.txt $DATA_DIR/book_corpus_txt.txt". I have one version of the bookcorpus data, but its format looks different from yours. Could you tell me whether there is any relationship between these two files, or is the whole bookcorpus data simply split across them?
@sowhatyc We crawled the book corpus ourselves. There are two source formats, txt and epub, and we save them separately.
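Since the two files are simply two parts of the same corpus, one way to treat them as a single stream is to read them one after another. The sketch below is only an illustration under that assumption: the helper names `iter_raw_lines` and `demo` are hypothetical, the file contents are placeholders written to a temporary directory, and how process.sh actually consumes `BOOK_RAW` is not shown in this thread.

```python
import os
import tempfile
from typing import Iterable, Iterator, List


def iter_raw_lines(paths: Iterable[str]) -> Iterator[str]:
    """Yield lines from each raw text file in turn, unchanged."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            yield from handle


def demo() -> List[str]:
    """Write two placeholder files and read them back as one stream."""
    with tempfile.TemporaryDirectory() as data_dir:
        # File names follow BOOK_RAW above; the contents are placeholders.
        names = ["book_corpus_epub.txt", "book_corpus_txt.txt"]
        paths = [os.path.join(data_dir, name) for name in names]
        for path, text in zip(paths, ["from epub\n", "from txt\n"]):
            with open(path, "w", encoding="utf-8") as handle:
                handle.write(text)
        return list(iter_raw_lines(paths))


print(demo())
```

With a single-file version of the corpus, the same helper works by passing a one-element list.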