What's the format of the raw data?

Open Howal opened this issue 4 years ago • 4 comments

Nice work!

What is the format of the raw data (wiki and bc)? Is it one sentence per line, with an empty line between different articles?

It would be great if you could share the two raw files mentioned in ./preprocess/pretrain/process.sh.

Howal avatar Aug 03 '20 09:08 Howal

@Howal It is the raw text format. The wiki data is the output of wikiextractor. We don't handle the newline token specially; we just keep it as it is. The first lines of the wiki data look like this: (image)

Unfortunately, we cannot distribute the data because of the license restrictions on the Book Corpus data. The wiki data is easy to download yourself.

guolinke avatar Aug 03 '20 12:08 guolinke
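
Since the screenshot of the first lines is not preserved here, one way to check your own raw file against the description above is to look at its first lines directly. The following is a minimal sketch, not part of the repository: `peek_lines` and `summarize` are hypothetical helper names, and the check for lines starting with `<doc ` assumes WikiExtractor's default output layout, which may or may not match the authors' file.

```python
from itertools import islice


def peek_lines(path, n=10, encoding="utf-8"):
    """Return the first n lines of a text file with the trailing newline removed."""
    with open(path, encoding=encoding) as f:
        return [line.rstrip("\n") for line in islice(f, n)]


def summarize(lines):
    """Count blank lines and WikiExtractor-style '<doc ...>' header lines."""
    return {
        "total": len(lines),
        "blank": sum(1 for line in lines if not line.strip()),
        # Assumption: WikiExtractor's default layout wraps each article
        # in <doc ...> ... </doc> tags.
        "doc_headers": sum(1 for line in lines if line.startswith("<doc ")),
    }
```

Printing `peek_lines("wiki.txt")` (the file name is a placeholder) shows whether a file uses one sentence per line, blank lines between paragraphs, or article markers.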

Thank you!

Howal avatar Aug 04 '20 04:08 Howal

Hi @guolinke, in ./preprocess/pretrain/process.sh I saw that the bookcorpus data is stored in two files: BOOK_RAW="$DATA_DIR/book_corpus_epub.txt $DATA_DIR/book_corpus_txt.txt". I have one version of the bookcorpus data, but its format looks different from yours. Could you tell me whether there is any relationship between these two files, or whether the whole bookcorpus data is simply split across them?

sowhatyc avatar May 31 '21 10:05 sowhatyc

@sowhatyc We crawled the Book Corpus ourselves. The books come in two formats, txt and epub, and we save them separately.

guolinke avatar Jun 02 '21 02:06 guolinke
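
If the two book files are simply two parts of one corpus, as the reply above suggests, they can be read as a single stream of lines. This is a minimal sketch under that assumption; `iter_corpus_lines` is a hypothetical helper, and it does not reproduce how process.sh actually consumes `BOOK_RAW`.

```python
def iter_corpus_lines(paths, encoding="utf-8"):
    """Yield lines from each file in order, as if the files were one corpus."""
    for path in paths:
        with open(path, encoding=encoding) as f:
            for line in f:
                # Keep blank lines: the thread says newlines are left as they are.
                yield line.rstrip("\n")
```

For example, `iter_corpus_lines(["book_corpus_epub.txt", "book_corpus_txt.txt"])` would yield every line of the epub-derived file followed by every line of the txt-derived file.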