TUPE What's the format of the raw data?


Open Howal opened this issue 4 years ago • 4 comments

Nice work!

What is the format of the raw data (wiki and bc)? Is it one sentence per line, with an empty line between articles?

It would be great if you could share the two raw files mentioned in ./preprocess/pretrain/process.sh.

Aug 03 '20 09:08 Howal

@Howal It is raw text. The wiki data is the output of wikiextractor. We don't handle newlines specially; we keep them as they are. The first lines of wiki data.

Unfortunately, we cannot distribute the data because of the license issue with the Book Corpus data. The wiki data is easy to download yourself.

Aug 03 '20 12:08 guolinke
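For readers who have not seen wikiextractor output before, here is a minimal Python sketch of how such a file could be walked article by article. It is not taken from the TUPE scripts. It assumes wikiextractor's default plain-text layout, in which each article is wrapped in a `<doc id="..." url="..." title="...">` line and a closing `</doc>` line; the function name `iter_articles` and the sample text are made up for illustration. Lines inside an article, including blank ones, are kept as they are.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Assumption: each article starts with a <doc ... title="..."> line and ends
# with a </doc> line, as in wikiextractor's default plain-text output.
DOC_OPEN = re.compile(r'<doc\b[^>]*\btitle="([^"]*)"[^>]*>')


def iter_articles(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (title, body_lines) for every <doc>...</doc> block (hypothetical helper)."""
    title = None
    body: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = DOC_OPEN.match(line)
        if match:
            # A new article begins; start collecting its lines.
            title, body = match.group(1), []
        elif line.strip() == "</doc>":
            # The article ends; emit it and reset the state.
            if title is not None:
                yield title, body
            title, body = None, []
        elif title is not None:
            # Keep the line untouched, blank lines included.
            body.append(line)


# Made-up sample text in the assumed layout.
sample = (
    '<doc id="1" url="https://example.org/wiki?curid=1" title="Example A">\n'
    "Example A\n"
    "\n"
    "First paragraph of the article.\n"
    "</doc>\n"
    '<doc id="2" url="https://example.org/wiki?curid=2" title="Example B">\n'
    "Example B\n"
    "\n"
    "Another paragraph.\n"
    "</doc>\n"
)

articles = list(iter_articles(sample.splitlines(keepends=True)))
for article_title, article_body in articles:
    print(article_title, len(article_body))
```

If your wikiextractor version was run with different options (for example JSON output), the wrapper lines will look different and the pattern above would need adjusting.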

Thank you!

Aug 04 '20 04:08 Howal

Hi @guolinke, in ./preprocess/pretrain/process.sh I saw that the bookcorpus data is stored in two files: BOOK_RAW="$DATA_DIR/book_corpus_epub.txt $DATA_DIR/book_corpus_txt.txt". I have one version of the bookcorpus data, but its format looks different from yours. Could you tell me whether there is any relationship between these two files, or is the whole bookcorpus simply split across them?

May 31 '21 10:05 sowhatyc

@sowhatyc We crawled the book corpus ourselves. There are two formats, txt and epub, and we save them separately.

Jun 02 '21 02:06 guolinke
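Since BOOK_RAW holds two space-separated paths, the two files can be treated as one corpus by reading them in order. The Python sketch below illustrates that idea only; it does not reproduce what process.sh actually does with the variable. The helper name `iter_corpus_lines` and the temporary stand-in files are hypothetical.

```python
import os
import tempfile
from typing import Iterator


def iter_corpus_lines(raw_paths: str) -> Iterator[str]:
    """Yield lines from every file named in a space-separated path string.

    Hypothetical helper: it mirrors the shape of BOOK_RAW, not the logic
    of process.sh. Paths containing spaces are not supported.
    """
    for path in raw_paths.split():
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                yield line


# Stand-in files with made-up content, so the sketch runs on its own.
data_dir = tempfile.mkdtemp()
epub_path = os.path.join(data_dir, "book_corpus_epub.txt")
txt_path = os.path.join(data_dir, "book_corpus_txt.txt")
with open(epub_path, "w", encoding="utf-8") as handle:
    handle.write("line from the epub crawl\n")
with open(txt_path, "w", encoding="utf-8") as handle:
    handle.write("line from the txt crawl\n")

book_raw = f"{epub_path} {txt_path}"
corpus_lines = list(iter_corpus_lines(book_raw))
print(len(corpus_lines))
```

If you only have a single bookcorpus dump, pointing the variable at that one file should be equivalent from this sketch's point of view, though that is an assumption about the script rather than something confirmed in this thread.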
