
Any plans to release WebText corpus?

Open nmstoker opened this issue 6 years ago • 10 comments

I've seen #16 and appreciate the valid concerns raised about releasing the model, but the WebText corpus could be a tremendous help to general research if you were able to release it.

Are there plans to do so?

I did wonder if this might simply enable people to recreate the unreleased GPT-2, but presumably that is no trivial matter, needing expertise and time/resources, thus deterring the casual mischief maker!

Anyway, whatever you end up doing, I wanted to thank you for what you have released already which is really interesting 🙂

nmstoker avatar Feb 15 '19 13:02 nmstoker

@nmstoker according to the paper, they are using https://github.com/codelucas/newspaper to grab news data. By the way, there are several news corpora out there; I'm not sure what is actually new in this specific corpus.
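
For example, grabbing the text of a single article with newspaper looks roughly like this (the URL is just a placeholder):

```python
from newspaper import Article

url = "https://example.com/some-news-article"  # placeholder URL

article = Article(url)
article.download()  # fetch the raw HTML
article.parse()     # extract title, authors, body text, etc.

print(article.title)
print(article.text)
```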

loretoparisi avatar Feb 15 '19 19:02 loretoparisi

Thank you, yes I saw that (and have used it myself before). Perhaps I'm mistaken, but I didn't understand that they were using it for news specifically, as you imply. The point is that it's quite an undertaking to gather 40 GB of high-quality, diverse language content, and it would save everyone repeating this exercise (similar motivations are behind Common Crawl).

nmstoker avatar Feb 15 '19 23:02 nmstoker

@nmstoker yes, it can be tricky. For Wikipedia it's easier: I typically use the Facebook script from fastText, which extracts the dump for a given language, normalizes the text, and stores it as a plain text file (from the original XML file).
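
The idea is just to strip markup and normalize everything down to plain text. A rough Python sketch of that kind of normalization (not the actual fastText script, which is Perl and does a bit more):

```python
import re

def normalize(text: str) -> str:
    """Roughly approximate the fastText-style Wikipedia cleanup."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)      # strip residual XML/HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # keep alphanumerics only
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(normalize("<doc id='1'>Hello,   World! 42</doc>"))  # -> "hello world 42"
```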

loretoparisi avatar Feb 16 '19 15:02 loretoparisi

Would you be willing to provide further details about the heuristic-based cleaning and deduplication from the paper? Specifically, what heuristics were used, and what method of deduplication?

Skylion007 avatar Mar 05 '19 18:03 Skylion007

We went ahead and did our best to replicate it: https://skylion007.github.io/OpenWebTextCorpus/
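
To give an idea of what content-level deduplication can look like, here is a minimal sketch of MinHash/LSH near-duplicate filtering using the datasketch library (the threshold and the document list are illustrative, not necessarily what was used):

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from whitespace tokens."""
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over a lazy dog",  # near-duplicate
    "an entirely different document about corpora",
]

# Keep a document only if no near-duplicate is already indexed.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
unique_docs = []
for i, doc in enumerate(docs):
    mh = minhash(doc)
    if not lsh.query(mh):
        lsh.insert(f"doc-{i}", mh)
        unique_docs.append(doc)

print(len(unique_docs))  # 2: the near-duplicate is dropped
```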

Skylion007 avatar May 02 '19 01:05 Skylion007

Does the WebText corpus include patent data, say data from the USPTO or Google Patents?

leejason avatar May 06 '19 08:05 leejason

Any news on this?

aliakhtar avatar Mar 11 '20 17:03 aliakhtar

We found a good alternative: Inria's OSCAR corpus.

loretoparisi avatar Mar 12 '20 10:03 loretoparisi

Looks interesting @loretoparisi.

Am guessing even with compression those download files will be huge.

How have you found downloading them - did it take days?

nmstoker avatar Mar 12 '20 13:03 nmstoker

@nmstoker yes, we used OSCAR to train a BERT-based LM: https://github.com/musixmatchresearch/umberto. The Italian portion alone is about 69 GB to download, so you definitely need a fiber connection.
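
At that size you also want to stream the download to disk rather than hold it in memory. A minimal sketch with requests (the URL is a placeholder; the real dumps are linked from the OSCAR site):

```python
import requests

url = "https://example.org/oscar/it.txt.gz"  # placeholder, not the real dump URL

with requests.get(url, stream=True, timeout=60) as r:
    r.raise_for_status()
    with open("oscar_it.txt.gz", "wb") as f:
        for chunk in r.iter_content(chunk_size=1 << 20):  # write in 1 MiB chunks
            f.write(chunk)
```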

loretoparisi avatar Mar 13 '20 11:03 loretoparisi