Any plans to release WebText corpus?
I've seen #16 and appreciate the valid concerns raised about releasing the model, but the WebText corpus could be a tremendous help to general research if you were able to release it.
Are there plans to do so?
I did wonder if this might simply enable people to recreate the unreleased GPT-2, but presumably that is no trivial matter, needing expertise and time/resources, thus deterring the casual mischief maker!
Anyway, whatever you end up doing, I wanted to thank you for what you have released already which is really interesting 🙂
@nmstoker according to the paper, they are using https://github.com/codelucas/newspaper to grab news data. By the way, there are several news corpora out there already; I'm not sure what is actually new in this specific corpus.
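For reference, extraction with that library is straightforward; a minimal sketch (the URL below is just a placeholder):

```python
# Minimal sketch of article extraction with the newspaper library
# (pip install newspaper3k). The URL is a placeholder.
from newspaper import Article

url = "https://example.com/some-article"
article = Article(url)
article.download()   # fetch the raw HTML
article.parse()      # extract title, authors, and body text

print(article.title)
print(article.text[:500])  # first 500 characters of the extracted body
```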
Thank you, yes I saw that (and have used it myself before). Perhaps I'm mistaken, but I didn't understand them to be using it for news specifically, as you imply. However, the point is that it's quite an undertaking to gather 40 GB of high-quality, diverse language content, and releasing it would save everyone repeating the exercise (similar motivations are behind Common Crawl).
@nmstoker yes, it can be tricky. For Wikipedia it's easier; I typically use the Facebook script from fastText, which extracts the dump in a given language, normalizes the text, and stores it as a plain text file (from the original XML file).
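Roughly, the normalization step looks something like this in Python (a loose approximation of what fastText's wikifil.pl does; the dump filename is a placeholder, and the real script does more careful filtering):

```python
# Rough sketch: strip wiki/XML markup from a Wikipedia dump and write
# normalized plain text, one line per input line. Approximates the
# fastText/wikifil.pl pipeline; filenames are placeholders.
import re

def normalize(line: str) -> str:
    line = re.sub(r"<[^>]*>", " ", line)       # drop XML/HTML tags
    line = re.sub(r"\[\[[^]|]*\|", "", line)   # [[target|text]] -> text
    line = line.replace("[[", "").replace("]]", "")
    line = re.sub(r"[^a-zA-Z]+", " ", line)    # keep letters only
    return line.lower().strip()

with open("enwiki-latest-pages-articles.xml", encoding="utf-8") as src, \
     open("wiki.txt", "w", encoding="utf-8") as dst:
    for raw in src:
        text = normalize(raw)
        if text:
            dst.write(text + "\n")
```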
Would you be willing to provide further details about the heuristic based cleaning and deduplication from the paper? Specifically what heuristics were used and what method of deduplication was used?
We went ahead and did our best to replicate it: https://skylion007.github.io/OpenWebTextCorpus/
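The paper doesn't spell out the exact deduplication method, so we can't claim to match it; a common baseline for WebText-style cleaning is document-level deduplication by a hash of normalized content, something like:

```python
# Sketch of document-level deduplication by content hash. OpenAI has not
# published their exact method; this is just the simplest reasonable
# baseline, where trivially different copies collide after normalization.
import hashlib

def content_key(text: str) -> str:
    # Collapse whitespace and case so near-identical copies hash equal.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(docs):
    seen = set()
    for doc in docs:
        key = content_key(doc)
        if key not in seen:
            seen.add(key)
            yield doc

docs = ["Hello  World", "hello world", "Something else"]
print(list(deduplicate(docs)))  # ["Hello  World", "Something else"]
```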
Does the WebText corpus include patent data, say data from the USPTO or Google Patents?
Any news on this?
We have found a good alternative in Inria's OSCAR corpus.
Looks interesting @loretoparisi.
I'm guessing that even with compression those download files will be huge.
How have you found downloading them? Did it take days?
@nmstoker yes, we have used OSCAR to train a BERT-based LM: https://github.com/musixmatchresearch/umberto. The download is heavy; for Italian alone it's 69 GB, so you definitely need a fiber connection.
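If the full download is too heavy, one option (assuming the `oscar` dataset on the Hugging Face Hub and its `unshuffled_deduplicated_it` config; adjust per language) is to stream it with the datasets library instead of fetching the whole archive up front:

```python
# Sketch: stream the Italian portion of OSCAR via Hugging Face datasets
# (pip install datasets) rather than downloading the full 69 GB archive.
# The config name is an assumption based on the Hub listing.
from datasets import load_dataset

stream = load_dataset("oscar", "unshuffled_deduplicated_it",
                      split="train", streaming=True)

for i, example in enumerate(stream):
    print(example["text"][:100])  # first 100 chars of each document
    if i >= 2:                    # stop after a few examples
        break
```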