
Massive German Text Corpus released

Open PhilipMay opened this issue 3 years ago • 9 comments

Hi @stefan-it

I just wanted to bring your attention to the release of "our" German colossal, cleaned Common Crawl corpus: https://german-nlp-group.github.io/projects/gc4-corpus.html

It is a massive (450 GB zipped) dataset based on Common Crawl with careful preprocessing and deduplication.

The main work was done by Philipp Reißel. Many thanks to iisys (the Institute of Information Systems at Hof University) for hosting this dataset.

Maybe you want to use it with your next models... ;-)

PhilipMay avatar Apr 18 '21 15:04 PhilipMay

Hi @PhilipMay ,

thanks for that hint! The corpus looks really interesting, and:

This preprocessing is filtering duplicates only inside the same dump. This step took approx. 50,000 CPU hours and 400 TB of network traffic to the common crawl s3 bucket.

is really awesome! I'll definitely work with this corpus in the near future :hugs:

stefan-it avatar Apr 19 '21 07:04 stefan-it

Hi @PhilipMay ,

just one question: I've downloaded the HEAD and MIDDLE archives (using the URLs provided in gc4_corpus_head_urls.txt and gc4_corpus_middle_urls.txt). However, a du -sh shows "only" 418GB in total. Can you confirm that, or how can I check if something went wrong? Here's my ls -hl of all files:

listing.txt

:thinking:

Thanks!

stefan-it avatar Apr 22 '21 06:04 stefan-it

Hmm - maybe 450 GB was a rather inaccurate estimate. What do you think @Phil1108 ? Or did the guys from iisys somehow lose files?

I would do two things: count the files and check that they match our number of links, and then use gzip to test the archives.
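
The gzip test could look roughly like this (a minimal sketch, assuming all downloaded archives sit in the current directory):

for f in *.tar.gz
do
  # -t only tests the integrity of the compressed data, nothing is extracted
  gzip -t "$f" || echo "broken: $f"
done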

PhilipMay avatar Apr 22 '21 06:04 PhilipMay

The number of files is correct (I checked both *.txt files and the links on the website).

I will now check the Content-Length header of the provided files, e.g.:

curl -I https://opendata.iisys.de/systemintegration/Datasets/CommonCrawl/middle/de_middle_0000_2015-48.tar.gz
HTTP/1.1 200 OK
Date: Thu, 22 Apr 2021 06:32:46 GMT
Server: Apache/2.4.41 (Ubuntu)
Last-Modified: Sat, 31 Oct 2020 13:12:16 GMT
ETag: "65419493-5b2f7431a0c00"
Accept-Ranges: bytes
Content-Length: 1698796691
Content-Type: application/x-gzip

The Content-Length header reports the file size, and it is identical to the size on disk:

ls -l de_middle_0000_2015-48.tar.gz
-rw-r--r-- 1 stefan users 1698796691 Okt 31 14:12 de_middle_0000_2015-48.tar.gz

I'll report back if I find some broken tar archives 😅

stefan-it avatar Apr 22 '21 06:04 stefan-it

With some bash magic:

for url in $(cat gc4_corpus_middle_urls.txt)
do
  # The filename is the 8th "/"-separated field of the download URL.
  filename=$(echo "$url" | cut -d "/" -f 8)
  # Size of the downloaded file on disk, in bytes.
  disk_size=$(stat -c "%s" "$filename")
  # Size reported by the server; tr strips the trailing CR from the HTTP header line.
  download_size=$(curl --silent -I "$url" | grep "Content-Length:" | cut -d " " -f 2 | tr -d '\r')
  echo "$filename" "$disk_size" "$download_size"
done

Files for head and middle:

comparison_head.txt comparison_middle.txt

So it turns out that all downloaded files have exactly the size given by their Content-Length header :hugs:

stefan-it avatar Apr 22 '21 07:04 stefan-it

And I calculated the total number of downloaded bytes: 448598516042, which is pretty close to 450 GB 😅

More precisely: 194227285957 (HEAD) + 254371230085 (MIDDLE) = 448598516042 in total.
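
That total can also be recomputed from the comparison files above; a quick sketch, assuming the second column holds the on-disk size in bytes (as produced by the loop):

awk '{ total += $2 } END { print total }' comparison_head.txt comparison_middle.txt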

So I guess everything was ok! Thanks for providing this massive corpus, I will extract all archives now :)

stefan-it avatar Apr 22 '21 07:04 stefan-it

Good luck and thanks for reporting back.

PhilipMay avatar Apr 22 '21 07:04 PhilipMay

@stefan-it Yeah, sorry, that was the usual 1000 vs. 1024 issue; I've edited that in the Readme.
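
To make the units explicit, a quick check with GNU coreutils' numfmt (assuming it is installed):

numfmt --to=iec 448598516042   # 1024-based units: ~418G, what du -sh reports
numfmt --to=si  448598516042   # 1000-based units: ~449G, the "450 GB" figure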

I usually never extract the data, to keep disk usage low. I've just added another subtopic here https://german-nlp-group.github.io/projects/gc4-corpus.html#necessary-steps-before-usage with a short gist linked for custom filtering.
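
As a minimal sketch of that idea (not the linked gist, and assuming the archives contain plain-text files), the contents can be streamed straight out of an archive without writing the extracted files to disk:

tar -xzOf de_middle_0000_2015-48.tar.gz | head   # -O streams file contents to stdout instead of extracting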

Phil1108 avatar Apr 22 '21 11:04 Phil1108

Hi @PhilipMay and @Phil1108 ,

thanks again for providing the corpus (and the cool filtering script). I've trained an ELECTRA model (with a larger subword vocab than usual; a 32k version is coming this week or next week).

I've done some preliminary experiments (GermEval 2014 and 2018) and the results are better than GELECTRA (base). Here's the repo with all 11 checkpoints (one every 100k steps, for a model trained for 1M steps in total):

https://github.com/stefan-it/gc4lm

(Spoiler: the 900k checkpoint works best for NER in my experiments 😅)

stefan-it avatar May 02 '21 11:05 stefan-it