europeana-bert
Massive German Text Corpus released
Hi @stefan-it
I just wanted to bring your attention to the release of "our" German colossal, cleaned Common Crawl corpus: https://german-nlp-group.github.io/projects/gc4-corpus.html
It is a massive (450 GB zipped) dataset based on Common Crawl with careful preprocessing and deduplication.
The main work was done by Philipp Reißel. Many thanks to iisys (the Institute of Information Systems at Hof University) for hosting this dataset.
Maybe you want to use it with your next models... ;-)
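In case it helps: the archives can be fetched straight from the URL lists linked on the project page. A minimal sketch (assuming wget and the gc4_corpus_head_urls.txt / gc4_corpus_middle_urls.txt files from the site):
# download all HEAD and MIDDLE archives; -c resumes interrupted downloads
wget -c -i gc4_corpus_head_urls.txt
wget -c -i gc4_corpus_middle_urls.txt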
Hi @PhilipMay ,
thanks for that hint! The corpus looks really interesting, and:
This preprocessing is filtering duplicates only inside the same dump. This step took approx. 50,000 CPU hours and 400 TB of network traffic to the common crawl s3 bucket.
is really awesome! I'll definitely work with this corpus in the near future :hugs:
Hi @PhilipMay ,
just one question: I've downloaded the HEAD and MIDDLE archives (using the URLs provided in gc4_corpus_head_urls.txt and gc4_corpus_middle_urls.txt).
However, du -sh shows "only" 418 GB in total. Can you confirm that, or how can I check whether something went wrong? Here's my ls -hl of all files:
:thinking:
Thanks!
Hmm - maybe 450 GB was a rather inaccurate estimate. What do you think @Phil1108 ? Or did the guys from iisys somehow lose files?
I would do two things: count the files and check that they match our number of links, and then use gzip to test the archives.
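For the first check, something along these lines should do (a sketch, assuming the archives and both URL list files sit in the current directory and follow the de_head_* / de_middle_* naming):
# number of downloaded archives vs. number of links in the URL lists
ls de_head_*.tar.gz de_middle_*.tar.gz | wc -l
cat gc4_corpus_head_urls.txt gc4_corpus_middle_urls.txt | wc -l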
The number of files is correct (I checked both *.txt files and the links on the website).
I will check the Content-Length header of the provided files now, e.g.:
curl -I https://opendata.iisys.de/systemintegration/Datasets/CommonCrawl/middle/de_middle_0000_2015-48.tar.gz
HTTP/1.1 200 OK
Date: Thu, 22 Apr 2021 06:32:46 GMT
Server: Apache/2.4.41 (Ubuntu)
Last-Modified: Sat, 31 Oct 2020 13:12:16 GMT
ETag: "65419493-5b2f7431a0c00"
Accept-Ranges: bytes
Content-Length: 1698796691
Content-Type: application/x-gzip
The Content-Length header returns the file size and is identical to the size on disk:
ls -l de_middle_0000_2015-48.tar.gz
-rw-r--r-- 1 stefan users 1698796691 Okt 31 14:12 de_middle_0000_2015-48.tar.gz
I'll report back if I find some broken tar archives 😅
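A quick way to test the archives without extracting them (a sketch; gzip -t only validates the compressed stream, not the tar structure inside):
for archive in de_*.tar.gz
do
  # -t tests the integrity of the gzip stream and reports corrupt files
  gzip -t "$archive" || echo "broken: $archive"
done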
With some bash magic:
for url in $(cat gc4_corpus_middle_urls.txt)
do
  # the file name is the 8th "/"-separated field of the URL
  filename=$(echo "$url" | cut -d "/" -f 8)
  # size of the downloaded file on disk
  disk_size=$(stat -c "%s" "$filename")
  # size announced by the server; strip the trailing \r from the CRLF header line
  download_size=$(curl --silent -I "$url" | grep "Content-Length:" | cut -d " " -f 2 | tr -d '\r')
  echo "$filename" "$disk_size" "$download_size"
done
Files for head and middle:
comparison_head.txt comparison_middle.txt
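To flag any mismatch automatically, a one-liner over these files is enough (assuming the three-column filename / disk size / download size format written by the loop above):
# print every file whose size on disk differs from the announced Content-Length
awk '$2 != $3 {print "mismatch:", $1}' comparison_head.txt comparison_middle.txt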
So it turns out that all downloaded files have exactly the size given by their Content-Length header :hugs:
And I calculated the total number of downloaded bytes: 448598516042
-> which is pretty close to 450 GB then 😅
More precisely: 194227285957 (HEAD) + 254371230085 (MIDDLE) = 448598516042 bytes in total.
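For reference, summing the on-disk sizes from the comparison files should give the same number, and it also explains the 418 "GB" from du -sh, which actually reports GiB (powers of 1024):
# total bytes, in decimal GB and binary GiB (448598516042 B ≈ 448.6 GB ≈ 417.8 GiB)
awk '{sum += $2} END {printf "%d bytes = %.1f GB = %.1f GiB\n", sum, sum/1e9, sum/(1024^3)}' comparison_head.txt comparison_middle.txt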
So I guess everything was ok! Thanks for providing this massive corpus, I will extract all archives now :)
Good luck and thanks for reporting back.
@stefan-it Yeah, sorry, that was the usual 1000 vs. 1024 issue; I edited that in the Readme.
I usually never extract the data, to keep disk usage low. I just added another section here https://german-nlp-group.github.io/projects/gc4-corpus.html#necessary-steps-before-usage with a short gist linked for custom filtering.
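For example, the archives can be streamed instead of unpacked, roughly like this (a sketch; the actual filtering lives in the linked gist, and an extra decompression step may be needed if the tar members are themselves compressed):
for archive in de_*.tar.gz
do
  # -O pipes the member contents to stdout instead of writing them to disk
  tar -xzOf "$archive" | head -n 5   # replace 'head' with your own filter
done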
Hi @PhilipMay and @Phil1108 ,
thanks again for providing the corpus (and the cool filtering script). I've trained an ELECTRA model (with a larger subword vocab than usual; a 32k-vocab version is coming this week or next week).
I've done some preliminary experiments (GermEval 2014 and 2018) and the results are better than GELECTRA (base). Here's the repo with all 11 checkpoints (saved every 100k steps, for a model trained for 1M steps in total):
https://github.com/stefan-it/gc4lm
(Spoiler: the 900k checkpoint works best for NER in my experiments 😅)