cc_net
Numerous Errors
Hello,
Thank you for all of your great work. I am trying to just download and process the English dumps from CommonCrawl up to 2023. I have been running into multiple errors.
It seems that the base URL for downloading from CommonCrawl has changed to:
https://data.commoncrawl.org/
Some of the header names were changed as well. This fixed those errors:
headers_map = {}
for header in headers[1:]:
    if not header:
        continue
    key, value = header.split(": ", 1)
    headers_map[key] = value
warc_type = headers_map["WARC-Type"]
if warc_type != "conversion":
    return None
url = headers_map["WARC-Target-URI"]
date = headers_map["WARC-Date"]
digest = headers_map["WARC-Block-Digest"]
length = int(headers_map["Content-Length"])
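The snippet above is a fragment from inside a function (note the bare `return None`). For anyone who wants to try the logic in isolation, here is a minimal self-contained sketch. The function name `parse_wet_headers`, the keys of the returned dict, and the sample header values are made up for illustration and are not cc_net's actual API.

```python
from typing import Dict, List, Optional


def parse_wet_headers(headers: List[str]) -> Optional[Dict[str, object]]:
    """Hypothetical helper: parse the header lines of a single WET record.

    Returns None for records that are not of type "conversion"
    (for example the leading "warcinfo" record).
    """
    headers_map: Dict[str, str] = {}
    # headers[0] is assumed to be the version line (e.g. "WARC/1.0"), so skip it.
    for header in headers[1:]:
        # Skip blank lines and anything that is not "Key: value".
        if not header or ": " not in header:
            continue
        key, value = header.split(": ", 1)
        headers_map[key] = value

    if headers_map.get("WARC-Type") != "conversion":
        return None

    return {
        "url": headers_map["WARC-Target-URI"],
        "date": headers_map["WARC-Date"],
        "digest": headers_map["WARC-Block-Digest"],
        "length": int(headers_map["Content-Length"]),
    }


# Illustrative header block; the values are invented, not taken from a real dump.
sample = [
    "WARC/1.0",
    "WARC-Type: conversion",
    "WARC-Target-URI: http://example.com/",
    "WARC-Date: 2023-03-20T08:35:13Z",
    "WARC-Block-Digest: sha1:EXAMPLEDIGEST",
    "Content-Type: text/plain",
    "Content-Length: 42",
    "",
]
record = parse_wet_headers(sample)
```

Looking keys up by name instead of by position means the parsing keeps working if the order of the header lines changes between crawls.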
Finally, I am running into this other issue:
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-14/segments/1679296943471.24/wet/CC-MAIN-20230320083513-20230320113513-00114.warc.wet.gz
I have not been able to resolve this error yet.
Any help would be greatly appreciated.
Thank you,
Enrico
I have a similar problem; it may be caused by sending too many requests. I got a 'slow down' message when I opened the URL from the error in my browser.
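If the 503 really is rate limiting, one thing worth trying is to retry the download with exponential backoff instead of failing on the first error. Below is a minimal sketch of that idea, not a tested fix for cc_net. It uses only the standard library so it runs on its own; `fetch_with_backoff`, `RETRYABLE`, and all parameter names and defaults are hypothetical, and the same loop could be wrapped around the `requests` call from the traceback.

```python
import random
import time
import urllib.error
import urllib.request

# Status codes treated as temporary here; this set is an assumption.
RETRYABLE = {429, 500, 502, 503, 504}


def fetch_with_backoff(
    url,
    max_tries=6,
    base_delay=1.0,
    opener=urllib.request.urlopen,
    sleep=time.sleep,
):
    """Hypothetical helper: download `url`, retrying on temporary HTTP errors.

    The wait doubles after each failed attempt (1s, 2s, 4s, ...) plus a
    little random jitter so parallel workers do not retry in lockstep.
    """
    for attempt in range(max_tries):
        try:
            with opener(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Give up on non-temporary errors or once the tries are used up.
            if err.code not in RETRYABLE or attempt == max_tries - 1:
                raise
            delay = base_delay * 2 ** attempt + random.uniform(0, base_delay)
            sleep(delay)
```

Reducing the number of parallel download workers may help as well, since each worker adds to the request rate seen by the server.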
I am trying to download the dataset to reproduce the results from the Toolformer paper. I have been struggling with this dataset for a while. Did you manage to solve the issue and get the data? Maybe by manually downloading the data, and skipping that step of the pipeline? @conceptofmind I am actually using your Toolformer repo for my research, thanks for that :)