datatrove
boto timeout when reading CC WARC files
Hi,
I have implemented a pipeline to process Common Crawl (CC) data, similar to the FineWeb example in the examples folder. The main issue is that, when reading files from CC, the connection sometimes times out, which stops the whole execution.
Here is the error message I receive:
File "/opt/conda/lib/python3.10/site-packages/aiobotocore/httpsession.py", line 259, in send
raise ConnectTimeoutError(endpoint_url=request.url, error=e)
botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://commoncrawl.s3.us-east-1.amazonaws.com/crawl-data/CC-MAIN-2023-50/segments/1700679518883.99/warc/CC-MAIN-20231211210408-20231212000408-00000.warc.gz"
Is it possible to change some parameters to mitigate this problem?
Thanks!
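For reference, one generic way to tolerate intermittent connect timeouts is to retry the failing read with exponential backoff. The snippet below is only a minimal sketch using the standard library: `retry_with_backoff` and its parameters are hypothetical names, not part of datatrove, and where such a wrapper could be hooked into the pipeline is not something I have verified.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(); on a retryable error, wait and try again.

    Hypothetical helper, not a datatrove API. For the traceback above you
    would likely pass retry_on=(botocore.exceptions.ConnectTimeoutError,),
    since that class does not derive from the built-in ConnectionError.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff (1s, 2s, 4s, ...) capped at max_delay
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add up to 50% jitter so parallel workers do not retry in lockstep
            sleep(delay + random.uniform(0, delay / 2))
```

Usage would look like `data = retry_with_backoff(lambda: read_one_file(path))`, where `read_one_file` stands in for whatever call actually opens the WARC file. Separately, botocore's `Config` accepts `connect_timeout`, `read_timeout`, and `retries`, and s3fs forwards such settings through its `config_kwargs` argument; whether and how these can be passed through the reader here is an assumption that would need checking.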