datatrove boto timeout when I read CC with Warc

Open: marcopasqua opened this issue 7 months ago • 1 comment

Hi,

I have implemented a pipeline to process Common Crawl (CC) data, similar to the FineWeb example in the examples folder. The main issue is that reading WARC files from CC sometimes fails with a connection timeout, which stops the whole run.

Here is the error message I receive:

  File "/opt/conda/lib/python3.10/site-packages/aiobotocore/httpsession.py", line 259, in send
    raise ConnectTimeoutError(endpoint_url=request.url, error=e)
botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://commoncrawl.s3.us-east-1.amazonaws.com/crawl-data/CC-MAIN-2023-50/segments/1700679518883.99/warc/CC-MAIN-20231211210408-20231212000408-00000.warc.gz"

Is it possible to change some parameters to mitigate this problem?

Thanks!

Jul 17 '24 14:07 marcopasqua
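
For anyone hitting the same error, here is a rough sketch of the two usual mitigations: longer timeouts with more retries at the S3 client level, and a retry with exponential backoff around the read itself. The keys in `S3_CONFIG_KWARGS` mirror parameters of `botocore.config.Config`, but the values are guesses, and `retry_with_backoff` is a hypothetical helper, not part of datatrove. How the configured filesystem gets handed to the Warc reader depends on the datatrove version and is not shown here.

```python
import random
import time

# Assumed values, not tuned: these keys mirror botocore.config.Config parameters.
S3_CONFIG_KWARGS = {
    "connect_timeout": 60,   # seconds to wait for the connection to open
    "read_timeout": 120,     # seconds to wait for data on an open connection
    "retries": {"max_attempts": 10, "mode": "adaptive"},
}

# With s3fs installed, the dict could be passed through like this:
#   import s3fs
#   fs = s3fs.S3FileSystem(config_kwargs=S3_CONFIG_KWARGS)


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func() and retry on the given exceptions with exponential backoff.

    Hypothetical helper for illustration; the last failure is re-raised
    once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise
            # Double the delay on each attempt, capped at max_delay,
            # plus some jitter so parallel workers do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

In practice `retry_on` would be narrowed to the timeout exceptions from the traceback (for example `botocore.exceptions.ConnectTimeoutError`), so that unrelated bugs still fail fast.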