cc_net

403 forbidden while downloading

Open Raven-Ren opened this issue 2 years ago • 2 comments

Hi there, I encountered a 403 error while trying to download ccnet data using this pipeline. Is this because of the network settings on my side, or is something wrong with the pipeline? Thanks in advance.

/ldap_home/raven.ren/cc_net/cc_net/flat_hash_set.py:115: UserWarning: Module 'getpy' not found. Deduplication will take more RAM. Try pip install cc_net[getpy]
  warnings.warn(
2022-08-23 19:25 INFO 6898:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/wet.paths.gz
2022-08-23 19:25 INFO 6898:HashesCollector - Processed 0 documents in 0.00034h ( 0.0 doc/s).
2022-08-23 19:25 INFO 6898:HashesCollector - Found 0k unique hashes over 0k lines. Using 0.1GB of RAM.
submitit ERROR (2022-08-23 19:25:23,974) - Submitted job triggered an exception
2022-08-23 19:25 ERROR 6898:submitit - Submitted job triggered an exception
Traceback (most recent call last):
  File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/submitit/core/submission.py", line 72, in submitit_main
    process_job(args.folder)
  File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/submitit/core/submission.py", line 65, in process_job
    raise error
  File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/submitit/core/submission.py", line 54, in process_job
    result = delayed.result()
  File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/submitit/core/utils.py", line 133, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/ldap_home/raven.ren/cc_net/cc_net/mine.py", line 273, in _hashes_shard
    jsonql.run_pipes(
  File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 455, in run_pipes
    write_jsons(data, output)
  File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 496, in write_jsons
    for res in source:
  File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 284, in map
    for x in source:
  File "/ldap_home/raven.ren/cc_net/cc_net/process_wet_file.py", line 195, in __iter__
    n = len(self.segments)
  File "/ldap_home/raven.ren/cc_net/cc_net/process_wet_file.py", line 243, in segments
    segments = cc_segments(self.dump, self.cache_dir)
  File "/ldap_home/raven.ren/cc_net/cc_net/process_wet_file.py", line 38, in cc_segments
    f = jsonql.open_remote_file(wet_paths, cache=wet_paths_cache)
  File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 1124, in open_remote_file
    raw_bytes = request_get_content(url)
  File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 1101, in request_get_content
    raise e
  File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 1095, in request_get_content
    r.raise_for_status()
  File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/wet.paths.gz

Raven-Ren, Aug 24 '22 03:08

Just update the line https://github.com/facebookresearch/cc_net/blob/main/cc_net/process_wet_file.py#L23 with this URI:

WET_URL_ROOT = "https://data.commoncrawl.org"
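
If you want to check the new root before rerunning the whole pipeline, a small sketch like the one below may help. It rebuilds the `wet.paths.gz` URL in the same shape as the one in the traceback and sends a HEAD request to it. The helper names `wet_paths_url` and `probe` are made up for illustration and are not part of cc_net, and the dump id `2017-51` is just the one from the log above.

```python
import urllib.error
import urllib.request

# Old root as seen in the traceback; new root as suggested in this comment.
OLD_ROOT = "https://commoncrawl.s3.amazonaws.com"
NEW_ROOT = "https://data.commoncrawl.org"


def wet_paths_url(dump: str, root: str = NEW_ROOT) -> str:
    """Hypothetical helper: build the wet.paths.gz URL for a dump id such as '2017-51'.

    The path layout is copied from the URL shown in the traceback.
    """
    return "/".join([root.rstrip("/"), "crawl-data", "CC-MAIN-" + dump, "wet.paths.gz"])


def probe(url: str, timeout: float = 10.0) -> int:
    """Hypothetical helper: return the HTTP status code of a HEAD request (needs network)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is still what we want to report.
        return err.code


if __name__ == "__main__":
    for root in (OLD_ROOT, NEW_ROOT):
        url = wet_paths_url("2017-51", root)
        print(url, probe(url))
```

A 403 for the old root alongside a successful status for the new one would suggest the problem is the URL rather than your local network settings.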

shmpanski, Oct 07 '22 09:10

Thanks shmpanski, that fixes the 403 error, but now there is another problem: all the jobs failed while running _hashes_shard.

s-zx, Nov 17 '22 09:11