Andrew Sokolov

3 comments by Andrew Sokolov

Duplicate of #44

Just update the line https://github.com/facebookresearch/cc_net/blob/main/cc_net/process_wet_file.py#L23 with this URI:

```python
WET_URL_ROOT = "https://data.commoncrawl.org"
```
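For illustration, here is a minimal sketch of how a root like this is typically joined into a full download URL. The helper name `wet_paths_url` and the dump ID `2019-09` are hypothetical, and the `crawl-data/CC-MAIN-<dump>/wet.paths.gz` layout is an assumption based on Common Crawl's public file layout, not something confirmed by the comment above.

```python
# Root taken from the comment above.
WET_URL_ROOT = "https://data.commoncrawl.org"


def wet_paths_url(dump_id: str) -> str:
    """Build the URL of the WET path listing for one crawl (hypothetical helper).

    Assumes the layout <root>/crawl-data/CC-MAIN-<dump_id>/wet.paths.gz.
    """
    return "/".join([WET_URL_ROOT, "crawl-data", "CC-MAIN-" + dump_id, "wet.paths.gz"])


# Example dump ID chosen only for illustration.
url = wet_paths_url("2019-09")
print(url)
```

Because the root is a single constant, changing it in one place should be enough for every URL built this way.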

You can replace https://github.com/facebookresearch/cc_net/blob/main/cc_net/process_wet_file.py#L73-L79 with

```python
headers_map = {}
for header in headers[1:]:
    if not header:
        continue
    key, value = header.split(": ", 1)
    headers_map[key] = value
warc_type = headers_map["WARC-Type"]
if...
```
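The header-mapping idea in this suggestion can be tried on its own. Below is a self-contained sketch, assuming `headers` is a list of strings whose first element is the version line (for example `WARC/1.0`). The function name `parse_warc_headers` and the sample record are hypothetical and are not taken from cc_net.

```python
from typing import Dict, List


def parse_warc_headers(headers: List[str]) -> Dict[str, str]:
    """Map header names to values, skipping the first line and blank lines."""
    headers_map: Dict[str, str] = {}
    # headers[0] is assumed to be the version line, so it is skipped.
    for header in headers[1:]:
        if not header:
            continue
        # Split on the first ": " only, so values containing ": " stay intact.
        key, value = header.split(": ", 1)
        headers_map[key] = value
    return headers_map


# Made-up sample record, for illustration only.
sample = [
    "WARC/1.0",
    "WARC-Type: conversion",
    "WARC-Target-URI: http://example.com/page?x=a: b",
    "",
    "Content-Length: 42",
]
parsed = parse_warc_headers(sample)
print(parsed["WARC-Type"])
```

Looking headers up by name avoids depending on their order in the record. Note that a non-empty line without `": "` would raise a `ValueError` at the unpacking step, so malformed input may need an extra check.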