cc_net
Tools to download and clean up Common Crawl data
How can I compute only the perplexity of each paragraph, using your language model on local data? I don't want to use -d to dump data. I have downloaded the Chinese...
When I run `python -m cc_net` to download and extract the data, the connection fails with `requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-05/segments/1642320299852.23/wet/CC-MAIN-20220116093137-20220116123137-00540.warc.wet.gz` process_wet_file.py...
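I have already downloaded the files. How do I run only the extraction? The script downloads and extracts in one step, but I only need the extraction part. How should I handle this?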
https://data.statmt.org/cc-100/ This link only provides the corpus extracted in 2018. Is there any corpus from 2018 onwards?
When I execute: `python -m cc_net --dump 2019-13` Here is the full log. Err: ```makefile 2023-05-10 08:56 INFO 259781:cc_net.jsonql - preparing [, , ] 2023-05-10 08:56 INFO 259781:cc_net.jsonql - Opening...
Hello, Thank you for all of your great work. I am trying to just download and process the English dumps from CommonCrawl up to 2023. I have been running into...
Is there a tutorial for installing and using this project on Windows 10?
When I execute: `python -m cc_net -l fa` It throws the following exception: ``` File "/usr/local/Cellar/python@3.8/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 502, in readinto n = self.fp.readinto(b) File "/usr/local/Cellar/python@3.8/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/socket.py", line 669, in readinto return...
Hi, first of all, thank you for your great work on multilingual NLP. I'm trying to replicate XLM-R in my own research, and I found that the corpus from [statmt](https://data.statmt.org/cc-100/)...
Update cc_net so that it can run on a Spark cluster: 1. Create a Spark executor to split tasks and run them in parallel 2. Update the config to adopt Spark execution mode...