cc_net issues

how to only compute the perplexity of each paragraph using your language model with local data?

1

How can I compute only the perplexity of each paragraph using your language model on local data? I don't want to use -d to dump data. I have downloaded the Chinese...

rongjingyue423
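
The perplexity arithmetic asked about above can be sketched independently of the download step. This is a minimal sketch, not cc_net's own code: `score_fn` is a hypothetical callable that returns a total log10 probability and a token count for one paragraph, and `toy_score` is a stand-in so the example runs without a trained model. In practice the scorer would wrap a real language model.

```python
import math


def perplexity(log10_score: float, num_tokens: int) -> float:
    """Perplexity from a total log10 probability spread over num_tokens tokens."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return 10.0 ** (-log10_score / num_tokens)


def paragraph_perplexities(text, score_fn):
    """Score each non-empty line of `text` as a separate paragraph.

    `score_fn` is a hypothetical callable: paragraph -> (log10_score, num_tokens).
    """
    results = []
    for paragraph in text.split("\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip blank lines between paragraphs
        log10_score, num_tokens = score_fn(paragraph)
        results.append((paragraph, perplexity(log10_score, num_tokens)))
    return results


def toy_score(paragraph):
    """Stand-in scorer (assumption): every whitespace token has probability 0.1."""
    tokens = paragraph.split()
    return -1.0 * len(tokens), len(tokens)


if __name__ == "__main__":
    for para, ppl in paragraph_perplexities("a b c\n\nd e", toy_score):
        print(f"{ppl:.2f}\t{para}")
```

Lower perplexity means the model finds the paragraph more predictable. Swapping `toy_score` for a function backed by an actual model leaves the rest unchanged.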

503 Server Error: Service Unavailable for url

1

When I run `python -m cc_net` to download and extract the data, the connection fails with `requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-05/segments/1642320299852.23/wet/CC-MAIN-20220116093137-20220116123137-00540.warc.wet.gz` process_wet_file.py...

yangyang0202
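
A 503 response is usually transient, so one common workaround is to retry the download with exponential backoff. The helper below is a generic sketch under that assumption and is not part of cc_net: `retry_with_backoff`, `fetch`, and the `sleep` parameter are hypothetical names introduced for illustration.

```python
import time


def retry_with_backoff(fetch, max_attempts=5, base_delay=1.0,
                       sleep=time.sleep, retry_on=(Exception,)):
    """Call `fetch()` until it succeeds or `max_attempts` is reached.

    The wait doubles after every failure: base_delay, 2*base_delay, 4*base_delay, ...
    `sleep` is injectable so the helper can be tested without real waiting.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        # Simulated server: fails twice, then succeeds.
        state["calls"] += 1
        if state["calls"] < 3:
            raise RuntimeError("503 Service Unavailable")
        return "ok"

    print(retry_with_backoff(flaky, base_delay=0.01))
```

With the `requests` library, `fetch` would need to call `raise_for_status()` on the response so that a 503 surfaces as an exception, and `retry_on` could be narrowed to `requests.exceptions.RequestException`.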

Extracting text from the WET format

2

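I have already downloaded the files. How do I extract them? The script downloads and extracts in one step, but I only want the extraction part. How should I handle this?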

wwfcnu
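
For already-downloaded files, the plain text can be read straight out of a `.warc.wet.gz` file with the standard library. This is a sketch that bypasses cc_net entirely (no language identification, deduplication, or perplexity filtering); `iter_wet_records` and `extract_text` are hypothetical helper names, and the parser assumes each record carries a `Content-Length` header, as the WARC format specifies.

```python
import gzip
import io


def iter_wet_records(stream: io.BufferedIOBase):
    """Yield (headers, text) for each record in an uncompressed WET byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if not line or line in (b"\r\n", b"\n"):
                break  # an empty line ends the header block
            key, _, value = line.decode("utf-8", "replace").partition(":")
            headers[key.strip()] = value.strip()
        length = int(headers.get("Content-Length", 0))
        body = stream.read(length)
        yield headers, body.decode("utf-8", "replace")


def extract_text(path):
    """Yield (url, text) for every 'conversion' record in a local .wet.gz file."""
    with gzip.open(path, "rb") as f:
        for headers, text in iter_wet_records(f):
            if headers.get("WARC-Type") == "conversion":
                yield headers.get("WARC-Target-URI"), text
```

Iterating `extract_text` over a local file yields one `(url, text)` pair per crawled page, where the text is the plain-text rendering stored in the WET record.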

Whether CC_Net provides an existing monolingual corpus

https://data.statmt.org/cc-100/ This link only provides the corpus extracted in 2018. Is there any corpus from 2018 onwards?

yangyang0202

requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url

1

When I execute: `python -m cc_net --dump 2019-13` Here is the full log. Err: ```makefile 2023-05-10 08:56 INFO 259781:cc_net.jsonql - preparing [, , ] 2023-05-10 08:56 INFO 259781:cc_net.jsonql - Opening...

Hieunohair

Numerous Errors

2

Hello, Thank you for all of your great work. I am trying to just download and process the English dumps from CommonCrawl up to 2023. I have been running into...

conceptofmind

win10 use cc_net

Is there a tutorial for installing and using this project for win10?

z-x-x136

Error: Job not requeued because: timed-out and not checkpointable.

12

When I execute: `python -m cc_net -l fa` It throws the following exception: ``` File "/usr/local/Cellar/python@3.8/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 502, in readinto n = self.fp.readinto(b) File "/usr/local/Cellar/python@3.8/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/socket.py", line 669, in readinto return...

hadifar

CC-100 in statmt version is different from paper

Hi, first of all, thank you for your great work on multilingual NLP. I'm trying to replicate XLM-R in my own research, and I found that the corpus from [statmt](https://data.statmt.org/cc-100/)...

nbqu

Update CC_net code so that it can run on a Spark cluster

1

Update CC_net so that it can run on a Spark cluster: 1. Create a Spark executor to split tasks and run them in parallel 2. Update the config to support a Spark execution mode...

junwan-db
