[Errno 2] No such file or directory: 'cutoff.csv'
Hi, I'm trying to run this test case:
python3 -m cc_net --config config/test_segment.json
but encountered the following error:
Will run cc_net.mine.main with the following config: Config(config_name='base', dump='2019-09', output_dir=PosixPath('test_data2'), mined_dir='mined_by_segment', execution='debug', num_shards=4, min_shard=-1, num_segments_per_shard=1, metadata=None, min_len=300, hash_in_mem=1, lang_whitelist=['de', 'it', 'fr'], lang_blacklist=[], lang_threshold=0.5, keep_bucket=[], lm_dir=PosixPath('data/lm_sp'), cutoff=PosixPath('/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/data/cutoff.csv'), lm_languages=None, mine_num_processes=0, target_size='32M', cleanup_after_regroup=False, task_parallelism=-1, pipeline=['dedup', 'lid', 'keep_lang', 'sp', 'lm', 'pp_bucket', 'minify', 'split_by_segment'], experiments=[], cache_dir=PosixPath('test_data/wet_cache'))
['dedup', 'lid', 'keep_lang', 'sp', 'lm', 'pp_bucket', 'minify', 'split_by_segment']
2023-04-27 11:41 INFO 39932:cc_net.jsonql - preparing [<cc_net.dedup.DuplicatesRemover object at 0x11b5b77c0>, Classifier(bin/lid.bin), <cc_net.jsonql.where object at 0x11b5b7a30>, <cc_net.perplexity.MultiSentencePiece object at 0x11b5b78e0>, <cc_net.perplexity.DocLM object at 0x11b5b7970>, <cc_net.perplexity.PerplexityBucket object at 0x11b5b7a60>, <cc_net.minify.Minifier object at 0x11b5b7be0>]
/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/flat_hash_set.py:115: UserWarning: Module 'getpy' not found. Deduplication will take more RAM. Try `pip install cc_net[getpy]
warnings.warn(
2023-04-27 11:41 INFO 39932:DuplicatesRemover - Loaded hashes from test_data2/hashes/2019-09/0000.bin (0.700GB total, took 0.02m)
2023-04-27 11:41 INFO 39932:DuplicatesRemover - Loaded 3_361_543 hashes from 1 files. (0.7GB total, took 0.02m)
2023-04-27 11:41 INFO 39932:Classifier - Loading bin/lid.bin
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loading data/lm_sp/de.sp.model...
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loaded data/lm_sp/de.sp.model (took 0.0min)
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loading data/lm_sp/it.sp.model...
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loaded data/lm_sp/it.sp.model (took 0.0min)
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loading data/lm_sp/fr.sp.model...
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loaded data/lm_sp/fr.sp.model (took 0.0min)
2023-04-27 11:41 INFO 39932:DocLM - Loading data/lm_sp/de.arpa.bin...
2023-04-27 11:41 INFO 39932:DocLM - Loaded data/lm_sp/de.arpa.bin (took 0.0min)
2023-04-27 11:41 INFO 39932:DocLM - Loading data/lm_sp/it.arpa.bin...
2023-04-27 11:41 INFO 39932:DocLM - Loaded data/lm_sp/it.arpa.bin (took 0.0min)
2023-04-27 11:41 INFO 39932:DocLM - Loading data/lm_sp/fr.arpa.bin...
2023-04-27 11:41 INFO 39932:DocLM - Loaded data/lm_sp/fr.arpa.bin (took 0.0min)
Traceback (most recent call last):
File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/execution.py", line 142, in debug_executor
message = function(*x)
File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 439, in _mine_shard
jsonql.run_pipes(
File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 432, in run_pipes
transform = stack.enter_context(compose(transformers))
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/contextlib.py", line 429, in enter_context
result = _cm_type.__enter__(cm)
File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 312, in __enter__
self._prepare()
File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 352, in _prepare
t.__enter__()
File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 312, in __enter__
self._prepare()
File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/perplexity.py", line 267, in _prepare
cutoffs = pd.read_csv(self.cutoff_csv, index_col=0)
File "/Users/Library/Python/3.9/lib/python/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
return _read(filepath_or_buffer, kwds)
File "/Users/Library/Python/3.9/lib/python/site-packages/pandas/io/parsers/readers.py", line 577, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/Users/Library/Python/3.9/lib/python/site-packages/pandas/io/parsers/readers.py", line 1407, in __init__
self._engine = self._make_engine(f, self.engine)
File "/Users/Library/Python/3.9/lib/python/site-packages/pandas/io/parsers/readers.py", line 1661, in _make_engine
self.handles = get_handle(
File "/Users/Library/Python/3.9/lib/python/site-packages/pandas/io/common.py", line 859, in get_handle
handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '/Users/work/RedPajama-Data/data_prep/cc/cc_net/cc_net/data/cutoff.csv'
Are there any possible reasons? This is Python 3.9.6 on macOS.
You can get the missing `cutoff.csv` here -> https://github.com/facebookresearch/cc_net/blob/main/cc_net/data/cutoff.csv
Thanks! But I still failed; here's another problem. I'm having trouble setting mine_num_processes greater than 1: it seems a lambda function cannot be pickled:
Traceback (most recent call last):
File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/execution.py", line 142, in debug_executor
message = function(*x)
File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 439, in _mine_shard
jsonql.run_pipes(
File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 439, in run_pipes
multiprocessing.Pool(
File "/Users/miniconda3/envs/py38/lib/python3.8/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "/Users/miniconda3/envs/py38/lib/python3.8/multiprocessing/pool.py", line 212, in __init__
self._repopulate_pool()
File "/Users/miniconda3/envs/py38/lib/python3.8/multiprocessing/pool.py", line 303, in _repopulate_pool
return self._repopulate_pool_static(self._ctx, self.Process,
File "/Users/miniconda3/envs/py38/lib/python3.8/multiprocessing/pool.py", line 326, in _repopulate_pool_static
w.start()
File "/Users/miniconda3/envs/py38/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/Users/miniconda3/envs/py38/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/Users/miniconda3/envs/py38/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/Users/miniconda3/envs/py38/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/Users/miniconda3/envs/py38/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/Users/miniconda3/envs/py38/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object '_mine_shard.<locals>.<lambda>'
Any suggestion would be helpful.
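For what it's worth, this looks like the standard CPython limitation rather than a cc_net-specific bug: macOS defaults to the `spawn` start method, so everything handed to a `multiprocessing.Pool` worker must be picklable, and locally defined lambdas are not. A minimal reproduction (`scale` and `make_lambda` are hypothetical stand-ins, not names from cc_net):

```python
import pickle
from functools import partial

# Module-level functions pickle fine.
def scale(x, factor):
    return x * factor

def make_lambda():
    # A lambda defined inside a function is a "local object" and
    # cannot be pickled, which spawn-based worker startup requires.
    return lambda x: x * 2

try:
    pickle.dumps(make_lambda())
except (AttributeError, pickle.PicklingError) as e:
    print("cannot pickle:", e)

# A functools.partial over a module-level function serializes cleanly.
f = pickle.loads(pickle.dumps(partial(scale, factor=2)))
print(f(21))
```

The usual fixes are hoisting the lambda to a module-level function or using `functools.partial`. On Linux the default `fork` start method sidesteps the pickling step entirely, which would explain why the pipeline runs there.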
Hi @Anery, this might be due to a failed installation. Did the following steps run successfully for you (run from the cc directory)?
# Installation
cd cc_net
mkdir data
sudo apt-get update
sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev
make install
make lang=en dl_lm
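If those steps succeed, a quick way to confirm the models landed where the config expects them (file names inferred from the `Loading data/lm_sp/...` log lines above; `missing_models` is a hypothetical helper, not part of cc_net):

```python
from pathlib import Path

# For each whitelisted language, the pipeline loads a SentencePiece model
# (<lang>.sp.model) and a KenLM binary (<lang>.arpa.bin) from data/lm_sp/.
def missing_models(base="data/lm_sp", langs=("de", "it", "fr")):
    expected = [Path(base) / f"{lang}{suffix}"
                for lang in langs
                for suffix in (".sp.model", ".arpa.bin")]
    return [str(p) for p in expected if not p.exists()]

print(missing_models() or "all models present")
```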
Thanks for your reply. I'm running on macOS, and some of the packages are not installed. I'll try on Linux later.
It works well on Linux, so I'll close this issue. Thanks.