RedPajama-Data
Got error while running `python -m cc_net -l my -l gu`
I followed the example at https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/cc/cc_net#pipeline-overview but got the following error. Did I forget to run a preparation step?
(racoon) t@medu:~/repos/NAM/red-pajama/data_prep/cc/cc_net$ python -m cc_net -l my -l gu
usage: __main__.py [-h] [-c CONFIG_NAME] [-d DUMP] [-o OUTPUT_DIR] [-m MINED_DIR] [-e EXECUTION] [-n NUM_SHARDS] [--min_shard MIN_SHARD]
[--num_segments_per_shard NUM_SEGMENTS_PER_SHARD] [--metadata METADATA] [--min_len MIN_LEN] [--hash_in_mem HASH_IN_MEM]
[-l LANG_WHITELIST] [--lang_blacklist LANG_BLACKLIST] [--lang_threshold LANG_THRESHOLD] [-k KEEP_BUCKET]
[--lm_dir LM_DIR] [--cutoff CUTOFF] [--lm_languages LM_LANGUAGES] [--mine_num_processes MINE_NUM_PROCESSES]
[-t TARGET_SIZE] [--cleanup_after_regroup] [--task_parallelism TASK_PARALLELISM] [-p PIPELINE]
[--experiments EXPERIMENTS] [--cache_dir CACHE_DIR] [--config CONFIG]
__main__.py: error: argument -l/--lang_whitelist: invalid Sequence value: 'my'
Hi @tiendung, are you trying to run the pipeline on non-English languages ("my" and "gu")?
If your goal is to create the English cc slice of the RP dataset, you can follow the steps in the readme in data_prep/cc: https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/cc
Hi @mauriceweber, I would like to re-create the cc data for a non-English language; in my case, Vietnamese.
Awesome that you're working on more languages!
We haven't run the ccnet pipeline on non-English data yet, so we haven't run into this error before. I will look into it and try to reproduce the error you got.
Can you show me the steps you have run so far?
I believe I got the same error trying to provide -l en. So I removed the language parameter and was able to proceed.
Did you run make lang=my dl_lm prior to running the cc_net pipeline? This is supposed to download the lm used in the ccnet paper.
Also, let me know which steps you ran so that I can try to reproduce your error.
I got the same error when running the commands below. I think @danielpclark summed it up here: https://github.com/togethercomputer/RedPajama-Data/issues/23#issuecomment-1520547829
make lang=en dl_lm
python -m cc_net -l en
I also ran into this problem. Here is my log:
$ python -m cc_net --dump 2023-06 --task_parallelism 20 --num_shards 5000 -l en --mine_num_processes 20 --hash_in_mem 1
usage: __main__.py [-h] [-c CONFIG_NAME] [-d DUMP] [-o OUTPUT_DIR] [-m MINED_DIR] [-e EXECUTION] [-n NUM_SHARDS] [--min_shard MIN_SHARD]
[--num_segments_per_shard NUM_SEGMENTS_PER_SHARD] [--metadata METADATA] [--min_len MIN_LEN] [--hash_in_mem HASH_IN_MEM]
[-l LANG_WHITELIST] [--lang_blacklist LANG_BLACKLIST] [--lang_threshold LANG_THRESHOLD] [-k KEEP_BUCKET]
[--lm_dir LM_DIR] [--cutoff CUTOFF] [--lm_languages LM_LANGUAGES] [--mine_num_processes MINE_NUM_PROCESSES]
[-t TARGET_SIZE] [--cleanup_after_regroup] [--task_parallelism TASK_PARALLELISM] [-p PIPELINE]
[--experiments EXPERIMENTS] [--cache_dir CACHE_DIR] [--config CONFIG]
__main__.py: error: argument -l/--lang_whitelist: invalid Sequence value: 'en'
I did some more digging on this, and I could reproduce your error with Python versions newer than 3.8. Can you check whether downgrading to 3.8 solves the issue for you? We ran the pipeline with 3.8 and did not run into this error.
Yes, it works with Python 3.7.11. Thanks a lot!
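For anyone landing here later, the findings in this thread can be turned into a quick interpreter check to run before launching `python -m cc_net`. This is only a minimal sketch and not part of cc_net: the names `REPORTED_WORKING`, `is_reported_working`, and `version_warning` are hypothetical, and the version list only reflects what was reported above (3.8 and 3.7.11 worked, versions newer than 3.8 reproduced the error). Other versions are untested here.

```python
import sys

# Assumption taken from this thread only: the pipeline was run with Python 3.8,
# a user confirmed 3.7.11 works, and versions newer than 3.8 reproduced
# "invalid Sequence value". Other versions are simply unverified.
REPORTED_WORKING = {(3, 7), (3, 8)}


def is_reported_working(version_info=sys.version_info):
    """Return True if (major, minor) matches a version reported to work."""
    return (version_info[0], version_info[1]) in REPORTED_WORKING


def version_warning(version_info=sys.version_info):
    """Return a warning string for unverified interpreters, else None."""
    if is_reported_working(version_info):
        return None
    return (
        f"Python {version_info[0]}.{version_info[1]} was not reported to work "
        "with cc_net in this thread; consider an environment with Python 3.8."
    )


if __name__ == "__main__":
    message = version_warning()
    if message:
        print(message, file=sys.stderr)
```

Running this script in the environment you plan to use prints a warning to stderr when the interpreter is not one of the versions reported to work, and prints nothing otherwise.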