RedPajama-Data Got error while running `python -m cc_net -l my -l gu`

I followed the example at https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/cc/cc_net#pipeline-overview but got the following error. Did I forget to run a preparation step?

(racoon) t@medu:~/repos/NAM/red-pajama/data_prep/cc/cc_net$ python -m cc_net -l my -l gu
usage: __main__.py [-h] [-c CONFIG_NAME] [-d DUMP] [-o OUTPUT_DIR] [-m MINED_DIR] [-e EXECUTION] [-n NUM_SHARDS] [--min_shard MIN_SHARD]
                   [--num_segments_per_shard NUM_SEGMENTS_PER_SHARD] [--metadata METADATA] [--min_len MIN_LEN] [--hash_in_mem HASH_IN_MEM]   
                   [-l LANG_WHITELIST] [--lang_blacklist LANG_BLACKLIST] [--lang_threshold LANG_THRESHOLD] [-k KEEP_BUCKET]
                   [--lm_dir LM_DIR] [--cutoff CUTOFF] [--lm_languages LM_LANGUAGES] [--mine_num_processes MINE_NUM_PROCESSES]
                   [-t TARGET_SIZE] [--cleanup_after_regroup] [--task_parallelism TASK_PARALLELISM] [-p PIPELINE]
                   [--experiments EXPERIMENTS] [--cache_dir CACHE_DIR] [--config CONFIG]
__main__.py: error: argument -l/--lang_whitelist: invalid Sequence value: 'my'

Apr 23 '23 01:04 tiendung

Hi @tiendung, are you trying to run the pipeline on non-English languages ("my" and "gu")?

If your goal is to create the English CC slice of the RP dataset, you can follow the steps in the README in data_prep/cc: https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/cc

Apr 23 '23 15:04 mauriceweber

Hi @mauriceweber, I would like to re-create the CC data for a non-English language; in my case, Vietnamese.

Apr 24 '23 08:04 tiendung

Awesome that you're working on more languages!

We haven't run the ccnet pipeline on non-English data yet, so we haven't run into this error before. I will look into it and try to reproduce the error you got.

Can you show me the steps you have run so far?

Apr 24 '23 12:04 mauriceweber

I believe I got the same error when providing -l en, so I removed the language parameter and was able to proceed.

Apr 24 '23 17:04 danielpclark

Did you run make lang=my dl_lm prior to running the cc_net pipeline? This is supposed to download the lm used in the ccnet paper.

Also, let me know which steps you ran so that I can try to reproduce your error.

Apr 25 '23 13:04 mauriceweber

I got the same error when running the commands below. I think @danielpclark summed it up here: https://github.com/togethercomputer/RedPajama-Data/issues/23#issuecomment-1520547829

make lang=en dl_lm
python -m cc_net -l en

Apr 26 '23 00:04 tiendung

I also ran into this problem. Here is my log:

$ python -m cc_net --dump 2023-06 --task_parallelism 20 --num_shards 5000 -l en --mine_num_processes 20 --hash_in_mem 1
usage: __main__.py [-h] [-c CONFIG_NAME] [-d DUMP] [-o OUTPUT_DIR] [-m MINED_DIR] [-e EXECUTION] [-n NUM_SHARDS] [--min_shard MIN_SHARD]
                   [--num_segments_per_shard NUM_SEGMENTS_PER_SHARD] [--metadata METADATA] [--min_len MIN_LEN] [--hash_in_mem HASH_IN_MEM]
                   [-l LANG_WHITELIST] [--lang_blacklist LANG_BLACKLIST] [--lang_threshold LANG_THRESHOLD] [-k KEEP_BUCKET]
                   [--lm_dir LM_DIR] [--cutoff CUTOFF] [--lm_languages LM_LANGUAGES] [--mine_num_processes MINE_NUM_PROCESSES]
                   [-t TARGET_SIZE] [--cleanup_after_regroup] [--task_parallelism TASK_PARALLELISM] [-p PIPELINE]
                   [--experiments EXPERIMENTS] [--cache_dir CACHE_DIR] [--config CONFIG]
__main__.py: error: argument -l/--lang_whitelist: invalid Sequence value: 'en'

Apr 26 '23 07:04 starlitsky2010

I did some more digging on this and could reproduce your error with Python versions newer than 3.8. Can you check whether downgrading to 3.8 solves the issue for you? We ran the pipeline with 3.8 and did not run into this error.
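
For illustration, here is a minimal sketch of how argparse can emit a message of the form "invalid Sequence value". It is not taken from the cc_net code and is not a confirmed diagnosis: it assumes that an abstract sequence type ends up being used as the `type=` converter for the flag, which is one plausible way for this message to appear. The parser, `RaisingParser`, `ParseFailure`, and `try_parse` are hypothetical names; only the flag name is copied from the usage output in this thread.

```python
import argparse
from collections.abc import Sequence


class ParseFailure(Exception):
    """Raised by RaisingParser instead of exiting the interpreter."""


class RaisingParser(argparse.ArgumentParser):
    """ArgumentParser that raises instead of printing usage and exiting."""

    def error(self, message):
        raise ParseFailure(message)


def try_parse(lang_type, argv):
    """Return the parsed language list, or the argparse error message."""
    # Hypothetical parser that mirrors only the -l/--lang_whitelist flag.
    parser = RaisingParser(prog="demo")
    parser.add_argument("-l", "--lang_whitelist", type=lang_type, action="append")
    try:
        return parser.parse_args(argv).lang_whitelist
    except ParseFailure as exc:
        return str(exc)


# Assumption: an abstract type is used as the converter. Calling it on a
# string raises TypeError, which argparse reports as "invalid <name> value".
broken = try_parse(Sequence, ["-l", "my", "-l", "gu"])

# With a plain str converter, action="append" collects the repeated values.
working = try_parse(str, ["-l", "my", "-l", "gu"])

print(broken)
print(working)
```

The point of the sketch is only that the word "Sequence" in the message is the name of the converter argparse tried to call, so the failure happens before any language-specific code runs. That is consistent with the same error appearing for 'my', 'gu', and 'en'.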

Apr 26 '23 16:04 mauriceweber

Yes, it works with Python 3.7.11. Thanks a lot!

May 02 '23 08:05 starlitsky2010
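
Based on the versions reported in this thread (3.8 and 3.7.11 worked, versions newer than 3.8 reproduced the error), a small guard like the following could be run before starting the pipeline. This is a sketch under that assumption only; `LAST_REPORTED_WORKING` and `reported_working` are hypothetical names, and the cutoff reflects what was reported here rather than a documented requirement.

```python
import sys

# Assumption taken from this thread: 3.8 and 3.7.11 were reported to work,
# while versions newer than 3.8 reproduced the error.
LAST_REPORTED_WORKING = (3, 8)


def reported_working(version_info=None):
    """Return True if (major, minor) is at or below the last version reported to work."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) <= LAST_REPORTED_WORKING


if not reported_working():
    print(
        "Python %d.%d is newer than 3.8; the -l/--lang_whitelist error "
        "was reported on such versions." % tuple(sys.version_info[:2]),
        file=sys.stderr,
    )
```

Calling `reported_working((3, 7, 11))` checks a specific version tuple instead of the running interpreter.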
