Got error while running `python -m cc_net -l my -l gu`

Open tiendung opened this issue 2 years ago • 8 comments

I followed the example at https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/cc/cc_net#pipeline-overview but got the following error. Did I forget to run a preparation step?

(racoon) t@medu:~/repos/NAM/red-pajama/data_prep/cc/cc_net$ python -m cc_net -l my -l gu
usage: __main__.py [-h] [-c CONFIG_NAME] [-d DUMP] [-o OUTPUT_DIR] [-m MINED_DIR] [-e EXECUTION] [-n NUM_SHARDS] [--min_shard MIN_SHARD]
                   [--num_segments_per_shard NUM_SEGMENTS_PER_SHARD] [--metadata METADATA] [--min_len MIN_LEN] [--hash_in_mem HASH_IN_MEM]   
                   [-l LANG_WHITELIST] [--lang_blacklist LANG_BLACKLIST] [--lang_threshold LANG_THRESHOLD] [-k KEEP_BUCKET]
                   [--lm_dir LM_DIR] [--cutoff CUTOFF] [--lm_languages LM_LANGUAGES] [--mine_num_processes MINE_NUM_PROCESSES]
                   [-t TARGET_SIZE] [--cleanup_after_regroup] [--task_parallelism TASK_PARALLELISM] [-p PIPELINE]
                   [--experiments EXPERIMENTS] [--cache_dir CACHE_DIR] [--config CONFIG]
__main__.py: error: argument -l/--lang_whitelist: invalid Sequence value: 'my'

tiendung avatar Apr 23 '23 01:04 tiendung

Hi @tiendung, are you trying to run the pipeline on non-English languages ("my" and "gu")?

If your goal is to create the English cc slice of the RP dataset, you can follow the steps in the readme in data_prep/cc: https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/cc

mauriceweber avatar Apr 23 '23 15:04 mauriceweber

Hi @mauriceweber, I would like to re-create the cc data for a non-English language; in my case, Vietnamese.

tiendung avatar Apr 24 '23 08:04 tiendung

Awesome that you're working on more languages!

We haven't run the ccnet pipeline on non-English data yet, so we haven't run into this error before. I will look into it and try to reproduce the error you got.

Can you show me the steps you have run so far?

mauriceweber avatar Apr 24 '23 12:04 mauriceweber

I believe I got the same error trying to provide -l en. So I removed the language parameter and was able to proceed.

danielpclark avatar Apr 24 '23 17:04 danielpclark

Did you run make lang=my dl_lm prior to running the cc_net pipeline? This is supposed to download the lm used in the ccnet paper.

Also, let me know which steps you ran so that I can try to reproduce your error.

mauriceweber avatar Apr 25 '23 13:04 mauriceweber

I got the same error when running the commands below. I think @danielpclark summed it up here: https://github.com/togethercomputer/RedPajama-Data/issues/23#issuecomment-1520547829

make lang=en dl_lm
python -m cc_net -l en

tiendung avatar Apr 26 '23 00:04 tiendung

I also ran into this problem. Here is my log:

$ python -m cc_net --dump 2023-06 --task_parallelism 20 --num_shards 5000 -l en --mine_num_processes 20 --hash_in_mem 1
usage: __main__.py [-h] [-c CONFIG_NAME] [-d DUMP] [-o OUTPUT_DIR] [-m MINED_DIR] [-e EXECUTION] [-n NUM_SHARDS] [--min_shard MIN_SHARD]
                   [--num_segments_per_shard NUM_SEGMENTS_PER_SHARD] [--metadata METADATA] [--min_len MIN_LEN] [--hash_in_mem HASH_IN_MEM]
                   [-l LANG_WHITELIST] [--lang_blacklist LANG_BLACKLIST] [--lang_threshold LANG_THRESHOLD] [-k KEEP_BUCKET]
                   [--lm_dir LM_DIR] [--cutoff CUTOFF] [--lm_languages LM_LANGUAGES] [--mine_num_processes MINE_NUM_PROCESSES]
                   [-t TARGET_SIZE] [--cleanup_after_regroup] [--task_parallelism TASK_PARALLELISM] [-p PIPELINE]
                   [--experiments EXPERIMENTS] [--cache_dir CACHE_DIR] [--config CONFIG]
__main__.py: error: argument -l/--lang_whitelist: invalid Sequence value: 'en'

starlitsky2010 avatar Apr 26 '23 07:04 starlitsky2010
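
For readers wondering where the wording of this message comes from: argparse reports `invalid <name> value` whenever the callable passed as `type=` raises `TypeError` or `ValueError`, and it takes `<name>` from that callable's `__name__`. The sketch below reproduces the same message with a hypothetical converter function named `Sequence`. It is not cc_net's actual code, and the thread does not establish why the real converter fails; `RaisingParser`, `ParseFailure`, and `parse_error_message` are illustrative names only.

```python
import argparse


class ParseFailure(Exception):
    """Raised in place of argparse's default behaviour of printing and exiting."""


class RaisingParser(argparse.ArgumentParser):
    """ArgumentParser that raises so the error message can be inspected."""

    def error(self, message):
        raise ParseFailure(message)


def Sequence(value):
    # Hypothetical converter, named only to mirror the log above.
    # It always fails, standing in for a type that cannot be called on a string.
    raise TypeError(f"cannot build a sequence from {value!r}")


def parse_error_message(argv):
    """Return argparse's error message for argv, or None if parsing succeeds."""
    parser = RaisingParser(prog="__main__.py")
    parser.add_argument("-l", "--lang_whitelist", type=Sequence, action="append")
    try:
        parser.parse_args(argv)
    except ParseFailure as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(parse_error_message(["-l", "my", "-l", "gu"]))
```

The takeaway is that the message points at the conversion of the `-l` value, not at the language codes themselves, which is consistent with `-l en` failing in the same way.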

I did some more digging on this and could reproduce your error with Python versions newer than 3.8. Can you check whether downgrading to 3.8 solves the issue for you? We ran the pipeline with 3.8 and did not run into this error.

mauriceweber avatar Apr 26 '23 16:04 mauriceweber
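
A minimal sketch of a guard based on the observation above, assuming only what the thread reports: Python 3.8 and older work, newer versions hit the error. The function names are hypothetical and not part of cc_net.

```python
import sys


def is_known_good_python(version_info=sys.version_info):
    """Return True for interpreter versions the thread reports as working (<= 3.8)."""
    return tuple(version_info[:2]) <= (3, 8)


def check_python_version(version_info=sys.version_info):
    """Raise a descriptive error before the pipeline reaches argument parsing."""
    if not is_known_good_python(version_info):
        major, minor = version_info[:2]
        raise RuntimeError(
            f"Python {major}.{minor} detected; the '-l' option was reported to fail "
            "on versions newer than 3.8. Try a Python 3.8 environment."
        )
```

Calling `check_python_version()` from a wrapper script before running `python -m cc_net` would replace the opaque argparse message with one that names the likely cause.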

Yes, it works when I use Python 3.7.11. Thanks a lot!

starlitsky2010 avatar May 02 '23 08:05 starlitsky2010