
Invalid argument when running cc_net

Open Practicinginhell opened this issue 2 years ago • 2 comments

Hi everyone, I tried to run cc_net with this command: python -m cc_net --dump 2023-06 --task_parallelism 20 --num_shards 5000 -l en --mine_num_processes 20 --hash_in_mem 1. It fails with an "invalid argument value for sequence type" error on the -l argument. Thank you in advance for any help.

Practicinginhell, Nov 07 '23 12:11

The -l flag sets the language. It was for an older version of CC Net. The original project has been archived, but you can remove the "-l en" part from the command and edit the file here: https://github.com/togethercomputer/RedPajama-Data/blob/rp_v1/data_prep/cc/cc_net/cc_net/mine.py#L88C37-L88C37

Then add the languages you want. For example, to keep only en, you would set:

lang_whitelist: Sequence[str] = ["en"]
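To illustrate what that one-line change does, here is a minimal sketch. It assumes the setting lives as a default field on a NamedTuple-style config class; the Config class below is a hypothetical stand-in with only two fields, and keep_document is a made-up helper for illustration, not a function from cc_net.

```python
from typing import NamedTuple, Sequence


class Config(NamedTuple):
    """Hypothetical stand-in for the config class in mine.py (two fields only)."""

    dump: str = "2023-06"
    # Hard-coded default that takes the place of the "-l en" command-line flag.
    lang_whitelist: Sequence[str] = ["en"]


def keep_document(conf: Config, lang: str) -> bool:
    """Hypothetical filter: an empty whitelist is assumed to keep every language."""
    return not conf.lang_whitelist or lang in conf.lang_whitelist


conf = Config()
print(conf.lang_whitelist)
print(keep_document(conf, "en"), keep_document(conf, "fr"))
```

With the default edited in the source file, the language no longer needs to be passed on the command line, so the failing -l argument can simply be dropped.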

hicotton02, Nov 07 '23 15:11

Thank you! I fixed it the same way you described above. But I wonder why the README in the cc_net module hasn't been updated. I think this is an issue with func_argparse, which does not seem to accept the subsequent arguments as a Sequence, because the error still happens even when I use the original cc_net repo.
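For comparison, this is roughly how a sequence-valued flag is usually declared with the standard library's argparse. It is only a sketch: it uses plain argparse rather than func_argparse, it does not reproduce the error above, and the parser and the parse_langs helper are hypothetical.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a small demo parser (not the real cc_net command line)."""
    parser = argparse.ArgumentParser(prog="demo")
    # nargs="*" collects zero or more values after -l into a list.
    parser.add_argument("-l", "--lang_whitelist", nargs="*", default=[])
    parser.add_argument("--dump", default="2023-06")
    return parser


def parse_langs(argv: List[str]) -> List[str]:
    """Return the languages parsed from a list of command-line tokens."""
    return build_parser().parse_args(argv).lang_whitelist


print(parse_langs(["-l", "en"]))
print(parse_langs(["-l", "en", "fr", "--dump", "2023-06"]))
```

If a parser generated from type hints does not set up the equivalent of nargs for a Sequence[str] field, a call like "-l en" could be rejected, which would match the behavior described here, though that is a guess rather than something confirmed in this thread.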

Practicinginhell, Nov 07 '23 17:11