llm-foundry

downloading datasets

Open germanjke opened this issue 2 years ago • 1 comments

Hi!

You have a data preparation script referenced in scripts/train, which is run as:

python ../data_prep/convert_dataset_hf.py --dataset c4 --data_subset en --out_root ./my-copy-c4 --splits train_small val_small --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>'

Can we download multilingual mC4, for example the Russian subset? Does your trainer support this?

I got this error when I tried to change data_subset to ru:

ValueError: BuilderConfig ru not found. Available: ['en', 'realnewslike', 'en.noblocklist', 'en.noclean']

germanjke avatar May 22 '23 09:05 germanjke

Hi @germanjke, right now the script only works on C4 and the Pile. We are planning to extend it to work with any Hugging Face dataset soon.

codestar12 avatar May 22 '23 16:05 codestar12
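
For anyone hitting the same ValueError: the message lists the configurations the c4 builder accepted at the time, and ru is not among them. Below is a minimal sketch of checking a --data_subset value before launching the conversion. The list is copied from the error message above, and check_subset is a hypothetical helper, not part of llm-foundry. With the Hugging Face datasets library installed, a list like this can usually be fetched with datasets.get_dataset_config_names instead of being hard-coded.

```python
# Configuration names copied verbatim from the ValueError in this thread.
# Assumption: these were the only configs the "c4" builder exposed at the time.
C4_CONFIGS = ["en", "realnewslike", "en.noblocklist", "en.noclean"]


def check_subset(subset, available=C4_CONFIGS):
    """Hypothetical helper: return `subset` if it is a known config name.

    Raises ValueError with the same wording as the error in this thread
    when the requested subset is not available.
    """
    if subset not in available:
        raise ValueError(
            f"BuilderConfig {subset} not found. Available: {available}"
        )
    return subset


if __name__ == "__main__":
    for name in ("en", "ru"):
        try:
            check_subset(name)
            print(f"{name}: ok")
        except ValueError as err:
            print(f"{name}: {err}")
```

Running the check first avoids starting a long download only to fail on an unknown subset name.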