llm-foundry
downloading datasets
Hi!
In scripts/train you have a data preparation command:
python ../data_prep/convert_dataset_hf.py --dataset c4 --data_subset en --out_root ./my-copy-c4 --splits train_small val_small --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>'
Can we download multilingual mc4, for example Russian? Do you support this with your trainer?
I got this error when I tried to change data_subset:
ValueError: BuilderConfig ru not found. Available: ['en', 'realnewslike', 'en.noblocklist', 'en.noclean']
Hi @germanjke, right now the script only works on c4 and the Pile. We are planning to extend it to work with any Hugging Face dataset soon.
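In the meantime, a quick way to avoid the ValueError above is to check a subset name against the dataset's config names before running the conversion script. The following is only a minimal sketch: `C4_CONFIGS` is copied from the error message in this thread, `check_subset` and `list_configs` are hypothetical helper names (not part of llm-foundry), and `list_configs` assumes the `datasets` library is installed and the Hub is reachable.

```python
from typing import List, Sequence

# Config names copied verbatim from the ValueError quoted in this thread.
C4_CONFIGS = ["en", "realnewslike", "en.noblocklist", "en.noclean"]


def check_subset(subset: str, available: Sequence[str]) -> bool:
    """Return True if `subset` is one of the available config names."""
    return subset in available


def list_configs(dataset_name: str) -> List[str]:
    """Ask the Hugging Face Hub which configs a dataset exposes.

    Hypothetical helper: imported lazily so check_subset works without
    `datasets` installed. Requires network access when called.
    """
    from datasets import get_dataset_config_names

    return get_dataset_config_names(dataset_name)


# Offline check against the list from the error message:
for name in ("en", "ru"):
    status = "available" if check_subset(name, C4_CONFIGS) else "not available"
    print(f"c4 subset {name!r}: {status}")
```

Calling `list_configs` on another dataset name would show which subsets it exposes, but that alone does not make the conversion script support it; per the reply above, only c4 and the Pile work for now.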