RedPajama-Data
Setting languages=['en'] still returns non-English results
Thank you for your work; it has been very helpful. However, I have run into an issue.
my code:
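from datasets import load_dataset

# Load the "sample" config from a local copy of the loading script
ds = load_dataset(
    "/data/public/models/RedPajama-Data-V2/RedPajama-Data-V2/RedPajama-Data-V2.py",
    name="sample",
    partition="head_middle",
    languages=["en"],
)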
But ds contains non-English results:
Thank you for your reply!
Hi @HT-NEKO, the sample subset of the dataset cannot be split by language, as it is intended only for a quick glance at the data. If you want a smaller subset of the dataset, you can choose any of sample-10B, sample-100B, or sample-1T (corresponding to 10B, 100B, and 1T tokens, respectively). These support splitting by language.
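For anyone landing here later, a minimal sketch of the suggestion above: it reuses the call from the original post and only swaps name="sample" for one of the language-splittable configs. The helper names (build_load_kwargs, load_english_subset) and the SPLITTABLE_CONFIGS tuple are hypothetical and only for illustration; the script path is whatever local or Hub path you already use. Depending on your datasets version, script-based loading may need extra arguments such as trust_remote_code=True, which is not shown here.

```python
# Configs that, per the reply above, support splitting by language.
# (Hypothetical constant, taken from the names listed in this thread.)
SPLITTABLE_CONFIGS = ("sample-10B", "sample-100B", "sample-1T")


def build_load_kwargs(script_path, name="sample-10B",
                      languages=("en",), partition="head_middle"):
    """Build keyword arguments for load_dataset, rejecting configs
    that are not listed as language-splittable (e.g. plain "sample")."""
    if name not in SPLITTABLE_CONFIGS:
        raise ValueError(
            f"{name!r} is not language-splittable; "
            f"choose one of {SPLITTABLE_CONFIGS}"
        )
    return {
        "path": script_path,
        "name": name,
        "partition": partition,
        "languages": list(languages),
    }


def load_english_subset(script_path):
    """Load the English part of sample-10B (this downloads data)."""
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(**build_load_kwargs(script_path))
```

Calling load_english_subset with the same script path as in the original post should then only pull the English files; passing name="sample" to build_load_kwargs raises a ValueError instead of silently ignoring the languages argument.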