set languages=['en'], but contains results outside of English

Open HT-NEKO opened this issue 1 year ago • 1 comment

Thank you for your work; it has been very helpful. However, I have run into an issue.

My code:

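from datasets import load_dataset

# Load the "sample" config from a local copy of the loading script, requesting English only.
ds = load_dataset(
    "/data/public/models/RedPajama-Data-V2/RedPajama-Data-V2/RedPajama-Data-V2.py",
    partition="head_middle",
    languages=["en"],
    name="sample",
)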

However, ds still contains non-English results.

Thank you for your reply!

Dec 17 '24 07:12 HT-NEKO

Hi @HT-NEKO, the sample subset of the dataset cannot be split by language, as it is intended only for a quick glance at the data. If you want a smaller subset of the dataset, you can choose any of sample-10B, sample-100B, or sample-1T (corresponding to 10B, 100B, and 1T tokens, respectively). These support splitting by language.
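
As a rough sketch of what that change could look like, the snippet below swaps name="sample" for one of the larger sample configs and refuses config names that, per the explanation above, do not honor the languages argument. The helper names build_load_kwargs and load_redpajama_subset are hypothetical and exist only for illustration; the partition and languages arguments are taken from the original post, and the call is assumed to behave the same way with the local loading script path used there.

```python
# Config names listed in the reply above as supporting language splitting.
LANGUAGE_SPLITTABLE_CONFIGS = ("sample-10B", "sample-100B", "sample-1T")


def build_load_kwargs(name, languages, partition="head_middle"):
    """Hypothetical helper: build load_dataset kwargs, rejecting configs that ignore `languages`."""
    if name not in LANGUAGE_SPLITTABLE_CONFIGS:
        raise ValueError(
            f"config {name!r} does not support splitting by language; "
            f"choose one of {LANGUAGE_SPLITTABLE_CONFIGS}"
        )
    return {"name": name, "partition": partition, "languages": list(languages)}


def load_redpajama_subset(path, name="sample-10B", languages=("en",)):
    """Hypothetical helper: load a language-filtered subset from `path` (loading script or repo id)."""
    # Imported lazily so build_load_kwargs can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(path, **build_load_kwargs(name, languages))
```

With the path from the original post, this would be called as ds = load_redpajama_subset("/data/public/models/RedPajama-Data-V2/RedPajama-Data-V2/RedPajama-Data-V2.py"). Passing name="sample" raises a ValueError up front instead of silently returning mixed-language data.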

Jan 06 '25 08:01 mauriceweber