
Add option to ignore keys/columns when loading a dataset from jsonl (or any other data format)

Open avishaiElmakies opened this issue 7 months ago • 10 comments

Feature request

Hi, I would like the option to ignore keys/columns when loading a dataset from files (e.g. jsonl).

Motivation

I am working with a dataset stored as JSONL. It seems the dataset is unclean: one column has a different type in each row. I can't clean the data or remove the column (it is not my data, and it is too big for me to clean and save on my own hardware). I would like the option to simply ignore this column when using load_dataset, since I don't need it. I looked for an existing way to do this but couldn't find one; if there is one, I would love some help. If it is not currently possible, I would love to see this feature added.

Your contribution

I don't think I can help this time, unfortunately.

avishaiElmakies avatar Jun 05 '25 11:06 avishaiElmakies

Good point, I'd be in favor of having the columns argument in JsonConfig (and the others) to align with ParquetConfig to let users choose which columns to load and ignore the rest

lhoestq avatar Jun 05 '25 12:06 lhoestq

Is it possible to ignore columns when using parquet?

avishaiElmakies avatar Jun 05 '25 12:06 avishaiElmakies

Yes, you can pass columns=... to load_dataset to select which columns to load, and it is passed to ParquetConfig :)

lhoestq avatar Jun 05 '25 12:06 lhoestq

OK, I didn't know that. Anyway, it would be good to add this to the other formats.

avishaiElmakies avatar Jun 05 '25 12:06 avishaiElmakies

Hi @lhoestq

I'd like to take this up!

As you suggested, I’ll extend the support for the columns parameter (currently used in ParquetConfig) to JsonConfig as well. This will allow users to selectively load specific keys/columns from .jsonl (or .json) files and ignore the rest — solving the type inconsistency issues in unclean datasets.

ArjunJagdale avatar Jun 27 '25 06:06 ArjunJagdale

Hi @avishaiElmakies and @lhoestq

Just wanted to let you know that this is now implemented in #7594. As suggested, support for the columns=... argument (previously available for Parquet) has now been extended to JSON and JSONL loading via load_dataset(...). You can now load only specific keys/columns and skip the rest, which should help in cases where some fields are unclean, inconsistent, or just unnecessary.

✅ Example:

from datasets import load_dataset

dataset = load_dataset("json", data_files="your_data.jsonl", columns=["id", "title"])
print(dataset["train"].column_names)
# Output: ['id', 'title']

🔧 Summary of changes:

  • Added columns: Optional[List[str]] to JsonConfig
  • Updated _generate_tables() to filter selected columns
  • Forwarded columns argument from load_dataset() to the config
  • Added test case to validate behavior
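
For readers who want a feel for the column-filtering idea without opening the PR, here is a minimal standalone sketch using only the standard library. It is not the code from #7594 and does not reproduce how _generate_tables() works internally; the function name iter_selected_columns and the sample data are hypothetical, chosen only for illustration. It shows why dropping a key early sidesteps the mixed-type problem: the inconsistent field is discarded before any schema would be inferred from it.

```python
import io
import json
from typing import Iterable, Iterator, List, Optional


def iter_selected_columns(
    lines: Iterable[str], columns: Optional[List[str]] = None
) -> Iterator[dict]:
    """Yield one dict per JSONL line, keeping only the requested keys.

    Hypothetical helper for illustration; not part of the datasets library.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if columns is None:
            yield record  # no selection requested: keep every key
        else:
            # Keys missing from a row become None so all rows share the same columns.
            yield {key: record.get(key) for key in columns}


# The "meta" field has a different type on each row; it is dropped here,
# so nothing downstream ever has to reconcile its types.
sample = io.StringIO(
    '{"id": 1, "title": "a", "meta": 3}\n'
    '{"id": 2, "title": "b", "meta": {"k": "v"}}\n'
)
rows = list(iter_selected_columns(sample, columns=["id", "title"]))
print(rows)
```

Note that each line is still fully parsed by json.loads, so this sketch saves no parsing work; it only keeps the unwanted field out of the resulting rows.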

Let me know if you'd like the same to be added for CSV or others as a follow-up — happy to help.

ArjunJagdale avatar Jun 27 '25 17:06 ArjunJagdale

@ArjunJagdale this looks great, thanks! I believe every format supported by datasets should have this feature. It is very useful and would streamline the API: people would know they can use columns to select the columns they want, regardless of the data format.

avishaiElmakies avatar Jun 27 '25 17:06 avishaiElmakies

Thanks @avishaiElmakies — totally agree, making columns=... support consistent across all formats would be really helpful for users.

ArjunJagdale avatar Jun 28 '25 09:06 ArjunJagdale

#Codex Fix

SirBoely avatar Oct 23 '25 14:10 SirBoely
