Add option to ignore keys/columns when loading a dataset from jsonl (or any other data format)
Feature request
Hi, I would like the option to ignore keys/columns when loading a dataset from files (e.g. jsonl).
Motivation
I am working with a dataset stored as jsonl. The data is unclean: one column has a different type from row to row. I can't clean the data or remove the column myself (it is not my data, and it is too big for me to clean and save on my own hardware).
I would like the option to simply ignore this column when using load_dataset, since I don't need it.
I checked whether this is already possible but couldn't find a solution. If there is one, I would love some help. If it is not currently possible, I would love to see this feature added.
Your contribution
I don't think I can help this time, unfortunately.
Good point, I'd be in favor of having the columns argument in JsonConfig (and the others) to align with ParquetConfig to let users choose which columns to load and ignore the rest
Is it possible to ignore columns when using parquet?
Yes, you can pass columns=... to load_dataset to select which columns to load, and it is passed to ParquetConfig :)
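For JSON Lines, where no equivalent argument was available at this point in the thread, one possible workaround is to drop the unwanted keys before the rows reach the library. The following is only a minimal sketch under stated assumptions: `iter_selected` and the sample data are hypothetical names made up for illustration, not part of datasets, and the code uses only the standard library.

```python
import json
from typing import Iterable, Iterator


def iter_selected(lines: Iterable[str], columns: list[str]) -> Iterator[dict]:
    """Yield one dict per JSON line, keeping only the requested keys.

    Hypothetical helper for illustration; not part of the datasets library.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Keys missing from a record become None so every row has the same shape.
        yield {key: record.get(key) for key in columns}


# In-memory stand-in for a .jsonl file; "meta" has a different type in each row.
sample = [
    '{"id": 1, "title": "a", "meta": {"x": 1}}',
    '{"id": 2, "title": "b", "meta": "oops"}',
]
rows = list(iter_selected(sample, ["id", "title"]))
print(rows)
```

With a real file, an open file handle can be passed as `lines`, since iterating over it yields one line at a time. A generator function wrapping this helper could then be handed to `Dataset.from_generator`, so the inconsistent column never has to be type-inferred.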
Ok, I didn't know that. Anyway, it would be good to add this to the other formats.
Hi @lhoestq
I'd like to take this up!
As you suggested, I’ll extend the support for the columns parameter (currently used in ParquetConfig) to JsonConfig as well. This will allow users to selectively load specific keys/columns from .jsonl (or .json) files and ignore the rest — solving the type inconsistency issues in unclean datasets.
Hi @avishaiElmakies and @lhoestq
Just wanted to let you know that this is now implemented in #7594
As suggested, support for the columns=... argument (previously available for Parquet) has now been extended to JSON and JSONL loading via load_dataset(...). You can now load only specific keys/columns and skip the rest — which should help in cases where some fields are unclean, inconsistent, or just unnecessary.
✅ Example:
from datasets import load_dataset
dataset = load_dataset("json", data_files="your_data.jsonl", columns=["id", "title"])
print(dataset["train"].column_names)
# Output: ['id', 'title']
🔧 Summary of changes:
- Added columns: Optional[List[str]] to JsonConfig
- Updated _generate_tables() to filter selected columns
- Forwarded columns argument from load_dataset() to the config
- Added test case to validate behavior
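Conceptually, the filtering step can be pictured as turning parsed rows into a column-oriented batch that contains only the requested names. The sketch below is a simplified illustration and not the code from the PR: `rows_to_columns` is a hypothetical name, it works on plain Python dicts rather than Arrow tables, and filling missing keys with None is an assumption.

```python
from typing import Any, Optional


def rows_to_columns(
    rows: list[dict[str, Any]], columns: Optional[list[str]] = None
) -> dict[str, list]:
    """Build a column-oriented batch, restricted to `columns` when given.

    Hypothetical illustration only; not the datasets implementation.
    """
    if columns is None:
        # No selection requested: keep every key, in first-seen order.
        columns = list(dict.fromkeys(key for row in rows for key in row))
    # Assumption: a key absent from a row is represented as None.
    return {name: [row.get(name) for row in rows] for name in columns}


parsed = [
    {"id": 1, "title": "a", "meta": {"x": 1}},
    {"id": 2, "meta": "oops"},
]
batch = rows_to_columns(parsed, ["id", "title"])
print(batch)
```

Because the unselected "meta" key is never copied into the batch, its mixed types cannot cause a schema conflict later on.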
Let me know if you'd like the same to be added for CSV or others as a follow-up — happy to help.
@ArjunJagdale this looks great! Thanks!
I believe every format supported by datasets should have this feature. It is very useful and would streamline the API: people would know they can use columns to select the columns they want, regardless of the data format.
Thanks @avishaiElmakies — totally agree, making columns=... support consistent across all formats would be really helpful for users.
#Codex Fix