datasets
Allow downloading just some columns of a dataset
Is your feature request related to a problem? Please describe. Some people are interested in doing label analysis of a CV dataset without downloading all the images. Downloading the whole dataset does not always make sense for this kind of use case.
Describe the solution you'd like Be able to just download some columns of a dataset, such as doing
load_dataset("huggan/wikiart", columns=["artist", "genre"])
Although this might make things a bit complicated in terms of local caching of datasets.
In the general case you can't always reduce the quantity of data to download, since you can't parse CSV or JSON data without downloading the whole files, right? ^^ However, we could explore this case by case, I guess.
Actually, for CSV, pandas has usecols, which allows loading a subset of columns in a more efficient way afaik, but yes, you're right, this might be more complex than I thought.
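For reference, here is a minimal sketch of the usecols behaviour mentioned above. The CSV content and column names are made up for illustration. Note that usecols limits which columns are parsed into the DataFrame; the file itself is still read in full, so this saves parsing time and memory rather than download bandwidth.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a dataset's metadata file.
csv_text = (
    "image_path,artist,genre\n"
    "img/0001.jpg,claude-monet,landscape\n"
    "img/0002.jpg,vincent-van-gogh,portrait\n"
)

# usecols restricts which columns end up in the DataFrame.
# The underlying file is still read from start to end.
labels = pd.read_csv(io.StringIO(csv_text), usecols=["artist", "genre"])

print(labels.columns.tolist())
print(len(labels))
```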
Bumping the visibility of this :) Is there a recommended way of doing this?
Passing columns=[...] to load_dataset() in streaming mode does work if the dataset is in Parquet format, but for other formats it's either not possible or not implemented.
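A minimal sketch of that usage, assuming the repository stores its data as Parquet and that the installed datasets version forwards columns to the Parquet builder. The helper name peek_columns and its parameters are hypothetical, not part of the library, and other comments in this thread report that results can vary by dataset.

```python
from itertools import islice


def peek_columns(repo_id, columns, n=5, split="train"):
    """Return the first n rows of a Hub dataset, restricted to `columns`.

    Hypothetical helper; assumes the dataset is stored as Parquet.
    """
    # Imported lazily so that defining the helper does not require the library.
    from datasets import load_dataset

    # streaming=True yields rows lazily instead of downloading every file
    # up front; `columns` is passed through as a builder config argument.
    ds = load_dataset(repo_id, split=split, streaming=True, columns=columns)

    # Stop after n rows so the whole stream is not consumed.
    return list(islice(ds, n))
```

With network access, peek_columns("huggan/wikiart", ["artist", "genre"], n=3) would be one way to try this on the dataset from the original request. Without streaming=True, the full files are downloaded first regardless of columns.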
I tried using columns=['bambara'] on this dataset, oza75/bambara-tts, which is in Parquet, but it does not work. This feature is really useful because sometimes you don't want to download the whole dataset but just a few columns.
It doesn't work for the dataset in Parquet format. Are we missing something?
It only works with streaming=True. When not streaming, it does download the full files locally before reading the data.
Hi @lhoestq, I have a 250 GB audio dataset on the Hugging Face Hub in Parquet format. I only wanted to load the text column, but it is taking a lot of time. It seems like it is downloading the audio as well, even in streaming mode.