datasets Allow downloading just some columns of a dataset

Is your feature request related to a problem? Please describe. Some people are interested in doing label analysis of a CV dataset without downloading all the images. Downloading the whole dataset does not always make sense for this kind of use case.

Describe the solution you'd like Be able to download only some columns of a dataset, for example:

load_dataset("huggan/wikiart", columns=["artist", "genre"])

Although this might make things a bit complicated in terms of local caching of datasets.

Apr 06 '22 16:04 osanseviero

In the general case you can't always reduce the quantity of data to download, since you can't parse CSV or JSON data without downloading the whole files, right? ^^ However, we could explore this case by case, I guess.

Apr 06 '22 16:04 lhoestq

Actually, for CSV, pandas has usecols, which allows loading a subset of columns in a more efficient way afaik. But yes, you're right, this might be more complex than I thought.
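For reference, a minimal sketch of what usecols does, using a small in-memory CSV whose contents are made up for illustration. Note that usecols limits which columns are parsed and kept in memory; the CSV text itself is still read in full, which matches the caveat raised above.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real metadata file.
csv_text = (
    "artist,genre,image_path\n"
    "monet,landscape,a.jpg\n"
    "picasso,portrait,b.jpg\n"
)

# usecols restricts which columns end up in the DataFrame.
# The parser still has to scan every row of the source text.
labels = pd.read_csv(io.StringIO(csv_text), usecols=["artist", "genre"])

print(labels.columns.tolist())
```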

Apr 07 '22 07:04 osanseviero

Bumping the visibility of this :) Is there a recommended way of doing this?

Feb 20 '24 16:02 lukasugar

Passing columns=[...] to load_dataset() in streaming mode does work if the dataset is in Parquet format, but for other formats it's either not possible or not implemented.

Feb 21 '24 11:02 lhoestq

I tried using columns=['bambara'] on the dataset oza75/bambara-tts, which is in Parquet, but it does not work. This feature is really useful because sometimes you don't want to download the whole dataset but just a few columns.

Apr 07 '24 13:04 oza75

It doesn't work for the dataset with parquet format. Are we missing something?

May 16 '24 14:05 Ravi2712

It only works for streaming=True. When not streaming, it does download the full files locally before reading the data.
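A rough sketch of the streaming usage described here, assuming a Parquet-backed dataset with a "train" split. The helper name stream_label_columns and the split name are illustrative assumptions, not something stated in this thread, and actual behavior may depend on the installed datasets version.

```python
def stream_label_columns(repo_id, columns, n=5):
    """Return the first n rows of selected columns in streaming mode.

    Hypothetical helper: assumes the dataset is stored as Parquet
    and exposes a "train" split.
    """
    # Imported lazily so that defining the helper does not require
    # the datasets package to be installed.
    from datasets import load_dataset

    ds = load_dataset(repo_id, split="train", streaming=True, columns=columns)
    # take(n) yields only the first n examples of the iterable dataset.
    return list(ds.take(n))
```

It could be called as stream_label_columns("huggan/wikiart", ["artist", "genre"]), which needs network access to the Hub.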

May 17 '24 09:05 lhoestq

Hi @lhoestq, I have a 250GB audio dataset on the Hugging Face Hub in Parquet format. I only wanted to load the text column, but it is taking a lot of time. It seems like it is downloading the audio as well, even in streaming mode.

Jul 06 '24 01:07 kdcyberdude
