
Allow downloading just some columns of a dataset

Open osanseviero opened this issue 2 years ago • 9 comments

Is your feature request related to a problem? Please describe. Some people are interested in doing label analysis of a CV dataset without downloading all the images. Downloading the whole dataset does not always make sense for this kind of use case.

Describe the solution you'd like Be able to download just some columns of a dataset, for example:

load_dataset("huggan/wikiart", columns=["artist", "genre"])

Although this might make things a bit complicated in terms of local caching of datasets.

osanseviero avatar Apr 06 '22 16:04 osanseviero

In the general case you can't always reduce the quantity of data to download, since you can't parse CSV or JSON data without downloading the whole files, right? ^^ However, we could explore this case by case, I guess.

lhoestq avatar Apr 06 '22 16:04 lhoestq

Actually, for CSV, pandas has usecols, which allows loading a subset of columns in a more efficient way afaik, but yes, you're right, this might be more complex than I thought.
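As a small illustration of the usecols option mentioned above, here is a minimal sketch. The CSV content and the image_path column are made up for the example. Note that the file itself is still read in full; usecols only limits which columns are parsed and kept in memory, so it does not by itself reduce what has to be downloaded.

```python
import io

import pandas as pd

# Hypothetical CSV with two label columns and one column we do not need.
csv_text = (
    "artist,genre,image_path\n"
    "monet,landscape,img/0001.jpg\n"
    "degas,portrait,img/0002.jpg\n"
)

# usecols keeps only the listed columns in the resulting DataFrame.
labels = pd.read_csv(io.StringIO(csv_text), usecols=["artist", "genre"])
print(list(labels.columns))
```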

osanseviero avatar Apr 07 '22 07:04 osanseviero

Bumping the visibility of this :) Is there a recommended way of doing this?

lukasugar avatar Feb 20 '24 16:02 lukasugar

Passing columns=[...] to load_dataset() in streaming mode does work if the dataset is in Parquet format, but for other formats it's either not possible or not implemented.

lhoestq avatar Feb 21 '24 11:02 lhoestq

I tried using columns=['bambara'] on the dataset oza75/bambara-tts, which is in Parquet, but it does not work. This feature is really useful because sometimes you don't want to download the whole dataset but just a few columns.

oza75 avatar Apr 07 '24 13:04 oza75

It doesn't work for a dataset in Parquet format. Are we missing something?

Ravi2712 avatar May 16 '24 14:05 Ravi2712

It only works with streaming=True. When not streaming, it downloads the full files locally before reading the data.
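For reference, the streaming call described in this thread can be sketched roughly as follows. The helper name stream_columns and the "train" split are assumptions made for the example, and the dataset must be stored as Parquet for columns to be honored, as noted above.

```python
def stream_columns(repo_id, columns, split="train"):
    """Open a Parquet-backed Hub dataset in streaming mode, keeping only `columns`.

    `stream_columns` is a hypothetical helper; the "train" split is an assumption.
    """
    # Imported lazily so that defining the helper does not require the library.
    from datasets import load_dataset

    # streaming=True avoids downloading the full files before reading.
    # columns=[...] is forwarded to the Parquet loader.
    return load_dataset(repo_id, split=split, streaming=True, columns=columns)


# Example usage (needs network access, so it is left commented out):
# ds = stream_columns("huggan/wikiart", ["artist", "genre"])
# print(next(iter(ds)))
```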

lhoestq avatar May 17 '24 09:05 lhoestq

Hi @lhoestq, I have a 250GB audio dataset on the Hugging Face Hub in Parquet format. I only wanted to load the text column, but it is taking a lot of time. It seems to be downloading the audio as well, even in streaming mode.

kdcyberdude avatar Jul 06 '24 01:07 kdcyberdude