
Loading Data from HDF files

Open FeryET opened this issue 4 years ago • 7 comments

Is your feature request related to a problem? Please describe. More often than not I come across big HDF datasets, and currently there is no straightforward way to feed them to a dataset.

Describe the solution you'd like I would love to see a from_h5 method that takes a user-implemented interface defining how items are extracted from the dataset (for cases where multiple datasets contain elements such as arrays, metadata, etc.).

Describe alternatives you've considered Currently I manually load HDF files using h5py and implement the PyTorch Dataset interface. For small h5 files I load them into a pandas DataFrame and use the from_pandas function in the datasets package to load them, but for big datasets this is not feasible.
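A minimal sketch of the manual h5py workaround described above, written so that only one chunk of rows is held in memory at a time. The function names, the `chunk_size` parameter, the file name `data.h5` and the keys `image` and `label` are all hypothetical, and the sketch assumes every top-level HDF5 dataset in the file has the same length along its first axis. It is an illustration of the idea, not an API that the datasets package provides.

```python
def iter_rows(group, keys, chunk_size=1024):
    """Yield one dict per row from array-likes stored under `keys`.

    `group` only needs to support group[key], len() and slicing, so an
    open h5py.File works, as does a plain dict of lists.
    """
    n = len(group[keys[0]])  # assumption: all keys share this length
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Read one slice per key; with h5py this pulls only the chunk from disk.
        chunk = {k: group[k][start:stop] for k in keys}
        for i in range(stop - start):
            yield {k: chunk[k][i] for k in keys}


def iter_hdf5_rows(path, keys, chunk_size=1024):
    """Open an HDF5 file read-only and stream its rows chunk by chunk."""
    import h5py  # imported lazily so iter_rows stays usable without h5py

    with h5py.File(path, "r") as f:
        yield from iter_rows(f, keys, chunk_size)


# Hypothetical usage with the datasets package (file name and keys are made up):
# from datasets import Dataset
# ds = Dataset.from_generator(
#     iter_hdf5_rows,
#     gen_kwargs={"path": "data.h5", "keys": ["image", "label"]},
# )
```

Because the generator slices each HDF5 dataset instead of reading it whole, peak memory is bounded by `chunk_size` rows rather than by the file size, which is the limitation of the pandas route mentioned above.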

Additional context HDF files are widespread across different domains and are one of the go-to formats for many researchers, scientists, and engineers who work with numerical data. Given that the use cases of datasets have outgrown NLP, it would make a lot of sense to focus on things like supporting HDF files.

FeryET avatar Oct 19 '21 19:10 FeryET

I'm currently working on bringing Ecoset to huggingface datasets and I would second this request...

DiGyt avatar May 31 '22 12:05 DiGyt

I would also like this support or something similar. Geospatial datasets come in NetCDF, which is derived from HDF5, or in Zarr. I've gotten Zarr stores to work with datasets and streaming, but it takes a while to convert the data to Zarr if it's not stored natively in that format.

jacobbieker avatar Jun 15 '22 22:06 jacobbieker

@mariosasko, I would like to contribute to this "good second issue". Is there anything in the works for this issue, or can I go ahead?

VijayKalmath avatar Jul 31 '22 20:07 VijayKalmath

Hi @VijayKalmath! As far as I know, nobody is working on it, so feel free to take over. Also, before you start, I suggest you comment #self-assign on this issue to assign it to yourself.

mariosasko avatar Aug 08 '22 14:08 mariosasko

#self-assign

VijayKalmath avatar Aug 08 '22 15:08 VijayKalmath

Hey @mariosasko, can you assign this issue to me?

zutarich avatar Oct 09 '23 06:10 zutarich

So basically, we just need to convert HDF5 files to Parquet?

E.g., like this? https://stackoverflow.com/questions/46157709/converting-hdf5-to-parquet-without-loading-into-memory

shermansiu avatar Dec 27 '23 20:12 shermansiu