Support scientific data formats

Open lhoestq opened this issue 3 months ago • 18 comments

List of formats and libraries we can use to load the data in datasets:

  • [ ] DICOMs: pydicom
  • [x] NIfTIs: nibabel
  • [ ] WFDB: wfdb

cc @zaRizk7 for viz

Feel free to comment / suggest other formats and libs you'd like to see, or to share your interest in one of the formats mentioned

lhoestq avatar Oct 09 '25 10:10 lhoestq

Please add support for Zarr! That's what we use in the bioimaging community. It is crucial, because a single raw bio image can take up terabytes in memory!

The python library would be bioio or zarr:

  • [ ] Zarr: bioio or zarr

See a Zarr example

cc @joshmoore

stefanches7 avatar Oct 10 '25 11:10 stefanches7

@stefanches7 zarr is already usable with the hf hub as an array store. See this example from the docs:

import numpy as np
import zarr

embeddings = np.random.randn(50000, 1000).astype("float32")

# Write an array to a repo
with zarr.open_group("hf://my-username/my-model-repo/array-store", mode="w") as root:
   foo = root.create_group("embeddings")
   foobar = foo.zeros('experiment_0', shape=(50000, 1000), chunks=(10000, 1000), dtype='f4')
   foobar[:] = embeddings

# Read an array from a repo
with zarr.open_group("hf://my-username/my-model-repo/array-store", mode="r") as root:
   first_row = root["embeddings/experiment_0"][0]

Is there additional functionality that would not be covered by this?

cakiki avatar Oct 28 '25 19:10 cakiki

@cakiki I think some tiling capabilities, as well as metadata / labels handling. See the ome-zarr docs here: https://ome-zarr.readthedocs.io/en/stable/python.html Visualization would be the cherry on top.
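To make the tiling point concrete, here is a minimal sketch in plain Python of why chunked storage matters for very large images: a reader only has to fetch the chunks that overlap the requested region. This is not the zarr or ome-zarr API; the function name `chunks_for_region` and the example shapes are hypothetical and only illustrate the idea.

```python
from itertools import product


def chunks_for_region(region, chunk_shape):
    """Return the grid indices of every chunk that overlaps `region`.

    region: one (start, stop) pair per axis, with stop exclusive.
    chunk_shape: chunk size along each axis.
    """
    per_axis = []
    for (start, stop), size in zip(region, chunk_shape):
        if stop <= start:
            return []  # an empty region touches no chunks
        first = start // size        # chunk holding the first requested index
        last = (stop - 1) // size    # chunk holding the last requested index
        per_axis.append(range(first, last + 1))
    # Cartesian product of the per-axis chunk ranges
    return list(product(*per_axis))


# Hypothetical 2D image stored in 1000 x 1000 chunks: a 500 x 500 crop
# that straddles a chunk boundary on both axes touches four chunks.
needed = chunks_for_region(((800, 1300), (900, 1400)), (1000, 1000))
print(needed)
```

However large the full image is, the crop above only requires four chunks, which is the property a tiled viewer would rely on.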

cc @joshmoore @lubianat @St3V0Bay: curious what you think

stefanches7 avatar Oct 29 '25 10:10 stefanches7

zarr-specific dataset viewer would be very cool

cakiki avatar Oct 29 '25 13:10 cakiki

Support for BIDS would be perfect. I think all of the biosignal handling can be done with mne. There's a cool community for decoding brain signals, and now EMG as well. The new Meta EMG bracelet is saving its data in BIDS.

I can help with the interface and the coding to try to make this happen. I am available on the Hugging Face Discord with the username aristimunha, if a 1-to-1 discussion would be necessary :)

bruAristimunha avatar Oct 30 '25 21:10 bruAristimunha

@lhoestq , @cakiki , do you think we can make this happen?

bruAristimunha avatar Oct 30 '25 21:10 bruAristimunha

If you give me the OK, I'll create the PR with everything needed for the Biosignal Reader logic. I already studied the nibabel PR :)

bruAristimunha avatar Oct 30 '25 21:10 bruAristimunha

That would be an amazing addition ! Feel free to ping me in your PR for review or if you have questions / if I can help

lhoestq avatar Oct 31 '25 10:10 lhoestq

@bruAristimunha @lhoestq I've remembered a goldmine of a resource for BIDS: https://openneuro.org/

Do you think there is a data-easy way to make those visible here on HuggingFace? Afaik they use datalad to fetch the data. Maybe the best way is to leave OpenNeuro as-is, not connecting it to HuggingFace at all - just an idea I had spontaneously.

stefanches7 avatar Oct 31 '25 11:10 stefanches7

I know an "easy" way to create interoperability with all biosignal datasets from OpenNeuro =)

For biosignal data, we can use EEGDash to create a PyTorch dataset; it automates the fetch, reads lazily, and converts the data to a PyTorch dataset.

I have a question about the best serialization for a Hugging Face dataset, but I can discuss it with some of you on Discord; my username is aristimunha.

bruAristimunha avatar Oct 31 '25 11:10 bruAristimunha

I can explain it publicly too, but I think a short 5-minute conversation would be better than many, many texts to explain the details.

bruAristimunha avatar Oct 31 '25 11:10 bruAristimunha

It's ok to have discussions in one place here (or in a separate issue if it's needed) - I also generally check github more often than discord ^^'

lhoestq avatar Oct 31 '25 14:10 lhoestq

Hi @bruAristimunha @lhoestq any way we could proceed on this? I see someone posted a Nifti visualization PR: https://github.com/huggingface/datasets/pull/7874 - I think it would be a shame if we couldn't accompany that with a neat way to import BIDS Nifti!

stefanches7 avatar Nov 21 '25 09:11 stefanches7

@stefanches7 author of #7874 here, I would be open to expanding the current support to BIDS as well, after having a brief look. Maybe a brief call over Discord (my username: TobiasPitters on the huggingface discord server) might help sort things out, since I am not familiar with BIDS. So getting an understanding of the test cases needed, etc. would be great!

CloseChoice avatar Nov 21 '25 09:11 CloseChoice

Hey!!

From a BIDS perspective, I can provide full support for all biosignal types (EEG, iEEG, MEG, EMG). BIDS is a well-established contract format; I believe we can design something that supports the entire medical domain. I think it just requires a few details to be aligned.
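As an illustration of what "contract format" means in practice, BIDS filenames are built from underscore-separated `key-value` entities followed by a suffix and an extension. The sketch below parses that naming pattern with the standard library only; it is not taken from mne, eegdash, or any BIDS tooling, the helper name `parse_bids_filename` is hypothetical, and it does not validate names against the BIDS specification.

```python
def parse_bids_filename(filename):
    """Split a BIDS-style filename into entities, suffix, and extension.

    Hypothetical helper for illustration; it does not check entity
    names or their order against the BIDS specification.
    """
    # Everything after the first dot is the extension (e.g. ".nii.gz")
    stem, dot, ext = filename.partition(".")
    parts = stem.split("_")
    entities = {}
    # All parts except the last are key-value entities such as "sub-01"
    for part in parts[:-1]:
        key, sep, value = part.partition("-")
        if not sep or not key or not value:
            raise ValueError(f"not a key-value entity: {part!r}")
        entities[key] = value
    # The last part is the suffix that names the data type (e.g. "eeg")
    return {"entities": entities, "suffix": parts[-1], "extension": dot + ext}


info = parse_bids_filename("sub-01_ses-02_task-rest_eeg.edf")
print(info)
```

Because subject, session, and task are recoverable from the name alone, a loader could map them to dataset columns without opening the recording itself.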

From my perspective, the tricky part is how to best adapt and serialize from the Hugging Face perspective.

Under the hood, for the biosignal part, I think I would use mne for interoperability and eegdash to create the serialized dataset, but we can definitely discuss this further. I will ping you @CloseChoice on Discord.

bruAristimunha avatar Nov 21 '25 10:11 bruAristimunha

had a discussion with @neurolabusc and here's a quick wrap-up:

  • BIDS support would be huge (@bruAristimunha would be great if we could catch up on that)
  • DICOM support as well, but that might be harder due to a lot of variety in how headers are handled, vendor specifics etc. So to have a reliable pipeline to interact with whole folders of DICOM files (including metadata) would require a lot of work and a lot of testing. Therefore I set https://github.com/huggingface/datasets/pull/7835 back to draft mode. But there are tools that ease the way, especially https://github.com/ImagingDataCommons/highdicom (or potentially https://github.com/QIICR/dcmqi).
  • Getting users would help us understand what other formats/features are required, so loading a bunch of open datasets to the hub using the new Nifti feature would be great. Some tutorials might help here as well.

CloseChoice avatar Nov 26 '25 15:11 CloseChoice

Hi @CloseChoice and @bruAristimunha, glad to meet you both! We could schedule a call; I am currently moving to a new job, so my time slots are limited, but let's connect over Discord and see what we could do.

  • BIDS: our hackathon team @zuazo @ekarrieta @lakshya16157 put up a BIDS format converter: https://huggingface.co/spaces/stefanches/OpenBIDSifier. Might be useful for imaging dataset conversion to BIDS.
  • DICOM support: cc @St3V0Bay, the author of DICOM support in CroissantML (https://github.com/mlcommons/croissant/pull/942)

cc @nolden

stefanches7 avatar Nov 26 '25 15:11 stefanches7

my username is aristimunha on the Hugging Face Discord, to discuss more

bruAristimunha avatar Nov 26 '25 16:11 bruAristimunha