Support scientific data formats
List of formats and libraries we can use to load the data in datasets:
- [ ] DICOMs: pydicom
- [x] NIfTIs: nibabel
- [ ] WFDB: wfdb
cc @zaRizk7 for viz
Feel free to comment / suggest other formats and libs you'd like to see, or to share your interest in one of the mentioned formats.
Please add support for Zarr! That's what we use in the bioimaging community. It is crucial, because a raw upload of a single bioimage can take terabytes in memory!
The Python library would be bioio or zarr:
- [ ] Zarr: bioio or zarr
See a Zarr example
cc @joshmoore
@stefanches7 zarr is already usable with the hf hub as an array store. See this example from the docs:
import numpy as np
import zarr

embeddings = np.random.randn(50000, 1000).astype("float32")

# Write an array to a repo
with zarr.open_group("hf://my-username/my-model-repo/array-store", mode="w") as root:
    foo = root.create_group("embeddings")
    foobar = foo.zeros('experiment_0', shape=(50000, 1000), chunks=(10000, 1000), dtype='f4')
    foobar[:] = embeddings

# Read an array from a repo
with zarr.open_group("hf://my-username/my-model-repo/array-store", mode="r") as root:
    first_row = root["embeddings/experiment_0"][0]
Is there additional functionality that would not be covered by this?
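For a sense of scale, the chunk layout in the snippet above can be worked out by hand. This is a minimal sketch that uses only the shape, chunk shape, and dtype from that snippet; the helper name `chunk_layout` is hypothetical and not part of zarr, and the byte count is for uncompressed data.

```python
import math

def chunk_layout(shape, chunks, itemsize):
    """Return (chunks per axis, total chunk count, uncompressed bytes per full chunk)."""
    # Number of chunks along each axis, rounding up for partial edge chunks.
    per_axis = tuple(math.ceil(s / c) for s, c in zip(shape, chunks))
    n_chunks = math.prod(per_axis)
    chunk_bytes = math.prod(chunks) * itemsize
    return per_axis, n_chunks, chunk_bytes

# Values from the snippet above; float32 ('f4') is 4 bytes per element.
per_axis, n_chunks, chunk_bytes = chunk_layout((50000, 1000), (10000, 1000), 4)
print(per_axis, n_chunks, chunk_bytes)
```

Under these assumptions the array is split into 5 chunks along the first axis, so reading a single row should only need the one chunk that contains it rather than the whole array.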
@cakiki I think some tiling capabilities, as well as metadata / label handling, would not be covered. See the ome-zarr docs here: https://ome-zarr.readthedocs.io/en/stable/python.html Visualization would be the cherry on top.
cc @joshmoore @lubianat @St3V0Bay: curious what you think
A Zarr-specific dataset viewer would be very cool.
Support for BIDS would be perfect. I think everything biosignal-related can be done with mne. There's a cool community for decoding brain signals, and now EMG as well. The new Meta EMG bracelet saves its data in BIDS.
I can help with the interface and the coding, and try to make this happen. I am available on the Hugging Face Discord with the username aristimunha, if a 1-to-1 discussion is necessary :)
@lhoestq , @cakiki , do you think we can make this happen?
If you give me the OK, I'll create a PR with all the logic for a Biosignal Reader. I have already studied the nibabel PR :)
That would be an amazing addition ! Feel free to ping me in your PR for review or if you have questions / if I can help
@bruAristimunha @lhoestq I've remembered a goldmine of a resource for BIDS: https://openneuro.org/
Do you think there is a data-easy way to make those visible here on HuggingFace? Afaik they use datalad to fetch the data. Maybe the best way is to leave OpenNeuro as-is, not connecting it to HuggingFace at all - just an idea I had spontaneously.
I know an "easy" way to create interoperability with all biosignal datasets from OpenNeuro =)
For biosignal data, we can use EEGDash to create a PyTorch dataset; it automates the fetch, reads lazily, and converts the data to a PyTorch dataset.
I have a question about the best serialization for a Hugging Face dataset, but I can discuss it with some of you on Discord; my username is aristimunha.
I can explain it publicly too, but I think a short 5-minute conversation would be better than many, many texts to explain the details.
It's ok to have discussions in one place here (or in a separate issue if it's needed) - I also generally check github more often than discord ^^'
Hi @bruAristimunha @lhoestq, any way we could proceed on this? I see someone posted a Nifti visualization PR: https://github.com/huggingface/datasets/pull/7874 - I think it would be a shame if we couldn't accompany that with a neat way to import BIDS Nifti!
@stefanches7 author of #7874 here. After a brief look, I would be open to expanding the current support to BIDS as well. Maybe a brief call over Discord (my username: TobiasPitters on the huggingface discord server) would help sort things out, since I am not familiar with BIDS. Getting an understanding of the test cases needed, etc. would be great!
Hey!!
From a BIDS perspective, I can provide full support for all biosignal types (EEG, iEEG, MEG, EMG). BIDS is a well-established contract format; I believe we can design something that supports the entire medical domain. I think it just requires a few details to be aligned.
From my perspective, the tricky part is how to best adapt and serialize from the Hugging Face perspective.
Under the hood, for the biosignal part, I think I would use mne for interoperability and eegdash to create the serialized dataset, but we can definitely discuss this further. I will ping you @CloseChoice on Discord.
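To illustrate the "contract" idea: BIDS file names encode metadata as underscore-separated `key-value` entities followed by a suffix and an extension. Below is a minimal sketch of splitting such a name using only the standard library. The function `parse_bids_name` and the example path are hypothetical, and the sketch does not validate anything against the BIDS specification; in practice a dedicated BIDS library would handle this.

```python
from pathlib import PurePosixPath

def parse_bids_name(path):
    """Split a BIDS-style file name into (entities, suffix, extension)."""
    name = PurePosixPath(path).name
    # Split at the first dot so multi-part extensions like .nii.gz stay whole.
    stem, dot, ext = name.partition(".")
    # Everything before the last underscore-separated token is a key-value entity.
    *pairs, suffix = stem.split("_")
    entities = {}
    for pair in pairs:
        key, sep, value = pair.partition("-")
        if not sep or not key or not value:
            raise ValueError(f"not a key-value entity: {pair!r}")
        entities[key] = value
    return entities, suffix, dot + ext

# Hypothetical example path, for illustration only.
example = "sub-01/ses-01/eeg/sub-01_ses-01_task-rest_run-1_eeg.edf"
print(parse_bids_name(example))
```

A reader built on top of this kind of parsing could expose the entities (subject, session, task, run) as dataset columns, which is one possible way to approach the serialization question.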
had a discussion with @neurolabusc and here's a quick wrap-up:
- BIDS support would be huge (@bruAristimunha would be great if we could catch up on that)
- DICOM support as well, but that might be harder due to a lot of variety in how headers are handled, vendor specifics etc. So to have a reliable pipeline to interact with whole folders of DICOM files (including metadata) would require a lot of work and a lot of testing. Therefore I set https://github.com/huggingface/datasets/pull/7835 back to draft mode. But there are tools that ease the way, especially https://github.com/ImagingDataCommons/highdicom (or potentially https://github.com/QIICR/dcmqi).
- Getting users would help us understand what other formats/features are required. Therefore, loading a bunch of open datasets to the hub using the new Nifti feature would be great. Some tutorials might help here as well.
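On the DICOM point above, one small piece of a folder-level pipeline is grouping files into series from their headers. The following is only a sketch under simplifying assumptions: it groups by `SeriesInstanceUID` and orders by `InstanceNumber`, which real vendor data may not populate reliably. The function `group_by_series` is hypothetical, and the stand-in headers replace real files so the example runs without pydicom.

```python
from collections import defaultdict
from types import SimpleNamespace

def group_by_series(headers):
    """Group header objects by SeriesInstanceUID, ordered by InstanceNumber."""
    series = defaultdict(list)
    for h in headers:
        series[h.SeriesInstanceUID].append(h)
    for items in series.values():
        # InstanceNumber can be missing or empty; such headers sort first.
        items.sort(key=lambda h: getattr(h, "InstanceNumber", None) or 0)
    return dict(series)

# Stand-in headers (assumption: illustration only, no real DICOM files).
# With pydicom, each header could instead come from
# pydicom.dcmread(path, stop_before_pixels=True).
headers = [
    SimpleNamespace(SeriesInstanceUID="1.2.3", InstanceNumber=2),
    SimpleNamespace(SeriesInstanceUID="1.2.3", InstanceNumber=1),
    SimpleNamespace(SeriesInstanceUID="1.2.4", InstanceNumber=1),
]
grouped = group_by_series(headers)
print({uid: [h.InstanceNumber for h in items] for uid, items in grouped.items()})
```

The vendor-specific header handling mentioned above is exactly what this sketch leaves out, which is why tools such as highdicom are worth a look.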
Hi @CloseChoice and @bruAristimunha, glad to meet you both! We could schedule a call; I am currently moving to a new job, so my time slots are limited, but let's connect over Discord and see what we can do.
- BIDS: our hackathon team @zuazo @ekarrieta @lakshya16157 put up a BIDS format converter: https://huggingface.co/spaces/stefanches/OpenBIDSifier. Might be useful for imaging dataset conversion to BIDS.
- DICOM support: cc @St3V0Bay, the author of DICOM support in CroissantML (https://github.com/mlcommons/croissant/pull/942)
cc @nolden
My username is aristimunha on the Hugging Face Discord, if you'd like to discuss more.