Add DICOM support
Supports #7804: adds support for the DICOM file format.
This PR follows PR #7815 and PR #7325 closely.
Notable differences:
I made sure that we can load all of pydicom's test data. In doing so I ran into the force=True parameter, which we explicitly support here. It allows attempting to load corrupted DICOM files, and we explicitly test this!
A dataset containing all of pydicom's test data is available on Hugging Face and can be loaded from this branch with the following script:
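For context on why force=True can be needed: a standard DICOM file starts with a 128-byte preamble followed by the 4-byte marker `DICM`, and pydicom by default refuses files that lack this header. A missing header is one common way a file ends up "corrupted", though not necessarily the issue with the file in the test data. The stdlib-only sketch below illustrates that check; `has_dicom_prefix` is a hypothetical helper for illustration, not part of this PR or of pydicom.

```python
import io

DICOM_PREAMBLE_LENGTH = 128  # bytes reserved before the magic marker
DICOM_MAGIC = b"DICM"        # 4-byte marker that follows the preamble


def has_dicom_prefix(stream) -> bool:
    """Hypothetical helper: True if the stream starts with a 128-byte
    preamble followed by b"DICM"."""
    header = stream.read(DICOM_PREAMBLE_LENGTH + len(DICOM_MAGIC))
    if len(header) < DICOM_PREAMBLE_LENGTH + len(DICOM_MAGIC):
        return False  # too short to contain preamble + marker
    return header[DICOM_PREAMBLE_LENGTH:] == DICOM_MAGIC


# A minimal well-formed header, and a stream that starts directly with data.
well_formed = io.BytesIO(b"\x00" * 128 + b"DICM" + b"rest of file")
headerless = io.BytesIO(b"\x08\x00\x05\x00" + b"\x00" * 200)

print(has_dicom_prefix(well_formed))
print(has_dicom_prefix(headerless))
```

Files that fail a check like this are the kind that would presumably need `Dicom(force=True)` to be decoded at all.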
from datasets import load_dataset
from datasets import Features, ClassLabel
from datasets.features import Dicom

features = Features({
    "dicom": Dicom(force=True),  # necessary to be able to load one corrupted file
    "label": ClassLabel(num_classes=2)
})
ds = load_dataset("TobiasPitters/dicom-sample-dataset",
                  features=features)

error_count = 0
for idx, item in enumerate(ds["test"]):
    dicom = item["dicom"]
    try:
        print(f"Type: {type(dicom)}")
        if hasattr(dicom, 'PatientID'):
            print(f"PatientID: {dicom.PatientID}")
        if hasattr(dicom, 'StudyInstanceUID'):
            print(f"StudyInstanceUID: {dicom.StudyInstanceUID}")
        if hasattr(dicom, 'Modality'):
            print(f"Modality: {dicom.Modality}")
    except Exception as e:
        error_count += 1
        print(e)

print(f"Finished processing with {error_count} errors.")
todo:
- [x] add docs (will do so soon)
Awesome! For the docs, should we rename https://huggingface.co/docs/datasets/nifti_dataset to medical_imaging_dataset and have both DICOM and NIfTI together, or have separate pages, in your opinion?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Makes sense. A combined page is more intuitive for the user, and the pages as proposed in this branch have a lot of overlap. I would structure it as a brief introduction to medical imaging followed by the individual formats (so basically concatenating the two pages and removing duplicates).
Please don't merge yet, since we'll need an embed_storage function in here as well. See
https://github.com/huggingface/datasets/pull/7815#issuecomment-3494094692 and the following conversation
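As a rough illustration of what embedding storage means here: other file-backed features in datasets keep each example as a bytes/path pair, and embedding reads the file at the path so the bytes travel with the dataset. The sketch below is stdlib-only and works on a plain dict. `embed_bytes` is a hypothetical stand-in, and the real embed_storage for Dicom may differ in signature and details.

```python
import os
import tempfile


def embed_bytes(example: dict) -> dict:
    """Hypothetical helper: inline the file contents so the record no
    longer depends on a local path."""
    if example.get("bytes") is None and example.get("path") is not None:
        with open(example["path"], "rb") as f:
            data = f.read()
        # Keep only the file name; the absolute path is machine-specific.
        return {"bytes": data, "path": os.path.basename(example["path"])}
    return example


# Write a tiny placeholder file, embed it, then clean up.
with tempfile.NamedTemporaryFile(suffix=".dcm", delete=False) as tmp:
    tmp.write(b"\x00" * 128 + b"DICM")
embedded = embed_bytes({"bytes": None, "path": tmp.name})
os.remove(tmp.name)

print(len(embedded["bytes"]), embedded["path"].endswith(".dcm"))
```

Without a step like this, a pushed dataset could end up referencing local file paths that do not exist on other machines.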
@lhoestq, this is ready for a first round of review.