Add DICOM support

Open CloseChoice opened this issue 2 months ago • 5 comments

Supports #7804. Adds support for the DICOM file format.

This PR follows PR #7815 and PR #7325 closely. Notable differences: I made sure that we can load all of pydicom's test data, and in doing so came across the force=True parameter, which we explicitly support here. It allows attempting to load corrupted DICOM files, and we test this explicitly.
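For context, here is a minimal stdlib-only sketch of the kind of header check that force=True skips in pydicom. It assumes the standard DICOM file layout (a 128-byte preamble followed by the 4-byte marker "DICM"). The helper name has_dicom_header is hypothetical and is not part of this PR or of pydicom, and I am assuming that Dicom(force=True) simply forwards the flag to pydicom's reader.

```python
# Sketch only: illustrates the DICOM header layout, not pydicom's implementation.
DICOM_PREAMBLE_LENGTH = 128  # bytes of preamble at the start of a DICOM file
DICOM_MAGIC = b"DICM"        # marker expected right after the preamble


def has_dicom_header(data: bytes) -> bool:
    """Return True if `data` has the 'DICM' marker right after a 128-byte preamble."""
    start = DICOM_PREAMBLE_LENGTH
    return data[start:start + len(DICOM_MAGIC)] == DICOM_MAGIC


# A well-formed file starts with the preamble and the marker.
well_formed = bytes(DICOM_PREAMBLE_LENGTH) + DICOM_MAGIC + b"\x02\x00"
# A headerless file starts directly with data elements.
headerless = b"\x08\x00\x05\x00"

print(has_dicom_header(well_formed))
print(has_dicom_header(headerless))
```

Files that fail a check like this are the ones a reader rejects by default and only attempts to parse when force=True is set.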

A dataset containing all of pydicom's test data is available on Hugging Face. It can be loaded on this branch with the following script:

from datasets import load_dataset
from datasets import Features, ClassLabel
from datasets.features import Dicom

features = Features({
    "dicom": Dicom(force=True),  # necessary to be able to load one corrupted file
    "label": ClassLabel(num_classes=2)
})

ds = load_dataset("TobiasPitters/dicom-sample-dataset",
                  features=features)

error_count = 0

for idx, item in enumerate(ds["test"]):
    dicom = item["dicom"]

    try:
        print(f"Type: {type(dicom)}")
        if hasattr(dicom, 'PatientID'):
            print(f"PatientID: {dicom.PatientID}")
        if hasattr(dicom, 'StudyInstanceUID'):
            print(f"StudyInstanceUID: {dicom.StudyInstanceUID}")
        if hasattr(dicom, 'Modality'):
            print(f"Modality: {dicom.Modality}")
    except Exception as e:
        error_count += 1
        print(e)

print(f"Finished processing with {error_count} errors.")
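The repeated hasattr checks in the script above could also be folded into a small helper. This is only a sketch: the summarize function and the FIELDS tuple are hypothetical names, and a SimpleNamespace stands in for the decoded DICOM object so the snippet runs without pydicom installed.

```python
from types import SimpleNamespace

# Attributes printed by the script above.
FIELDS = ("PatientID", "StudyInstanceUID", "Modality")


def summarize(dicom, fields=FIELDS):
    """Map each requested attribute name to its value, or None if it is missing."""
    return {name: getattr(dicom, name, None) for name in fields}


# Stand-in object (assumption): mimics a decoded file that lacks StudyInstanceUID.
stand_in = SimpleNamespace(PatientID="anon-001", Modality="CT")
print(summarize(stand_in))
```

Using getattr with a default keeps the loop body short and treats a missing attribute as None rather than as an error.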

todo:

  • [x] add docs (will do so soon)

CloseChoice avatar Oct 28 '25 10:10 CloseChoice

Awesome! For the docs, should we rename https://huggingface.co/docs/datasets/nifti_dataset to medical_imaging_dataset and cover both DICOM and NIfTI together, or keep separate pages, in your opinion?

lhoestq avatar Nov 05 '25 14:11 lhoestq

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

> Awesome! For the docs, should we rename https://huggingface.co/docs/datasets/nifti_dataset to medical_imaging_dataset and cover both DICOM and NIfTI together, or keep separate pages, in your opinion?

Makes sense. A combined page is more intuitive for users, and the two pages as proposed in this branch overlap a lot. I would structure it as a brief introduction to medical imaging followed by the individual formats (essentially concatenating the two pages and removing duplicates).

CloseChoice avatar Nov 05 '25 14:11 CloseChoice

Please don't merge this yet, since we'll need an embed_storage function in here as well. See https://github.com/huggingface/datasets/pull/7815#issuecomment-3494094692 and the conversation that follows.

CloseChoice avatar Nov 06 '25 15:11 CloseChoice

@lhoestq, this is ready for a first round of review.

CloseChoice avatar Nov 15 '25 19:11 CloseChoice