croissant Adding support for reading medical images

There is a large need (and no solution) for structured sharing of medical data in the context of ML development. Since I have a background in medical image analysis, I would like to use Croissant specifically with medical images. Across the domains of cell biology, histopathology and radiology, there are maybe 5 relevant file formats that need to be considered (each already has a go-to Python lib). Supporting these file formats would make Croissant a very relevant library in the life sciences.

(1) Is this generally of interest for the Croissant community? (2) Should this go to a Croissant extension or part of core?

Also, I have mentioned Croissant as a solution in a proposal for this ARPA-H call about a large-scale exchange platform for imaging data (https://arpa-h.gov/explore-funding/programs/index). I.e., looking into the topic of medical data could be quite opportune.

May 05 '25 15:05 steffenvogler

Thanks for getting this effort started! IMO this is a very interesting use case for Croissant.

To start with, I would suggest examining whether any changes are needed to the Croissant spec to support this use case. This would determine whether a Croissant extension is needed or not.

Based on your description, it sounds like the main aspect is support for the 5 file formats used in this domain. If these have associated mime types, then it should be possible to describe the corresponding files in a Croissant description.
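
As a rough illustration of that point, here is a minimal sketch of how a folder of DICOM files might be declared as Croissant resources. `application/dicom` is the registered MIME type for DICOM; the archive name, URL and glob pattern are made up for the example, and the checksum is a placeholder.

```python
import json

# Minimal Croissant-style resource descriptions for a zipped folder of DICOM files.
# Assumption: file names, URL and glob pattern are hypothetical.
archive = {
    "@type": "cr:FileObject",
    "@id": "ct-scans.zip",
    "name": "ct-scans.zip",
    "contentUrl": "https://example.org/ct-scans.zip",  # hypothetical URL
    "encodingFormat": "application/zip",
    "sha256": "<checksum of the archive>",  # placeholder, not a real hash
}

dicom_files = {
    "@type": "cr:FileSet",
    "@id": "dicom-files",
    "name": "dicom-files",
    "containedIn": {"@id": archive["@id"]},
    "encodingFormat": "application/dicom",  # registered MIME type for DICOM
    "includes": "**/*.dcm",
}

# These entries would go into the "distribution" list of the dataset description.
distribution = [archive, dicom_files]
print(json.dumps(distribution, indent=2))
```

Formats without a registered MIME type would need an agreed-upon convention for `encodingFormat`.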

To make the MLCroissant python library work with those files, we probably need to integrate with the corresponding python libs for these formats.

Next, beyond describing resources as FileObject / FileSets, we need to map their contents to RecordSets. Are any of these formats structured? If so, is there a natural way to map their contents to Fields? If they are akin to columns, then just specifying the corresponding column ids / names is sufficient. If the data is hierarchical, then we need something like a path language.
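
To make the hierarchical case concrete, here is a small sketch of what "something like a path language" could look like. The dotted-path syntax, the `resolve` helper and the sample metadata are all hypothetical and only illustrate the idea of mapping nested content to flat fields; they are not Croissant's own extraction mechanism.

```python
from typing import Any


def resolve(record: Any, path: str) -> Any:
    """Follow a dotted path such as 'series.0.modality' through nested dicts and lists."""
    current = record
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]  # list step: the part is an index
        else:
            current = current[part]  # dict step: the part is a key
    return current


# Hypothetical hierarchical metadata, as it might come out of an image header.
study = {"patient": {"id": "P001"}, "series": [{"modality": "CT", "slices": 120}]}

# Hypothetical mapping from field names to paths into the hierarchy.
field_paths = {"patient_id": "patient.id", "modality": "series.0.modality"}

row = {field: resolve(study, path) for field, path in field_paths.items()}
print(row)
```

For column-like data the mapping degenerates to a plain column name per field, so no path language is needed there.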

May 12 '25 09:05 benjelloun

@steffenvogler cool use case! We are also looking for ways to integrate bioimaging data with Croissant: #CroissantML > Publishing Bioimaging Datasets for AI. Maybe there could be some synergies with medical imaging. In my department we also work extensively with medical images, so happy to chat about this too :)

One thing that should be integrated in Croissant could be the RTSTRUCT annotation format - or maybe there are specialized Python libs that could facilitate the integration? As it is essentially a structured format, maybe we could map the annotations to RecordSets / Fields with appropriate semantic annotations.

May 12 '25 09:05 stefanches7

Thank you so much @benjelloun for the rough direction on how to determine the complexity of extending Croissant's capabilities.

@stefanches7: Oh... I didn't see the MLCroissant Bioimaging thread on image.sc (I am rarely there). Since I am originally from bioimaging, I know both sides of the game and found the distinction between bio and medical imaging very unfortunate and suboptimal. Ideally we join forces, but without being spread thin between too many features.

Just technically, IMHO the file formats in bioimaging are way more advanced than in medical imaging (ZARR & OMERO vs DICOM & SVS), and maybe we can help to connect the two domains. There is also another dependency: a separate thread (only on email so far) around the concept of BioCroissant, where *omics data and lab measurements can be stored in a structured way. In the end, any image needs contextual data - otherwise it is a bit obsolete. I can provide more details in a call. I'll DM you on this.

May 13 '25 13:05 steffenvogler

Hi all, really cool to see this take off! We are working on prototypes for omics as well as genetics and prior knowledge (I am a PI at Helmholtz Munich and Open Targets). Would be happy to join the discussion. We already have a pipeline that dynamically ingests Open Targets data using its metadata descriptions (which will be Croissant ML in the next release). Would be awesome to connect also to the medical imaging domain (@stefanches7 thanks for making the link).

May 15 '25 10:05 slobentanzer

@slobentanzer maybe in a wider context the BioCroissant will be interesting too. They do work with omics, not sure about the prior knowledge - see https://docs.google.com/presentation/d/1uREePrWgJjYXySOJ-ylklGhPUHf_FLMcAWeDpHnZpPU/edit?usp=sharing for the intro.

There is apparently also the BioCroissant mailing list, but I could not find the link to it. Maybe you would still have it @steffenvogler ?

May 15 '25 11:05 stefanches7

@stefanches7 - the BioCroissant working group is just about to be officially created and so far there is only a lengthy email thread, that essentially tries to collect experts with potential interest. I will reply to that thread, put you in CC and do a little intro. You can follow up from there.

May 16 '25 09:05 steffenvogler

🤚🏽 I'd be interested in lurking, too.

Aug 22 '25 08:08 joshmoore

@steffenvogler @stefanches7 sorry for randomly bringing this up again, but I don't think I saw an email; can you invite me (possibly again)?

Aug 22 '25 08:08 slobentanzer

@slobentanzer @joshmoore @stefanches7 - Hi there, I went for a low hanging fruit: extending Croissant to DICOM using pydicom - link to PR above. This has a lot of potential for my peers in medical imaging and development of diagnostic tools for the clinic (no compromises with data lineage, extended metadata to detect annotator bias, intended use, responsible AI etc etc). There is no data transformation (that we discussed in a call a few months ago).
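
For readers who have not used pydicom, here is a minimal sketch of turning a DICOM header into a flat record. It is not the implementation from the PR; the chosen keywords and the helper names are assumptions. `pydicom.dcmread(..., stop_before_pixels=True)` reads only the header, which keeps this cheap for large studies.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable

# Standard DICOM keywords; which ones a dataset exposes is a project decision.
DEFAULT_KEYWORDS = ("PatientID", "Modality", "StudyDate", "Rows", "Columns")


def header_to_record(ds: Any, keywords: Iterable[str] = DEFAULT_KEYWORDS) -> Dict[str, Any]:
    """Copy selected header attributes into a plain dict; missing ones become None."""
    return {kw: getattr(ds, kw, None) for kw in keywords}


def read_dicom_record(path: str) -> Dict[str, Any]:
    """Read only the header of a DICOM file (no pixel data) and flatten it."""
    import pydicom  # third-party; imported lazily so the sketch runs without it

    ds = pydicom.dcmread(path, stop_before_pixels=True)
    return header_to_record(ds)


# Stand-in for a pydicom Dataset so the example runs without a real file.
fake_header = SimpleNamespace(PatientID="P001", Modality="MR", Rows=256, Columns=256)
record = header_to_record(fake_header)
print(record)
```

With a real file, `read_dicom_record("scan.dcm")` would return the same kind of dict, with pydicom's own value types for some elements.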

I also understand that bioimages and digital pathology are very different beasts to tame. But maybe we can build some momentum here. We can do bioformats or openslide next, or even non-image formats. Maybe we already have the first topics for BioCroissant.

Sep 04 '25 13:09 St3V0Bay

Nice, @St3V0Bay! Looking at https://github.com/mlcommons/croissant/commit/9a9512921587367b81c932bf0e68df8225c9a80e I definitely have some thoughts about how to do the same thing for bioimages. For pydicom, I'd suggest going with bioio. For the MIME Type we might have to get creative.

cc: @toloudis

Sep 26 '25 16:09 joshmoore

Great thing, and definitely an inspiration for the Bioimaging world @St3V0Bay ! @joshmoore thanks for the bioio pointer. Another thing caught my attention that I thought is interesting for the round: https://github.com/mlcommons/croissant/pull/883 - FHIR record support for Croissant.

cc @nolden

Oct 09 '25 07:10 stefanches7

Then I'll cc: @ericprud while we're at it ;)

Oct 09 '25 13:10 joshmoore

Yes, a DICOM reader as bioio package (wrapping pydicom or something) would be very interesting...

Oct 09 '25 13:10 toloudis

(1) 3 days ago I presented this at a conference, and 3 companies from the healthcare space are interested. We might join forces on missing features.

(2) Next on my personal list is to enable chunk-wise reading of WSIs from remote filesystems (see https://github.com/Bayer-Group/tiffslide), i.e. direct sampling of patches from remote storage without downloading GBs of data.
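
A rough sketch of what patch-wise sampling could look like: the grid computation is plain Python, while `read_remote_patch` is an untested assumption about combining fsspec with tiffslide's OpenSlide-style `read_region(location, level, size)`. The function names and the example dimensions are made up.

```python
from typing import Iterator, Tuple


def patch_origins(width: int, height: int, patch: int, stride: int) -> Iterator[Tuple[int, int]]:
    """Yield top-left (x, y) corners of all patches that fit fully inside the slide."""
    for y in range(0, height - patch + 1, stride):
        for x in range(0, width - patch + 1, stride):
            yield (x, y)


def read_remote_patch(url: str, origin: Tuple[int, int], patch: int, level: int = 0):
    """Fetch a single patch from a remote slide; only the needed byte ranges are read."""
    import fsspec  # third-party, imported lazily
    from tiffslide import TiffSlide  # third-party, imported lazily

    with fsspec.open(url) as f:
        slide = TiffSlide(f)
        return slide.read_region(origin, level, (patch, patch))


# Example: non-overlapping 256 px patches on a (hypothetical) 1024 x 768 level.
origins = list(patch_origins(1024, 768, patch=256, stride=256))
print(len(origins), origins[:3])
```

In practice one would open the slide once and reuse it across patches rather than reopening it per call.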

Together with bioio and FHIR support, this becomes pretty impressive IMO.

It seems we have critical momentum to spin off a BioCroissant extension. I am going to send out an invite early next week…

Cc: @ccl-core

Oct 09 '25 14:10 steffenvogler

Me again: I can admit I am a bit hyped about this activity. We (as a community) were never this close to a lazy self-assembly of datasets with multiple modalities that works across organisations.

Advantages:

- cutting days/weeks of data prep
- no need for multi-institutional data sharing platforms
- no need for complicated access management in scenarios of multi-institution datasets (while keeping data sovereignty)

One can simply create assembly instructions (i.e. a Croissant file), share them with research partners next door, and voilà. Data sovereignty remains with the data owner and is controlled conventionally with access tokens (the Croissant lib can read them from the local $ENV and simply try to assemble the datasets as far as permissions allow).
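
To illustrate the "assemble as far as permissions allow" idea, here is a small sketch. It is not how the Croissant library handles credentials; the source names, environment variable names and the `plan_assembly` helper are hypothetical.

```python
import os
from typing import Dict, List, Mapping, Tuple

# Hypothetical mapping: data source -> environment variable holding its access token.
SOURCES: Dict[str, str] = {
    "imaging": "IMAGING_TOKEN",
    "ehr": "EHR_TOKEN",
    "omics": "OMICS_TOKEN",
}


def plan_assembly(
    sources: Mapping[str, str], env: Mapping[str, str] = os.environ
) -> Tuple[List[str], List[str]]:
    """Split sources into those with a token in the environment and those without."""
    available: List[str] = []
    skipped: List[str] = []
    for name, variable in sources.items():
        (available if env.get(variable) else skipped).append(name)
    return available, skipped


# Simulated environment of a partner who was granted imaging and omics access only.
available, skipped = plan_assembly(
    SOURCES, env={"IMAGING_TOKEN": "abc", "OMICS_TOKEN": "xyz"}
)
print("assemble:", available, "skip:", skipped)
```

The same Croissant file can then be shared with every partner, and each one ends up with the subset their tokens permit.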

It's 100% reproducible and brings loads of the urgently required contextual data (image + EHR + omics + responsible AI details like annotator demographics) right to the fingertips of the researchers/developers.

New research questions could be unlocked:

- hyper-conditional ML models
- "learned" better model forensics (put context data in error backprop)
- "smart" bias handling during inference (runtime adaptation)
- concept of "sets of datasets", i.e. web-crawling through online (or in-house) data repositories when searching for complementary or similar datasets (with Croissant we web-crawled 5000 datasets and clustered them by their embeddings; this was done with a single script)

Looking very much forward to kick this off.

Oct 10 '25 09:10 St3V0Bay

Sounds great @St3V0Bay - currently looking into bioimages with bioio. It works as a one-liner for Zarr, but I could not find anything for WSI. In Bio-Formats I did, though - might that be helpful?

Apart from this, very interested in the web-crawler of the Croissants - might that be shareable somewhere, or at least a draft? We could use it for https://github.com/mlcommons/croissant/tree/main/croissant-rdf too then.

cc @david4096

Oct 10 '25 10:10 stefanches7

@luisoala - Stefan asks for the details on the web-crawl exercise with OpenML, but I cannot find the link and docs currently. Do you have any shareables from back then?

Oct 10 '25 10:10 St3V0Bay

@stefanches7: bioio supports all of Bio-Formats but also has some WSI formats natively like CZI.

Oct 10 '25 10:10 joshmoore

hi gang

pierre and i parked all things croissant crawler here https://github.com/mlcommons/croissant/tree/main/health

i might still have the parquets w crawl results somewhere but need to check

it has reports on openml and hf under /visualizer and a live prototype for a dataset universe explorer

two notes:

- most vendors have rate limits for web requests
- a more recent but also stale version of the index is exposed via our croissant mcp, you can interact w it through your ide/mcp client under following config

{
  "mcpServers": {
    "croissant-mcp": {
      "url": "http://35.87.210.99:8000/sse",
      "transport": "sse"
    }
  }
}
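
on the rate-limit note above: a minimal sketch of a throttled crawl loop, assuming a fixed minimum interval between requests is enough for the vendor in question. `throttled_map` and its parameters are hypothetical and not part of the crawler in the repo; the fetch function is a stand-in for a real HTTP call.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def throttled_map(
    fetch: Callable[[str], T],
    urls: Iterable[str],
    min_interval: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[T]:
    """Call fetch(url) for each url, waiting so that calls start at least min_interval apart."""
    results: List[T] = []
    last_start = None
    for url in urls:
        if last_start is not None:
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results.append(fetch(url))
    return results


# Demo with a fake fetch and a recording sleep, so nothing is actually requested or delayed.
sleeps: List[float] = []
results = throttled_map(lambda u: u.upper(), ["a", "b", "c"], min_interval=0.5, sleep=sleeps.append)
print(results, len(sleeps))
```

real crawls would likely also need retries and backoff on HTTP 429 responses, which this sketch leaves out.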

thx for pushing this, i like your framing @St3V0Bay happy to support on a concrete use case/demo

Oct 10 '25 11:10 luisoala