Adding support for reading medical images
There is a large need (and no solution) for structured sharing of medical data in the context of ML development. Since I have a background in medical image analysis, I would like to use Croissant specifically with medical images. Across cell biology, histopathology and radiology, there are maybe 5 relevant file formats to consider (each already has a go-to Python lib). Support for these file formats would make Croissant a very relevant library in the life sciences.
(1) Is this generally of interest for the Croissant community? (2) Should this go to a Croissant extension or part of core?
Also, I have mentioned Croissant as a solution in a proposal for this ARPA-H call about a large-scale exchange platform for imaging data (https://arpa-h.gov/explore-funding/programs/index). I.e., looking into the topic of medical data now could be quite opportune.
Thanks for getting this effort started! IMO this is a very interesting use case for Croissant.
To start with, I would suggest examining whether any changes are needed to the Croissant spec to support this use case. This would determine whether a Croissant extension is needed or not.
Based on your description, it sounds like the main aspect is support for the 5 file formats used in this domain. If these have associated mime types, then it should be possible to describe the corresponding files in a Croissant description.
To make the MLCroissant python library work with those files, we probably need to integrate with the corresponding python libs for these formats.
Next, beyond describing resources as FileObjects / FileSets, we need to map their contents to RecordSets. Are any of these formats structured? If so, is there a natural way to map their contents to Fields? If they are akin to columns, then just specifying the corresponding column ids / names is sufficient. If the data is hierarchical, then we need something like a path language.
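A minimal sketch of what such a description could look like for DICOM, written as a plain Python dict so it runs without any Croissant tooling. The dataset name, ids and URL are placeholders, and the fragment is abbreviated (no @context, no checksums), so it is not a complete, valid Croissant file. `application/dicom` is the registered MIME type for DICOM; whether a given library can decode the `content` of such files into pixels is exactly the open question in this thread.

```python
import json

# Hypothetical dataset: a zip archive of DICOM files exposed as a FileSet.
# All names, ids and URLs below are illustrative placeholders.
croissant = {
    "@type": "sc:Dataset",
    "name": "example_dicom_dataset",
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "archive",
            "contentUrl": "https://example.org/dicom.zip",
            "encodingFormat": "application/zip",
        },
        {
            "@type": "cr:FileSet",
            "@id": "dicom-files",
            "containedIn": {"@id": "archive"},
            "encodingFormat": "application/dicom",
            "includes": "**/*.dcm",
        },
    ],
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "images",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "images/path",
                    "dataType": "sc:Text",
                    "source": {
                        "fileSet": {"@id": "dicom-files"},
                        "extract": {"fileProperty": "fullpath"},
                    },
                },
                {
                    "@type": "cr:Field",
                    "@id": "images/content",
                    "dataType": "sc:ImageObject",
                    "source": {
                        "fileSet": {"@id": "dicom-files"},
                        "extract": {"fileProperty": "content"},
                    },
                },
            ],
        }
    ],
}


def field_sources(doc):
    """Return {field id: id of the FileSet it reads from}."""
    return {
        field["@id"]: field["source"]["fileSet"]["@id"]
        for record_set in doc["recordSet"]
        for field in record_set["field"]
    }


serialized = json.dumps(croissant, indent=2)
sources = field_sources(croissant)
```

Here each DICOM file becomes one record with a path and its raw content. Mapping individual DICOM tags to Fields would need an extraction mechanism beyond `fileProperty`, which is where the "path language" question comes in.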
@steffenvogler cool use case! We are also looking for the ways to integrate Bioimaging data with Croissant: #CroissantML > Publishing Bioimaging Datasets for AI, maybe there could be some synergies with medical imaging. And in my department we also extensively work with medical images, so happy to chat about this too :)
One thing that should be integrated in Croissant could be the RTSTRUCT annotation format - or maybe there are specialized Python libs that could facilitate the integration? As it is essentially a structured format, maybe we could map the annotations to RecordSets / Fields with appropriate semantic annotations.
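As a rough sketch of that mapping: RTSTRUCT stores ROIs and their contours in nested sequences, which can be flattened into one record per contour. The attribute names below are the standard DICOM keywords (`StructureSetROISequence`, `ROIContourSequence`, `ContourSequence`, `ContourData`), which is also how pydicom exposes them. To keep the example runnable without pydicom or a real file, it uses a small stand-in object; the record layout itself is an assumption, not an agreed Croissant schema.

```python
from types import SimpleNamespace


def rtstruct_to_records(ds):
    """Flatten an RTSTRUCT-like object into one record per contour."""
    # Map ROI number -> ROI name from the structure set.
    roi_names = {roi.ROINumber: roi.ROIName for roi in ds.StructureSetROISequence}
    records = []
    for roi_contour in ds.ROIContourSequence:
        number = roi_contour.ReferencedROINumber
        # An ROI may carry no contours at all, so default to an empty list.
        for contour in getattr(roi_contour, "ContourSequence", []):
            flat = [float(v) for v in contour.ContourData]
            # ContourData is a flat list: x1, y1, z1, x2, y2, z2, ...
            points = [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]
            records.append({
                "roi_number": number,
                "roi_name": roi_names.get(number),
                "geometric_type": contour.ContourGeometricType,
                "points": points,
            })
    return records


# Stand-in for a parsed RTSTRUCT file (hypothetical values).
demo = SimpleNamespace(
    StructureSetROISequence=[SimpleNamespace(ROINumber=1, ROIName="tumor")],
    ROIContourSequence=[
        SimpleNamespace(
            ReferencedROINumber=1,
            ContourSequence=[
                SimpleNamespace(
                    ContourGeometricType="CLOSED_PLANAR",
                    ContourData=[0, 0, 1, 10, 0, 1, 10, 10, 1],
                )
            ],
        )
    ],
)

records = rtstruct_to_records(demo)
```

Each key of the resulting records (`roi_name`, `geometric_type`, `points`) would be a natural candidate for a Field in a RecordSet.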
Thank you so much @benjelloun for the rough direction on how to determine the complexity of extending Croissant's capabilities.
@stefanches7: Oh... I didn't see the MLCroissant Bioimaging thread on image.sc (I am rarely there). Since I am originally from bioimaging, I know both sides of the game and found the distinction between bio and medical imaging very unfortunate and suboptimal. Ideally we join forces, but without being spread thin between too many features.
Just technically, IMHO the file formats in bioimaging are way more advanced than in medical imaging (ZARR & OMERO vs DICOM & SVS), and maybe we can help to connect the two domains. Also, there is another dependency: there is a separate thread (only on email so far) around a concept of BioCroissant, where *omics data and lab measurements can be stored in a structured way. In the end, any image needs contextual data - otherwise it is a bit obsolete. I can provide more details in a call. I'll DM you on this.
Hi all, really cool to see this take off! We are working on prototypes for omics as well as genetics and prior knowledge (I am a PI at Helmholtz Munich and Open Targets). Would be happy to join the discussion. We already have a pipeline that dynamically ingests Open Targets data using its metadata descriptions (which will be Croissant ML in the next release). Would be awesome to connect also to the medical imaging domain (@stefanches7 thanks for making the link).
@slobentanzer maybe in a wider context the BioCroissant will be interesting too. They do work with omics, not sure about the prior knowledge - see https://docs.google.com/presentation/d/1uREePrWgJjYXySOJ-ylklGhPUHf_FLMcAWeDpHnZpPU/edit?usp=sharing for the intro.
There is apparently also the BioCroissant mailing list, but I could not find the link to it. Maybe you would still have it @steffenvogler ?
@stefanches7 - the BioCroissant working group is just about to be officially created and so far there is only a lengthy email thread, that essentially tries to collect experts with potential interest. I will reply to that thread, put you in CC and do a little intro. You can follow up from there.
🤚🏽 I'd be interested in lurking, too.
@steffenvogler @stefanches7 sorry for randomly bringing this up again, but I don't think I saw an email; can you invite me (possibly again)?
@slobentanzer @joshmoore @stefanches7 - Hi there, I went for a low hanging fruit: extending Croissant to DICOM using pydicom - link to PR above. This has a lot of potential for my peers in medical imaging and development of diagnostic tools for the clinic (no compromises with data lineage, extended metadata to detect annotator bias, intended use, responsible AI etc etc). There is no data transformation (that we discussed in a call a few months ago).
I also understand that bioimages and digital pathology are very different beasts to tame. But maybe we can build some momentum here. We can do bioformats or openslide next, or even non-image formats. Maybe we have the first topics for BioCroissant already.
Nice, @St3V0Bay! Looking at https://github.com/mlcommons/croissant/commit/9a9512921587367b81c932bf0e68df8225c9a80e I definitely have some thoughts about how to do the same thing for bioimages. For pydicom, I'd suggest going with bioio. For the MIME Type we might have to get creative.
cc: @toloudis
Great thing, and definitely an inspiration for the Bioimaging world @St3V0Bay ! @joshmoore thanks for the bioio pointer. Another thing caught my attention that I thought is interesting for the round: https://github.com/mlcommons/croissant/pull/883 - FHIR record support for Croissant.
cc @nolden
Then I'll cc: @ericprud while we're at it ;)
Yes, a DICOM reader as bioio package (wrapping pydicom or something) would be very interesting...
(1) 3 days ago I presented this at a conference, and 3 companies from the healthcare space are interested. We might join forces on missing features.
(2) Next on my personal list is enabling chunk-wise reading of WSI from remote filesystems (see https://github.com/Bayer-Group/tiffslide), i.e. direct sampling of patches from remote storage w/o downloading GBs of data.
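To make point (2) a bit more concrete, here is a small sketch of the sampling side only: it computes a grid of patch locations and requests just those regions from a reader. The `read_region(location, level, size)` call shape follows the OpenSlide-style API that tiffslide mirrors; the reader used below is a fake stand-in, so nothing is actually fetched, and the function names are my own.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Location = Tuple[int, int]
Size = Tuple[int, int]


def patch_grid(width: int, height: int, patch: int,
               stride: Optional[int] = None) -> Iterator[Location]:
    """Yield top-left (x, y) corners of patches that fit fully inside the slide."""
    stride = stride or patch
    for y in range(0, height - patch + 1, stride):
        for x in range(0, width - patch + 1, stride):
            yield (x, y)


def sample_patches(read_region: Callable[[Location, int, Size], object],
                   dimensions: Size, patch: int, level: int = 0,
                   limit: Optional[int] = None) -> List[object]:
    """Read patches one by one; only the requested regions are touched."""
    width, height = dimensions
    patches = []
    for i, location in enumerate(patch_grid(width, height, patch)):
        if limit is not None and i >= limit:
            break
        patches.append(read_region(location, level, (patch, patch)))
    return patches


# Fake reader that records what was requested instead of fetching pixels.
requested = []


def fake_read_region(location, level, size):
    requested.append((location, level, size))
    return location


locations = list(patch_grid(1000, 600, 256))
sampled = sample_patches(fake_read_region, (1000, 600), 256, limit=2)
```

With a real slide object, `slide.read_region` would be passed in place of `fake_read_region`, and a remote-capable backend would translate each call into reads of only the tiles that overlap the requested window.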
Together with bioio and FHIR support, this becomes pretty impressive IMO.
It seems we have critical momentum to spin off a BioCroissant extension. I am going to send out an invite early next week…
Cc: @ccl-core
Me again: I can admit I am a bit hyped about this activity. We (as a community) were never this close to a lazy self-assembly of datasets with multiple modalities that works across organisations.
Advantages:
- cutting days/weeks of data prep
- no need for multi-institutional data sharing platforms
- no need for complicated access management in scenarios of multi-institution datasets (while keeping data sovereignty)
One can simply create assembly instructions (i.e. a Croissant file) and share them with research partners next door, and voila. Data sovereignty remains with the data owner and is controlled conventionally with access tokens (the Croissant lib can read from local $ENV and simply try to assemble the datasets as far as permissions allow).
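A toy sketch of the "assemble as far as permissions allow" idea, assuming each source names the environment variable that holds its access token. The source list, URLs and variable names are hypothetical and are not part of the Croissant library.

```python
import os
from typing import List, Mapping, Optional, Tuple

# Hypothetical sources of a multi-institution dataset. "token_env" names the
# environment variable that must hold an access token (None = public source).
SOURCES = [
    {"name": "imaging", "url": "https://hospital-a.example/dicom", "token_env": "HOSPITAL_A_TOKEN"},
    {"name": "ehr", "url": "https://hospital-b.example/fhir", "token_env": "HOSPITAL_B_TOKEN"},
    {"name": "public_omics", "url": "https://repo.example/omics", "token_env": None},
]


def plan_assembly(sources: List[dict],
                  env: Optional[Mapping[str, str]] = None) -> Tuple[List[str], List[str]]:
    """Split sources into those we can read and those we must skip."""
    env = os.environ if env is None else env
    accessible, skipped = [], []
    for source in sources:
        var = source["token_env"]
        # Public sources need no token; protected ones need a non-empty value.
        if var is None or env.get(var):
            accessible.append(source["name"])
        else:
            skipped.append(source["name"])
    return accessible, skipped


# Example: a partner who only holds a token for hospital A.
accessible, skipped = plan_assembly(SOURCES, {"HOSPITAL_A_TOKEN": "secret"})
```

The same assembly instructions then yield a different but well-defined subset for each partner, depending on which tokens are present locally.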
It's 100% reproducible and brings loads of the urgently required contextual data (image + EHR + omics + responsible AI details like annotator demographics) right to the fingertips of the researchers/developers.
New research questions could be unlocked:
- hyper-conditional ML models
- better "learned" model forensics (put context data in error backprop)
- "smart" bias handling during inference (runtime adaptation)
- concept of "sets of datasets", i.e. web-crawl through online (or in-house) data repository when searching for complementary or similar datasets (with Croissant we web-crawled 5000 datasets and clustered them with their embeddings, this is done with a single script)
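To illustrate the last point, here is a deliberately tiny sketch of finding the most similar dataset from its description. It swaps in a bag-of-words vector and cosine similarity for real embeddings, and the three descriptions are made up, so it shows the idea rather than the script used for the actual crawl.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: word counts of the dataset description."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_similar(query, descriptions):
    """Return the name of the dataset whose description is closest to the query's."""
    q = embed(descriptions[query])
    scores = {
        name: cosine(q, embed(text))
        for name, text in descriptions.items()
        if name != query
    }
    return max(scores, key=scores.get)


# Hypothetical descriptions, as they might be pulled from Croissant metadata.
descriptions = {
    "chest_ct": "ct scans of the chest with lung nodule annotations",
    "lung_xray": "chest xray images with lung disease labels",
    "tweets": "short social media posts with sentiment labels",
}

neighbor = most_similar("chest_ct", descriptions)
```

Replacing `embed` with a proper text-embedding model and running a clustering algorithm over all vectors would give the "dataset universe" view described above.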
Very much looking forward to kicking this off.
Sounds great @St3V0Bay - currently looking into bioimages with bioio. It works as a one-liner for Zarr, but I could not find anything for WSI - however, in Bio-Formats I did - might be helpful?
Apart from this, very interested in the web-crawler of the Croissants - might that be shareable somewhere, or at least a draft? We could use it for https://github.com/mlcommons/croissant/tree/main/croissant-rdf too then.
cc @david4096
@luisoala - Stefan asks for the details on the web-crawl exercise with OpenML, but I cannot find the link and docs currently. Do you have any shareables from back then?
@stefanches7: bioio supports all of Bio-Formats but also has some WSI formats natively like CZI.
hi gang
pierre and i parked all things croissant crawler here https://github.com/mlcommons/croissant/tree/main/health
i might still have the parquets w crawl results somewhere but need to check
it has reports on openml and hf under /visualizer and a live prototype for a dataset universe explorer
two notes:
- most vendors have rate limits for web requests
- a more recent but also stale version of the index is exposed via our croissant mcp, you can interact w it through your ide/mcp client under the following config
{ "mcpServers": { "croissant-mcp": { "url": "http://35.87.210.99:8000/sse", "transport": "sse" } } }
thx for pushing this, i like your framing @St3V0Bay happy to support on a concrete use case/demo