spatialdata-io
Verify compatibility with Xenium Onboard Analysis 3.0
A new version of XOA has been released, which supports the new Xenium Prime 5K panels; from the changelog I believe that no modifications are required for the xenium() reader to support the new format.
- [ ] Still we should verify this.
- [x] Also, we should add the new small test datasets to the CI.
The small dataset "Xenium_Prime_MultiCellSeg_Mouse_Ileum_tiny" seems to be invalid. I added the other one in the GitHub workflow that prepares the test data.
- [x] tests need to be added.
Hi there, I'd like to chime in on this - I'm working with a dataset from XOA v3.2.1.2
I'm noticing that the cell_id field in my table does not align with the index in my cell_boundaries. Also, the size of my cell_boundaries is modestly different from that of my cell_labels.
I'm running the latest versions of spatialdata and spatialdata-io.
Hi @benemead thanks for reaching out. Are you working on a public dataset/can you reproduce on a public dataset? Happy to assist.
@timtreis are you using the same XOA version?
Hi @LucaMarconato, appreciate the prompt reply!
Unfortunately not public, and our data is coming from a 3rd party who runs the instrument.
If there's a way to abbreviate or anonymize my current data, I'd gladly share (also xenium slides are massive - this one's clocking in at ~800k labels).
From inspection of some of the outputs in the xenium dir (cell .csv.gz files) I can see that what's been loaded for 'cell_id' does not match - rather looks like the hex conversion is starting from 0 and increasing one by one.
Happy to take a pass at it myself and report back if you all had pointers - I'm pretty unfamiliar with this codebase.
I checked the changelog for XOA 3.2.1 https://www.10xgenomics.com/support/software/xenium-onboard-analysis/latest/release-notes/release-notes-for-xoa and I don't think the problem is tied to that version. It could instead be due to the fact that, from XOA 3.0.0, one could have cells with no nuclei, or cells with multiple nuclei.
A way to share anonymized data could be the following:
- extract indices from cell labels and nuclei using `spatialdata.get_element_instances()`, save as a series into 2 `.csv` files
- save indices of cell boundaries and nucleus boundaries into 2 `.csv` files
- given `_, region_key, instance_key = spatialdata.models.get_table_keys(sdata['table'])`, extract the `region_key` and `instance_key` columns from `sdata['table'].obs` into a `.csv` file.

If you could share the above it would be great.
Also, you could check what you can share from the file cells.zarr.zip. This contains some important metadata used to link the nuclei with the table.
Finally, please note that for Xenium data, all the code for parsing is contained in a single file (<800 lines of code) https://github.com/scverse/spatialdata-io/blob/main/src/spatialdata_io/readers/xenium.py, so if you could try to debug it and share more information on where you get the error, or which value the variables have when you get the error, this could help a lot!
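The three export steps listed above could be sketched roughly as follows. This is only a sketch under assumptions, not code from spatialdata-io: the helper names `write_ids` and `export_ids_for_sharing` are hypothetical, the element names (`cell_labels`, `nucleus_labels`, `cell_boundaries`, `nucleus_boundaries`, `table`) and the output file names are assumed, and it relies on `get_element_instances` from `spatialdata` and `get_table_keys` from `spatialdata.models`.

```python
import csv
from pathlib import Path


def write_ids(ids, path, header="id"):
    """Write an iterable of identifiers to a one-column CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([header])
        writer.writerows([i] for i in ids)


def export_ids_for_sharing(sdata, out_dir):
    """Dump only indices and key columns, no expression or image data."""
    # Imported lazily so write_ids stays usable without spatialdata installed.
    from spatialdata import get_element_instances
    from spatialdata.models import get_table_keys

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Assumed element names; adjust to whatever the SpatialData object contains.
    for name in ("cell_labels", "nucleus_labels"):
        # Integer instance ids found in the labels raster.
        write_ids(get_element_instances(sdata[name]), out_dir / f"{name}.csv", header=name)
    for name in ("cell_boundaries", "nucleus_boundaries"):
        # Index of the shapes GeoDataFrame.
        write_ids(sdata[name].index, out_dir / f"{name}.csv", header=name)

    # Columns that link each table row to an element and to an instance.
    table = sdata["table"]
    _, region_key, instance_key = get_table_keys(table)
    table.obs[[region_key, instance_key]].to_csv(out_dir / "table_metadata.csv")
```

Calling `export_ids_for_sharing(sdata, "to_share")` would then leave five small `.csv` files in `to_share/` that contain identifiers only.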
Hi @LucaMarconato - apologies for my slow reply - have dug into the issue a bit more, and have attached the .obs columns as table_metadata.csv, the cell_labels element instances as cell_labels.csv, and the cell_boundaries index as cells_boundaries.csv
What you'll see is that the cell_id (from Xenium) is not preserved in the cell_labels, however it is present in the cell_boundaries - based on my review of the code you referenced above it should be converting the hashed Xenium cell_id to that alpha string? Maybe I'm misunderstanding?
Actually - I think I may have found the issue - in line 228, `cell_labels_indices_mapping` is defined, and below there is a conditional test to see if the mapping matches the `cell_ids`, but then the mapping (AFAIK) is not used again...
NVM - I see - the cell_labels is meant to just be an int - and presumably I need to use the mapping between cell_labels and cell_id in table.obs to map between the two?
Exactly. cell_labels matches the integer values for the pixels in the labels element. Instead cell_id is the index of the cells that is used to compute the hex representation. Please let me know if with this information the problem is still open or if it was due to the ambiguity now explained.
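To make the distinction concrete, here is a minimal sketch of how an integer index relates to the string form of cell_id. It assumes the cell ID format as I understand it from the 10x Genomics documentation: a 32-bit integer written as 8 hex digits, with the digits 0-9a-f shifted to a-p, followed by `-` and a dataset suffix. The function names and the toy rows are hypothetical and not taken from the reader; the column names `cell_id` and `cell_labels` are assumptions about `table.obs`.

```python
HEX_DIGITS = "0123456789abcdef"
SHIFTED = "abcdefghijklmnop"
_TO_SHIFTED = str.maketrans(HEX_DIGITS, SHIFTED)
_FROM_SHIFTED = str.maketrans(SHIFTED, HEX_DIGITS)


def encode_cell_id(prefix: int, dataset_suffix: int = 1) -> str:
    """Turn an integer prefix into the string form, e.g. 1 -> 'aaaaaaab-1'."""
    if not 0 <= prefix < 2**32:
        raise ValueError("prefix must fit in an unsigned 32-bit integer")
    # 8 zero-padded hex digits, then shift each digit into the a-p alphabet.
    return format(prefix, "08x").translate(_TO_SHIFTED) + f"-{dataset_suffix}"


def decode_cell_id(cell_id: str) -> tuple[int, int]:
    """Inverse of encode_cell_id: return (prefix, dataset_suffix)."""
    shifted, suffix = cell_id.split("-")
    return int(shifted.translate(_FROM_SHIFTED), 16), int(suffix)


# Toy stand-in for the relevant table.obs columns (values are illustrative only).
obs_rows = [
    {"cell_id": encode_cell_id(7), "cell_labels": 1},
    {"cell_id": encode_cell_id(255), "cell_labels": 2},
]

# Lookup from the integer pixel value in the labels element to the cell_id string.
label_to_cell_id = {row["cell_labels"]: row["cell_id"] for row in obs_rows}
```

With a lookup like `label_to_cell_id`, a pixel value read from the labels element can be translated to the cell_id used as the index of cell_boundaries.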
- [ ] We also need to parse the `morphology.ome.tif` image and check that all is good with the Z-stack. Example dataset: https://www.10xgenomics.com/datasets/xenium-prime-ffpe-human-ovarian-cancer (3.0.0). Reported by @BioinfoTongLI
Just some supplementary info: `morphology.ome.tif` is a multi-Z TIFF file with dimension order Z, Y, X.
It contains the DAPI channel only, no other channels.