spatialdata-io
Stereoseq expected directory structure
Hi Team,
Would it be possible to document the expected directory structure for the stereoseq reader? The results we got from the stereoseq team don't follow the directory structure being assumed by this implementation. So, if it's properly documented, like the folder names to expect and which files should be put in which folder, then we can manually restructure our directories to conform to this stereoseq reader implementation.
As an example, this is the results folder we receive from the Stereoseq team:
Thanks a bunch.
Thanks @aadimator for reporting this. @LLehner could you please have a look into this?
- [ ] Precisely, I would add a line in the docstring of the `stereoseq` function specifying which `stereoseq` data version we expect, and add a link to the technical document from the STOmics website that specifies the file structure.
@aadimator which version of the data is the screenshot referring to?
This comment (https://github.com/scverse/spatialdata-io/pull/70#issuecomment-1658529103) is from July 31st, 2023, so I believe the reader was designed for format 7.0.0 (https://github.com/STOmics/SAW/tree/0808e44619f84b67d44c063b2fd24762f6633051/Documents/FileFormat) and that the latest 7.1.1 (or even 7.1.0) is not supported.
We got this as is from the Stereoseq team, and I don't think it follows any particular format. I think I'll have to manually rename the files and place them into their expected directories. I'll try to follow the SAW 7.0.0 format for now.
SAW 8.0 has a new output directory structure, according to this manual: https://en.stomics.tech/service/new-saw-operation-manual.html
Thank you for the comment. For the moment I will restrict the reader to 7.x, or document that it operates only on 7.x. Unfortunately we don't have the bandwidth to support the latest version at the moment. But a community contribution is welcome, and we would be happy to review the code in that case.
Todo for us:
- [ ] restrict or document that the `stereoseq` reader only works for 7.x data.
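As a rough illustration of what "restricting to 7.x" could look like, here is a minimal sketch of a version guard. The function names and the idea of passing the SAW version as a string are assumptions made for illustration; this is not existing `spatialdata-io` code, and how the version would actually be obtained from the data is left open.

```python
import re

# Assumption: the reader targets SAW 7.x output, as discussed in this thread.
SUPPORTED_MAJOR = 7


def parse_saw_version(version: str) -> tuple:
    """Parse a version string such as '7.1.1' or 'v8.0' into a tuple of ints."""
    match = re.fullmatch(r"v?(\d+(?:\.\d+)*)", version.strip())
    if match is None:
        raise ValueError(f"Unrecognized SAW version string: {version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def ensure_supported_saw_version(version: str) -> None:
    """Hypothetical guard: raise if the data was not produced by SAW 7.x."""
    major = parse_saw_version(version)[0]
    if major != SUPPORTED_MAJOR:
        raise NotImplementedError(
            f"SAW {version} output is not supported; "
            f"only {SUPPORTED_MAJOR}.x data can be read."
        )
```

Failing early with a clear message would at least tell users that the directory layout, not their data, is the problem.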
Hi Luca, I tried https://github.com/brainfo/spatialdata-io/blob/main/src/spatialdata_io/readers/stereoseq.py
This works with the "folder structure" from SAW v8; see also the duplicate issue #322.
Side note 1: The datasets from the STOmics website do not come in a folder structure; all files are in one directory (https://en.stomics.tech/col1357/index). To test on the output folder structure from SAW v8, we could create such a structure and put the data from the website into it. For now I tested on in-house data, shown below with the real ID replaced by {sample_id}.
Side note 2: In practice, for collaboration projects we only transfer the necessary data, not the entire folder structure that @z-spider copied. The unnecessary folders and files are: `bam/`; `feature_expression/{sample_id}.raw.gef`; `feature_expression/*.txt`; `analysis/*.marker_features.csv`; `visualization.tar.gz*`; `{sample_id}.report.html`. It doesn't hurt to keep them, i.e., to leave the `outs/` folder intact with all files. Below is a minimal example:
outs
├── analysis *optional when load_analysis=False
│ ├── {sample_id}.bin20_1.0.h5ad
│ └── {sample_id}.bin50_1.0.h5ad
├── feature_expression
│ └── {sample_id}.tissue.gef *required
└── image *optional
  ├── {sample_id}_HE_regist.tif
  └── {sample_id}_HE_tissue_cut.tif
4 directories, 5 files
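Before calling the reader, the minimal layout above can be sanity-checked. The following is only a sketch: `check_saw8_outs` is a hypothetical helper, not part of `spatialdata-io`, and the file patterns are taken solely from the tree shown above.

```python
from pathlib import Path


def _find(directory: Path, pattern: str) -> list:
    """Return sorted matches for pattern, or an empty list if the folder is absent."""
    if not directory.is_dir():
        return []
    return sorted(directory.glob(pattern))


def check_saw8_outs(outs, load_analysis: bool = True) -> list:
    """Return a list of problems; an empty list means the layout looks usable.

    Assumed rules, based on the minimal example in this comment:
    - feature_expression/{sample_id}.tissue.gef is required
    - analysis/*.h5ad is only needed when load_analysis=True
    - image/ is optional and therefore not checked
    """
    outs = Path(outs)
    problems = []
    if not _find(outs / "feature_expression", "*.tissue.gef"):
        problems.append("missing feature_expression/{sample_id}.tissue.gef")
    if load_analysis and not _find(outs / "analysis", "*.h5ad"):
        problems.append("load_analysis=True but no analysis/*.h5ad found")
    return problems
```

For example, `check_saw8_outs("path/to/outs", load_analysis=False)` should return an empty list for the tree above even if the `analysis` folder was not transferred.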
from spatialdata_io import stereoseq_v8
sdata = stereoseq_v8('path/to/outs') # outs/ in the saw8 output directory
Then I got the following `sdata`:
SpatialData object
├── Images
│ ├── '{sample_id}_HE_regist': DataTree[cyx] (3, 23520, 23520), (3, 11760, 11760), (3, 5880, 5880), (3, 2940, 2940), (3, 1470, 1470)
│ └── '{sample_id}_HE_tissue_cut': DataTree[cyx] (1, 23520, 23520), (1, 11760, 11760), (1, 5880, 5880), (1, 2940, 2940), (1, 1470, 1470)
├── Points
│ ├── 'analysis_bin20_points': DataFrame with shape: (<Delayed>, 2) (2D points)
│ ├── 'analysis_bin50_points': DataFrame with shape: (<Delayed>, 2) (2D points)
│ ├── 'bin1_genes': DataFrame with shape: (<Delayed>, 2) (2D points)
│ ├── 'bin5_genes': DataFrame with shape: (<Delayed>, 2) (2D points)
│ ├── 'bin10_genes': DataFrame with shape: (<Delayed>, 2) (2D points)
│ ├── 'bin20_genes': DataFrame with shape: (<Delayed>, 2) (2D points)
│ ├── 'bin50_genes': DataFrame with shape: (<Delayed>, 2) (2D points)
│ ├── 'bin100_genes': DataFrame with shape: (<Delayed>, 2) (2D points)
│ ├── 'bin150_genes': DataFrame with shape: (<Delayed>, 2) (2D points)
│ └── 'bin200_genes': DataFrame with shape: (<Delayed>, 2) (2D points)
└── Tables
  ├── 'analysis_bin20': AnnData (199772, 28999)
  ├── 'analysis_bin50': AnnData (33366, 28999)
  ├── 'bin1_table': AnnData (2636982, 28999)
  ├── 'bin5_table': AnnData (1496312, 28999)
  ├── 'bin10_table': AnnData (652510, 28999)
  ├── 'bin20_table': AnnData (199772, 28999)
  ├── 'bin50_table': AnnData (33366, 28999)
  ├── 'bin100_table': AnnData (8575, 28999)
  ├── 'bin150_table': AnnData (3888, 28999)
  └── 'bin200_table': AnnData (2241, 28999)
with coordinate systems:
▸ 'global', with elements:
{sample_id}_HE_regist (Images), {sample_id}_HE_tissue_cut (Images), analysis_bin20_points (Points), analysis_bin50_points (Points), bin1_genes (Points), bin5_genes (Points), bin10_genes (Points), bin20_genes (Points), bin50_genes (Points), bin100_genes (Points), bin150_genes (Points), bin200_genes (Points)
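Since the object above holds one table per bin size, it can be handy to list the available bin sizes from the element names. This is a small sketch that relies only on the `bin{N}_table` naming visible in the output above; `available_bin_sizes` is a hypothetical helper, not a `spatialdata` or `spatialdata-io` function.

```python
import re


def available_bin_sizes(table_names) -> list:
    """Collect the bin sizes N from names of the form 'bin{N}_table'."""
    sizes = []
    for name in table_names:
        match = re.fullmatch(r"bin(\d+)_table", name)
        if match:  # names such as 'analysis_bin20' are skipped on purpose
            sizes.append(int(match.group(1)))
    return sorted(sizes)
```

With the object above, one would presumably call `available_bin_sizes(sdata.tables.keys())` and then pick a table with `sdata.tables[f"bin{size}_table"]`; this assumes the tables are accessible by name through `sdata.tables`.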
Hi, great to hear that you found a workaround. Thanks for sharing!