CosMx input file format requirements

Open Jwong684 opened this issue 2 months ago • 11 comments

Hi Sopa development team,

Sopa looks like a perfect fit for what I've been trying to do: integrating IF stains with Baysor re-segmentation. However, I've been having trouble figuring out how to run it on my data, since the newest CosMx output (including all the column names) is quite different from the example data on their website. I'm not sure which columns and root file names sopa is looking for. Is there any documentation on this?

I am currently trying to run your Snakefile. From your FAQ, I see that data_path must be a directory containing three components: (i) the transcript file (ending with _tx_file.csv or _tx_file.csv.gz), (ii) the FOV locations file, and (iii) a Morphology2D directory containing the images.

I presumed that these files were based on this CosMx README: https://nanostring-public-share.s3.us-west-2.amazonaws.com/SMI-Compressed/SMI-ReadMe.html

The Morphology2D directory has files in this format:

20240215_023634_S3_C902_P99_N99_F001.TIF
20240215_023634_S3_C902_P99_N99_F002.TIF

for FOV001 and FOV002
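
To make the naming explicit, here is a minimal sketch of how the FOV number could be read from these file names. It assumes the trailing `_F<digits>.TIF` token is what encodes the FOV; that is my reading of the file names, not something confirmed by the sopa documentation, and `fov_from_filename` is a hypothetical helper.

```python
import re

# Assumption: the trailing "_F<digits>.TIF" token encodes the FOV number.
FOV_PATTERN = re.compile(r"_F(\d+)\.TIF$", re.IGNORECASE)


def fov_from_filename(name: str) -> int:
    """Return the FOV number encoded at the end of a Morphology2D file name."""
    match = FOV_PATTERN.search(name)
    if match is None:
        raise ValueError(f"no FOV token found in {name!r}")
    return int(match.group(1))


names = [
    "20240215_023634_S3_C902_P99_N99_F001.TIF",
    "20240215_023634_S3_C902_P99_N99_F002.TIF",
]
fovs = [fov_from_filename(n) for n in names]
print(fovs)
```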

The transcripts file looks like this:

,CellComp,CellId,Spot1_count,Spot2_count,Spot3_count,Spot4_count,codeclass,fov,multicolor_spots_per_feature,possible_BC_count,random_call_probability,seed_x,seed_y,spots_per_feature,target,target_call_observations,target_count_per_feature,target_idx,x,y,z
0,Cytoplasm,1,1,1,1,1,Endogenous,1,0,1,0.0025906,40612,14474,4,CD86,4,1,935,4061.07,1447.53,0
1,Nuclear,3,1,1,1,1,SystemControl,1,1,2,0.00517446,41620,14490,4,SystemControl100,4,1,537,4162.15,1449.18,0
2,Nuclear,2,1,2,1,2,Endogenous,1,0,1,0.0025906,42113,14517,4,RPL37,6,1,937,4211.38,1451.83,0

I'm not entirely sure what the FOV locations file is supposed to look like. My data has x and y coordinates for each transcript and for each cell, but sopa seems to expect a single X and Y coordinate per FOV, if I'm understanding it right.

I get errors like this:

rule to_spatialdata:
    input: sopa/data/fov001
    output: sopa/data/fov001.zarr/.zgroup
    jobid: 4
    reason: Missing output files: sopa/data/fov001.zarr/.zgroup
    resources: tmpdir=/tmp, mem_mb=128000, mem_mib=122071

Activating conda environment: ../../../envs/env_sopa
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│conda_environments/radian/lib/python3.10/site-packages/pandas/core/inde │
│ xes/base.py:3805 in get_loc                                                                      │
│                                                                                                  │
│   3802 │   │   """                                                                               │
│   3803 │   │   casted_key = self._maybe_cast_indexer(key)                                        │
│   3804 │   │   try:                                                                              │
│ ❱ 3805 │   │   │   return self._engine.get_loc(casted_key)                                       │
│   3806 │   │   except KeyError as err:                                                           │
│   3807 │   │   │   if isinstance(casted_key, slice) or (                                         │
│   3808 │   │   │   │   isinstance(casted_key, abc.Iterable)                                      │
│                                                                                                  │
│ ╭───────────────────────────────────────── locals ─────────────────────────────────────────╮     │
│ │ casted_key = 'X_mm'                                                                      │     │
│ │        key = 'X_mm'                                                                      │     │
│ │       self = Index(['Unnamed: 0', 'CellId', 'Spot1_count', 'Spot2_count', 'Spot3_count', │     │
│ │              │      'Spot4_count', 'codeclass', 'fov', 'multicolor_spots_per_feature',   │     │
│ │              │      'possible_BC_count', 'random_call_probability', 'seed_x', 'seed_y',  │     │
│ │              │      'spots_per_feature', 'target', 'target_call_observations',           │     │
│ │              │      'target_count_per_feature', 'target_idx', 'x', 'y', 'z'],            │     │
│ │              │     dtype='object')                                                       │     │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────╯     │
│                                                                                                  │
│ in pandas._libs.index.IndexEngine.get_loc:167                                                    │
│                                                                                                  │
│ in pandas._libs.index.IndexEngine.get_loc:196                                                    │
│                                                                                                  │
│ in pandas._libs.hashtable.PyObjectHashTable.get_item:7081                                        │
│                                                                                                  │
│ in pandas._libs.hashtable.PyObjectHashTable.get_item:7089                                        │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: 'X_mm'

In this case, X_mm isn't a column header in the sample data or in the newest CosMx output.
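
For reference, this is roughly how I checked the header. It is a minimal sketch using only the standard library; `missing_columns` is a hypothetical helper, and the only column name taken from sopa is `X_mm`, which comes from the traceback above. I don't know the full list of columns sopa expects.

```python
import csv
import io


def missing_columns(handle, required):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(handle))
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Header line copied from the transcript file shown above.
header_line = (
    ",CellComp,CellId,Spot1_count,Spot2_count,Spot3_count,Spot4_count,"
    "codeclass,fov,multicolor_spots_per_feature,possible_BC_count,"
    "random_call_probability,seed_x,seed_y,spots_per_feature,target,"
    "target_call_observations,target_count_per_feature,target_idx,x,y,z\n"
)

# "X_mm" is the key from the KeyError; "x" and "y" are included for contrast.
missing = missing_columns(io.StringIO(header_line), ["X_mm", "x", "y"])
print(missing)
```

For a file on disk, the same function can be given `open(path, newline="")`, or `gzip.open(path, "rt", newline="")` for a `.csv.gz` file.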

Thank you for your help!

Jwong684, May 13 '24 01:05