[BUG] Errors with Dataset Creation
Describe the bug: I am having a hard time getting segger to work, even at the data-loading step. I have installed the main branch.
When I follow the Introduction to Segger tutorial on https://elihei2.github.io/segger_dev, after running:
merscope_data_dir = Path('/beegfs/scratch/prj/Spatial/data/merscope/human_brain_1k')
segger_data_dir = Path('/beegfs/scratch/prj/Spatial/results/merscope/human_brain_1k/segger')
sample = STSampleParquet(
    base_dir=merscope_data_dir,
    n_workers=4,
    sample_type='merscope'
)
I get:
Traceback (most recent call last):
File "/opt/common/tools/ric.iannacone/envs/segger-env/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
return self._engine.get_loc(casted_key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pandas/_libs/index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 7088, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 7096, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'global_x'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/beegfs/scratch/prj/Spatial/benchmark_ist/code/segger/src/segger/data/parquet/sample.py", line 85, in __init__
utils.ensure_transcript_ids(
File "/beegfs/scratch/prj/Spatial/code/segger/src/segger/data/parquet/_utils.py", line 499, in ensure_transcript_ids
df = add_transcript_ids(df, x_col, y_col, id_col, precision)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/beegfs/scratch/prj/Spatial/code/segger/src/segger/data/parquet/_utils.py", line 445, in add_transcript_ids
x_coords = np.round(transcripts_df[x_col] * precision).astype(int).astype(str)
~~~~~~~~~~~~~~^^^^^^^
File "/opt/common/tools/envs/segger-env/lib/python3.11/site-packages/pandas/core/frame.py", line 4107, in __getitem__
indexer = self.columns.get_loc(key)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/common/tools/envs/segger-env/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3819, in get_loc
raise KeyError(key) from err
KeyError: 'global_x'
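One way to narrow down a KeyError like this is to compare the column names the loader asks for with the columns the transcripts file actually contains. The following is only a diagnostic sketch under assumptions: `missing_columns` and `parquet_columns` are hypothetical helpers and not part of segger, `global_x` is taken from the traceback while `global_y` is a guess at its companion column, and reading the schema assumes `pyarrow` is installed.

```python
def missing_columns(available, required):
    """Return the names in `required` that are absent from `available`."""
    present = set(available)
    return [name for name in required if name not in present]


def parquet_columns(path):
    """List the column names of a Parquet file without loading its rows.

    Assumes pyarrow is installed; it is imported lazily so that
    `missing_columns` stays usable without it.
    """
    import pyarrow.parquet as pq
    return pq.read_schema(str(path)).names


# Hypothetical example: a file whose coordinates are stored under other names.
available = ["gene", "x", "y", "cell_id"]
# 'global_x' comes from the traceback; 'global_y' is an assumption.
required = ["global_x", "global_y"]
print(missing_columns(available, required))  # ['global_x', 'global_y']
```

On a real sample, something like `missing_columns(parquet_columns(path_to_transcripts), required)` would show which expected columns are absent, where `path_to_transcripts` stands for whichever Parquet file the loader reads.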
When I instead run the Merscope dataset creation, with:
from segger.data import MerscopeSample
First I get:
>>> from segger.data import MerscopeSample
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: cannot import name 'MerscopeSample' from 'segger.data' (/beegfs/scratch/prj/Spatial/code/segger/src/segger/data/__init__.py)
I can get around that with:
from segger.data.io import MerscopeSample
But when I try:
# Create a MerscopeSample instance for spatial transcriptomics processing
merscope_sample = MerscopeSample()
# Load transcripts from a CSV file
merscope_sample.load_transcripts(
    base_path=merscope_data_dir,
    sample=sample_tag,
    transcripts_filename="detected_transcripts.csv",
    file_format="csv"
)
I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/beegfs/scratch/ric.iannacone/ric.iannacone/prj/Spatial/benchmark_ist/code/segger/src/segger/data/io.py", line 130, in load_transcripts
raise ValueError("This version only supports parquet files with Dask.")
ValueError: This version only supports parquet files with Dask.
- OS: Ubuntu 22.04.5 LTS
- Python version: 3.11.13
- Package version: segger_dev main branch
Dear @PietroD, thanks for reporting the bug. We have put together a generic platform config: https://github.com/EliHei2/segger_dev/blob/generic_config/platform_guides/platform_preparation_guide.ipynb
The idea is to have two files: 1) a transcripts.parquet file with x, y, z, the initial cell IDs, and whether each transcript overlaps a nucleus/boundary; 2) a boundaries.parquet file that holds the boundary geometries. That is, in order to run in nucleus mode, you first need to segment nuclei using an image-based segmentation on the MERSCOPE outputs. See: https://github.com/EliHei2/segger_dev/tree/main/platform_guides/vizgen_merscope.
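To make the two-file layout a bit more concrete, here is a rough sketch of what one row of each file might carry, plus a check that every cell ID used by a transcript has a boundary. All field names (`overlaps_nucleus`, `overlaps_boundary`, `vertices`) and the helper `unmatched_cell_ids` are illustrative assumptions, not segger's actual schema or API; the linked guide remains the reference for the real column names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Transcript:
    """One row of the transcripts file (field names are assumptions)."""
    x: float
    y: float
    z: float
    cell_id: Optional[str]      # initial assignment; None if unassigned
    overlaps_nucleus: bool
    overlaps_boundary: bool


@dataclass
class Boundary:
    """One row of the boundaries file (field names are assumptions)."""
    cell_id: str
    vertices: List[Tuple[float, float]]  # polygon outline as (x, y) pairs


def unmatched_cell_ids(transcripts, boundaries):
    """Return sorted cell IDs used by transcripts but missing from boundaries."""
    known = {b.cell_id for b in boundaries}
    used = {t.cell_id for t in transcripts if t.cell_id is not None}
    return sorted(used - known)


transcripts = [
    Transcript(1.0, 2.0, 0.0, "cell_1", True, True),
    Transcript(5.0, 5.0, 0.0, "cell_2", False, True),
    Transcript(9.0, 9.0, 1.0, None, False, False),
]
boundaries = [Boundary("cell_1", [(0.0, 0.0), (3.0, 0.0), (3.0, 3.0), (0.0, 3.0)])]
print(unmatched_cell_ids(transcripts, boundaries))  # ['cell_2']
```

A check like this is only meant to catch mismatches between the two files before they are handed to the loader.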
We are aware this is too much to decode, and we're currently improving the docs/APIs/tech support. Pinging @andrewmoorman to be in the loop and add anything I missed.