[BUG] Errors with Dataset Creation

Open PietroD opened this issue 5 months ago • 1 comments

Describe the bug: I am having a hard time getting segger to work, even at the data-loading step. I have installed the main branch.

When I follow the Introduction to Segger tutorial on https://elihei2.github.io/segger_dev, after running:

merscope_data_dir = Path('/beegfs/scratch/prj/Spatial/data/merscope/human_brain_1k')
segger_data_dir = Path('/beegfs/scratch/prj/Spatial/results/merscope/human_brain_1k/segger')

sample = STSampleParquet(
    base_dir=merscope_data_dir,
    n_workers=4,
    sample_type='merscope'
)

I get:

Traceback (most recent call last):
  File "/opt/common/tools/ric.iannacone/envs/segger-env/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    return self._engine.get_loc(casted_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas/_libs/index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7088, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7096, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'global_x'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/beegfs/scratch/prj/Spatial/benchmark_ist/code/segger/src/segger/data/parquet/sample.py", line 85, in __init__
    utils.ensure_transcript_ids(
  File "/beegfs/scratch/prj/Spatial/code/segger/src/segger/data/parquet/_utils.py", line 499, in ensure_transcript_ids
    df = add_transcript_ids(df, x_col, y_col, id_col, precision)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/beegfs/scratch/prj/Spatial/code/segger/src/segger/data/parquet/_utils.py", line 445, in add_transcript_ids
    x_coords = np.round(transcripts_df[x_col] * precision).astype(int).astype(str)
                        ~~~~~~~~~~~~~~^^^^^^^
  File "/opt/common/tools/envs/segger-env/lib/python3.11/site-packages/pandas/core/frame.py", line 4107, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/common/tools/envs/segger-env/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3819, in get_loc
    raise KeyError(key) from err
KeyError: 'global_x'

When I instead run the Merscope dataset creation, with:

from segger.data import MerscopeSample

First I get:

>>> from segger.data import MerscopeSample
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'MerscopeSample' from 'segger.data' (/beegfs/scratch/prj/Spatial/code/segger/src/segger/data/__init__.py)

I can work around this with:

from segger.data.io import MerscopeSample

But when I try:

# Create a MerscopeSample instance for spatial transcriptomics processing
merscope_sample = MerscopeSample()

# Load transcripts from a CSV file
merscope_sample.load_transcripts(
    base_path=merscope_data_dir,
    sample=sample_tag,
    transcripts_filename="detected_transcripts.csv",
    file_format="csv"
)

I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/beegfs/scratch/ric.iannacone/ric.iannacone/prj/Spatial/benchmark_ist/code/segger/src/segger/data/io.py", line 130, in load_transcripts
    raise ValueError("This version only supports parquet files with Dask.")
ValueError: This version only supports parquet files with Dask.


OS: Ubuntu 22.04.5 LTS
Python version: 3.11.13
Package version: segger_dev main branch

Jul 31 '25 10:07 PietroD

Dear @PietroD, thanks for reporting the bug. We have put together a generic platform config: https://github.com/EliHei2/segger_dev/blob/generic_config/platform_guides/platform_preparation_guide.ipynb

The idea is to have two files: 1) a transcripts.parquet file with the x, y, z coordinates, the initial cell IDs, and a flag indicating whether each transcript overlaps a nucleus/boundary; 2) a boundaries.parquet file that describes the boundary geometries. So, to run in nucleus mode, you first need to segment nuclei by running an imaging-based segmentation on the MERSCOPE outputs. See: https://github.com/EliHei2/segger_dev/tree/main/platform_guides/vizgen_merscope.
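Before handing a transcripts file to the loader, it may help to confirm that the columns it looks up are actually present, since a missing one surfaces as the `KeyError: 'global_x'` shown in the traceback. The following is only a minimal sketch using the standard library. The helper names `missing_columns` and `csv_header` are hypothetical, and the expected column list is an assumption: `global_x` comes from the traceback, and `global_y` is guessed by analogy and is not confirmed by segger's code.

```python
import csv
import io

# Hypothetical column list: the traceback shows the loader looking up
# 'global_x'; 'global_y' is assumed by analogy and is not confirmed.
EXPECTED_COLUMNS = ("global_x", "global_y")


def missing_columns(available, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent, in order."""
    present = set(available)
    return [name for name in expected if name not in present]


def csv_header(handle):
    """Read only the first row (the header) from an open CSV file."""
    return next(csv.reader(handle), [])


# Example with an in-memory file whose header lacks the global_* columns.
sample = io.StringIO("x,y,z,gene,cell_id\n1.0,2.0,0.0,ACTB,17\n")
print(missing_columns(csv_header(sample)))  # → ['global_x', 'global_y']
```

For a real file, `csv_header(open(path, newline=""))` would read the header of a CSV such as detected_transcripts.csv. For a parquet file, the column names would have to be read with a parquet library instead, and the result passed to `missing_columns` in the same way.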

We are aware this is a lot to decode, and we are currently improving the docs, APIs, and tech support. Pinging @andrewmoorman to keep him in the loop and to add anything I missed.

Jul 31 '25 10:07 EliHei2