[BUG] Error preprocessing sample.save

Open ccruizm opened this issue 6 months ago • 3 comments

Describe the bug: I am trying to run Segger on CosMx data (6k genes). I created the nuclei masks and exported the tx_file in .parquet format, then followed the preprocessing script (https://github.com/EliHei2/segger_dev/blob/main/scripts/create_data_cosmx.py) to prepare the data before training. When I tried to save the object, I got the following error:

/hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/torch_geometric/transforms/random_link_split.py:245: UserWarning: There are not enough negative edges to satisfy the provided sampling ratio. The ratio will be adjusted to 0.43.
  warnings.warn(
/hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/torch_geometric/transforms/random_link_split.py:245: UserWarning: There are not enough negative edges to satisfy the provided sampling ratio. The ratio will be adjusted to 0.64.
  warnings.warn(
/hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/torch_geometric/transforms/random_link_split.py:245: UserWarning: There are not enough negative edges to satisfy the provided sampling ratio. The ratio will be adjusted to 0.68.
  warnings.warn(
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 7
      1 # Parameters:
      2 # - k_bd/dist_bd: Control nucleus boundary point connections
      3 # - k_tx/dist_tx: Control transcript neighborhood connections
      4 # - tile_width/height: Size of spatial tiles for processing
      5 # - neg_sampling_ratio: Ratio of negative to positive samples
      6 # - val_prob: Fraction of data for validation
----> 7 sample.save(
      8     data_dir=SEGGER_DATA_DIR,
      9     k_bd=3,  # Number of boundary points to connect
     10     dist_bd=15,  # Maximum distance for boundary connections
     11     k_tx=10,  # Use calculated optimal transcript neighbors
     12     dist_tx=10,  # Use calculated optimal search radius
     13     tile_width=200,  # Tile size for processing,
     14     tile_height=200,  # Tile size for processing
     15     neg_sampling_ratio=10.0,  # 10:1 negative:positive samples
     16     frac=1.0,  # Use all data
     17     val_prob=0.3,  # 30% validation set
     18     test_prob=0,  # No test set
     19 )

File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/segger/data/parquet/sample.py:471, in STSampleParquet.save(self, data_dir, k_bd, dist_bd, k_tx, dist_tx, tile_size, tile_width, tile_height, neg_sampling_ratio, frac, val_prob, test_prob)
    469 outs = []
    470 for region in regions:
--> 471     outs.append(func(region))
    472 return outs

File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/segger/data/parquet/sample.py:453, in STSampleParquet.save.<locals>.func(region)
    448 data_type = np.random.choice(
    449     a=["train_tiles", "test_tiles", "val_tiles"],
    450     p=[1 - (test_prob + val_prob), test_prob, val_prob],
    451 )
    452 xt = STTile(dataset=xm, extents=tile)
--> 453 pyg_data = xt.to_pyg_dataset(
    454     k_bd=k_bd,
    455     dist_bd=dist_bd,
    456     k_tx=k_tx,
    457     dist_tx=dist_tx,
    458     neg_sampling_ratio=neg_sampling_ratio,
    459 )
    460 if pyg_data is not None:
    461     if pyg_data["tx", "belongs", "bd"].edge_index.numel() == 0:
    462         # this tile is only for testing

File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/segger/data/parquet/sample.py:1238, in STTile.to_pyg_dataset(self, neg_sampling_ratio, k_bd, dist_bd, k_tx, dist_tx, area, convexity, elongation, circularity)
   1235     polygons = gpd.GeoSeries(self.boundaries[geometry_column], index=self.boundaries.index)
   1236 else:
   1237     # Fallback: compute polygons
-> 1238     polygons = utils.get_polygons_from_xy(
   1239         self.boundaries,
   1240         x=self.settings.boundaries.x,
   1241         y=self.settings.boundaries.y,
   1242         label=self.settings.boundaries.label,
   1243         scale_factor=self.settings.boundaries.scale_factor,
   1244     )
   1246 # Ensure self.boundaries is a GeoDataFrame with correct geometry
   1247 self.boundaries = gpd.GeoDataFrame(self.boundaries.copy(), geometry=polygons)

File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/segger/data/parquet/_utils.py:189, in get_polygons_from_xy(boundaries, x, y, label, scale_factor)
    186 part_offset = np.arange(len(np.unique(ids)) + 1)
    188 # Convert to GeoSeries of polygons
--> 189 polygons = shapely.from_ragged_array(
    190     shapely.GeometryType.POLYGON,
    191     coords=boundaries[[x, y]].values.copy(order="C"),
    192     offsets=(geometry_offset, part_offset),
    193 )
    194 gs = gpd.GeoSeries(polygons, index=np.unique(ids))
    196 # print(gs)

File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/shapely/_ragged_array.py:467, in from_ragged_array(geometry_type, coords, offsets)
    465     return _linestring_from_flatcoords(coords, *offsets)
    466 elif geometry_type == GeometryType.POLYGON:
--> 467     return _polygon_from_flatcoords(coords, *offsets)
    468 elif geometry_type == GeometryType.MULTIPOINT:
    469     return _multipoint_from_flatcoords(coords, *offsets)

File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/shapely/_ragged_array.py:400, in _polygon_from_flatcoords(coords, offsets1, offsets2)
    397 offsets2 = np.asarray(offsets2, dtype="int64")
    399 # recreate polygons
--> 400 result = _from_ragged_array_multi_linear(
    401     coords, offsets1, offsets2, geometry_type=GeometryType.POLYGON
    402 )
    403 return result

File shapely/_geometry_helpers.pyx:511, in shapely._geometry_helpers._from_ragged_array_multi_linear()

File shapely/_geometry_helpers.pyx:537, in shapely._geometry_helpers._from_ragged_array_multi_linear()

File shapely/_geometry_helpers.pyx:123, in shapely._geometry_helpers._create_simple_geometry_raise_error()

ValueError: A linearring requires at least 4 coordinates.

Expected behavior: The tiled data should be saved into the test, train, and validation folders for training. When I previously read CosMx data into a spatialdata object and saved the polygon data, incomplete polygons broke the code, and skipping them fixed the problem. Is this a similar issue, or something else?
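To illustrate the "skip incomplete polygons" idea, here is a minimal sketch, not segger's own code. It assumes the boundary data can be viewed as rows of (label, x, y) vertices, and that a boundary with fewer than `min_vertices` rows is what triggers "A linearring requires at least 4 coordinates". The function name `drop_short_boundaries`, the row layout, and the default threshold of 4 are all assumptions for illustration; if the rings are stored unclosed, 3 may be the right threshold.

```python
from collections import Counter


def drop_short_boundaries(rows, min_vertices=4):
    """Split boundary vertex rows into kept rows and dropped labels.

    rows: iterable of (label, x, y) tuples, one per boundary vertex.
    min_vertices: assumed minimum number of vertex rows per boundary;
        4 matches a closed ring (3 distinct points plus the repeated
        first point). This threshold is an assumption, not segger's rule.
    """
    rows = list(rows)
    # Count how many vertex rows each boundary label has.
    counts = Counter(label for label, _, _ in rows)
    # Keep only vertices whose boundary has enough rows to form a ring.
    kept = [row for row in rows if counts[row[0]] >= min_vertices]
    dropped = sorted(label for label, n in counts.items() if n < min_vertices)
    return kept, dropped


# Hypothetical example data: cell_a is a closed triangle, cell_b is incomplete.
rows = [
    ("cell_a", 0.0, 0.0),
    ("cell_a", 1.0, 0.0),
    ("cell_a", 0.0, 1.0),
    ("cell_a", 0.0, 0.0),
    ("cell_b", 5.0, 5.0),
    ("cell_b", 6.0, 5.0),
]
kept, dropped = drop_short_boundaries(rows)
print(f"kept {len(kept)} vertices, dropped boundaries: {dropped}")
```

If the boundaries are held in a pandas DataFrame, the same filter could be applied to the vertex table before calling `sample.save`, for example by comparing a per-label group size against the threshold.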

Environment (please complete the following information):

  • OS: Linux (HPC)
  • Python version: 3.11.13
  • Package version: 0.1.0

ccruizm, Jun 25 '25 19:06

Hi @ccruizm, thanks for reporting the issue. We are still experimenting with CosMx, so the branch is not stable yet and is subject to many changes in the coming weeks. In the meantime, we suggest you follow this notebook to preprocess the data into a format that is compatible with segger: https://github.com/EliHei2/segger_dev/blob/generic_config/platform_guides/platform_preparation_guide.ipynb

Pinging @andrewmoorman to follow up on this thread.

EliHei2, Jun 26 '25 09:06

Hello @EliHei2, thanks for the speedy reply. I used the script suggested in the main branch to generate the nuclei boundaries and exported the transcripts as .parquet, but I see the notebook you shared has some advice and suggestions I did not implement. Thanks for pointing me in the right direction! I will use this and let you know how it goes.

Great to hear that development for CosMx data is still ongoing! In the meantime, should I still install the main branch, or do you recommend another one?

ccruizm, Jun 26 '25 11:06

@ccruizm, sorry for the late response. You can find the notebooks on this branch: https://github.com/EliHei2/segger_dev/tree/generic_config. Please install this branch, or switch to it once you have cloned the repository.

EliHei2, Jul 04 '25 13:07