[BUG] Error preprocessing sample.save
Describe the bug
I am trying to run Segger on CosMx data (6k genes). I have created the nuclei masks and exported the tx_file in .parquet format. I followed the preprocessing script (https://github.com/EliHei2/segger_dev/blob/main/scripts/create_data_cosmx.py) to prepare the data before training. When I tried to save the object, I got the following error:
/hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/torch_geometric/transforms/random_link_split.py:245: UserWarning: There are not enough negative edges to satisfy the provided sampling ratio. The ratio will be adjusted to 0.43.
warnings.warn(
/hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/torch_geometric/transforms/random_link_split.py:245: UserWarning: There are not enough negative edges to satisfy the provided sampling ratio. The ratio will be adjusted to 0.64.
warnings.warn(
/hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/torch_geometric/transforms/random_link_split.py:245: UserWarning: There are not enough negative edges to satisfy the provided sampling ratio. The ratio will be adjusted to 0.68.
warnings.warn(
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[7], line 7
1 # Parameters:
2 # - k_bd/dist_bd: Control nucleus boundary point connections
3 # - k_tx/dist_tx: Control transcript neighborhood connections
4 # - tile_width/height: Size of spatial tiles for processing
5 # - neg_sampling_ratio: Ratio of negative to positive samples
6 # - val_prob: Fraction of data for validation
----> 7 sample.save(
8 data_dir=SEGGER_DATA_DIR,
9 k_bd=3, # Number of boundary points to connect
10 dist_bd=15, # Maximum distance for boundary connections
11 k_tx=10, # Use calculated optimal transcript neighbors
12 dist_tx=10, # Use calculated optimal search radius
13 tile_width=200, # Tile size for processing,
14 tile_height=200, # Tile size for processing
15 neg_sampling_ratio=10.0, # 5:1 negative:positive samples
16 frac=1.0, # Use all data
17 val_prob=0.3, # 30% validation set
18 test_prob=0, # No test set
19 )
File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/segger/data/parquet/sample.py:471, in STSampleParquet.save(self, data_dir, k_bd, dist_bd, k_tx, dist_tx, tile_size, tile_width, tile_height, neg_sampling_ratio, frac, val_prob, test_prob)
469 outs = []
470 for region in regions:
--> 471 outs.append(func(region))
472 return outs
File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/segger/data/parquet/sample.py:453, in STSampleParquet.save.<locals>.func(region)
448 data_type = np.random.choice(
449 a=["train_tiles", "test_tiles", "val_tiles"],
450 p=[1 - (test_prob + val_prob), test_prob, val_prob],
451 )
452 xt = STTile(dataset=xm, extents=tile)
--> 453 pyg_data = xt.to_pyg_dataset(
454 k_bd=k_bd,
455 dist_bd=dist_bd,
456 k_tx=k_tx,
457 dist_tx=dist_tx,
458 neg_sampling_ratio=neg_sampling_ratio,
459 )
460 if pyg_data is not None:
461 if pyg_data["tx", "belongs", "bd"].edge_index.numel() == 0:
462 # this tile is only for testing
File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/segger/data/parquet/sample.py:1238, in STTile.to_pyg_dataset(self, neg_sampling_ratio, k_bd, dist_bd, k_tx, dist_tx, area, convexity, elongation, circularity)
1235 polygons = gpd.GeoSeries(self.boundaries[geometry_column], index=self.boundaries.index)
1236 else:
1237 # Fallback: compute polygons
-> 1238 polygons = utils.get_polygons_from_xy(
1239 self.boundaries,
1240 x=self.settings.boundaries.x,
1241 y=self.settings.boundaries.y,
1242 label=self.settings.boundaries.label,
1243 scale_factor=self.settings.boundaries.scale_factor,
1244 )
1246 # Ensure self.boundaries is a GeoDataFrame with correct geometry
1247 self.boundaries = gpd.GeoDataFrame(self.boundaries.copy(), geometry=polygons)
File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/segger/data/parquet/_utils.py:189, in get_polygons_from_xy(boundaries, x, y, label, scale_factor)
186 part_offset = np.arange(len(np.unique(ids)) + 1)
188 # Convert to GeoSeries of polygons
--> 189 polygons = shapely.from_ragged_array(
190 shapely.GeometryType.POLYGON,
191 coords=boundaries[[x, y]].values.copy(order="C"),
192 offsets=(geometry_offset, part_offset),
193 )
194 gs = gpd.GeoSeries(polygons, index=np.unique(ids))
196 # print(gs)
File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/shapely/_ragged_array.py:467, in from_ragged_array(geometry_type, coords, offsets)
465 return _linestring_from_flatcoords(coords, *offsets)
466 elif geometry_type == GeometryType.POLYGON:
--> 467 return _polygon_from_flatcoords(coords, *offsets)
468 elif geometry_type == GeometryType.MULTIPOINT:
469 return _multipoint_from_flatcoords(coords, *offsets)
File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/shapely/_ragged_array.py:400, in _polygon_from_flatcoords(coords, offsets1, offsets2)
397 offsets2 = np.asarray(offsets2, dtype="int64")
399 # recreate polygons
--> 400 result = _from_ragged_array_multi_linear(
401 coords, offsets1, offsets2, geometry_type=GeometryType.POLYGON
402 )
403 return result
File shapely/_geometry_helpers.pyx:511, in shapely._geometry_helpers._from_ragged_array_multi_linear()
File shapely/_geometry_helpers.pyx:537, in shapely._geometry_helpers._from_ragged_array_multi_linear()
File shapely/_geometry_helpers.pyx:123, in shapely._geometry_helpers._create_simple_geometry_raise_error()
ValueError: A linearring requires at least 4 coordinates.
Expected behavior
`sample.save` should write the tiled data into the test, train, and validation folders for the training step.
I hit a similar problem before when reading CosMx data into a spatialdata object and saving the polygon data: incomplete polygons broke the code, and skipping them fixed it. Is this the same issue, or something else?
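One way to check the incomplete-polygon hypothesis is to count how many vertex rows each nucleus boundary has before calling `sample.save`. The sketch below is only an illustration under assumptions: it takes the threshold of 4 coordinates from the shapely error message, assumes each boundary's vertex rows are handed to the ring constructor as-is, and uses a hypothetical helper `split_by_vertex_count` and hypothetical `(label, x, y)` rows rather than segger's actual boundary table.

```python
from collections import Counter

MIN_RING_COORDS = 4  # threshold taken from the shapely error message


def split_by_vertex_count(rows, min_coords=MIN_RING_COORDS):
    """Split (label, x, y) vertex rows into (kept_rows, dropped_labels).

    A label is dropped when it has fewer than `min_coords` vertex rows.
    Hypothetical helper for illustration; not part of segger or shapely.
    """
    counts = Counter(label for label, _, _ in rows)
    dropped = sorted(label for label, n in counts.items() if n < min_coords)
    dropped_set = set(dropped)
    kept = [row for row in rows if row[0] not in dropped_set]
    return kept, dropped


# Hypothetical boundary rows: nucleus "a" has 4 vertex rows (a closed ring),
# nucleus "b" has only 2 and would be too short to form a linear ring.
rows = [
    ("a", 0.0, 0.0), ("a", 1.0, 0.0), ("a", 1.0, 1.0), ("a", 0.0, 0.0),
    ("b", 5.0, 5.0), ("b", 6.0, 5.0),
]
kept, dropped = split_by_vertex_count(rows)
print(dropped)  # ['b']
```

If the boundaries live in a pandas DataFrame, the same filter could be written as `df[df.groupby(label_col)[label_col].transform("size") >= 4]`, where `label_col` stands for whichever column holds the nucleus ID in your file. If any labels are dropped, that would support the idea that undersized boundaries are what triggers the `ValueError`.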
Environment (please complete the following information):
- OS: Linux (HPC)
- Python version: 3.11.13
- Package version: 0.1.0
Hi @ccruizm, thanks for reporting the issue. We are still experimenting with CosMx, so the branch is not stable yet and is subject to many changes in the coming weeks. In the meantime, we suggest you follow this notebook to preprocess the data into a format compatible with segger: https://github.com/EliHei2/segger_dev/blob/generic_config/platform_guides/platform_preparation_guide.ipynb
Pinging @andrewmoorman to follow up on this thread.
Hello @EliHei2, thanks for the speedy reply. I used the script suggested in the main branch to generate the nuclei boundaries and exported the transcripts as .parquet, but I see the notebook you shared has some advice and suggestions I did not implement. Thanks for pointing me in the right direction! I will use it and let you know how it goes.
Great to hear that development for CosMx data is still ongoing! In the meantime, should I still install the main branch, or do you recommend another one?
@ccruizm sorry for the late response. You can find the notebooks on this branch: https://github.com/EliHei2/segger_dev/tree/generic_config. Please install this branch, or switch to it once you have cloned the repository.