spatialdata-io
How to load MERSCOPE data missing the cell_boundaries.parquet file?
Hi there,
Is there a way to load an older version of MERSCOPE data, which stores the cell boundaries as several HDF5 files rather than a single parquet file? I'm hoping to test the API on the publicly available MERSCOPE data released by Vizgen, which stores the cell boundaries as HDF5 files.
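For reference, this is roughly what one of those files looks like when inspected with h5py (a minimal sketch; the file name is just an example, and the group layout shown is what the public Vizgen data appears to use, so other releases may differ):

```python
# Quick look at one of the old-format boundary files (file name is an example).
# Assumed layout: featuredata/<cell_id>/zIndex_<n>/p_0/coordinates
import h5py

with h5py.File("cell_boundaries/feature_data_0.hdf5", "r") as f:
    for cell_id in list(f["featuredata"])[:3]:  # first few cell IDs
        coords = f["featuredata"][cell_id]["zIndex_3"]["p_0"]["coordinates"][()]
        print(cell_id, coords.shape)
```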
Thanks!
Hi, unfortunately our readers don't support the old file format. We are currently working on adding support for other technologies (seqFISH, Stereo-seq, MCMICRO output); we could maybe explore the old MERSCOPE format after that.
In the meantime, if you would like to make a contribution on this it would be very welcome, and we would be happy to provide feedback/support on a PR.
No worries @LucaMarconato. I'll be keen to make a PR for that. Let me have a poke around the codebase and see what I can do.
Thanks a lot! Feel free to reach out with any questions, also via chat if that's more convenient: https://imagesc.zulipchat.com/#narrow/stream/329057-scverse
Hi, below is a function to convert old MERSCOPE data to the new format (at least enough to open things in SpatialData). I don't have a new-version dataset at hand, so things might be slightly different. @LucaMarconato I don't think this belongs in the MERSCOPE io, and I'm pretty sure it's hacky and not performant, but you can share it wherever you want.
```python
import os

import geopandas as gpd
import h5py
import pandas as pd
from shapely import MultiPolygon
from shapely.geometry import Polygon
from tqdm import tqdm


def read_boundary_hdf5(folder):
    all_boundaries = {}
    boundaries = None
    for i in tqdm(os.listdir(folder + '/cell_boundaries/')):
        with h5py.File(folder + '/cell_boundaries/' + i, "r") as f:
            for key in f['featuredata'].keys():
                # it doesn't matter which zIndex we use; MultiPolygon so it works with the read function in spatialdata-io
                if boundaries is not None:
                    boundaries.loc[key] = MultiPolygon([Polygon(f['featuredata'][key]['zIndex_3']['p_0']['coordinates'][()][0])])
                else:
                    boundaries = gpd.GeoDataFrame(index=[key], geometry=MultiPolygon([Polygon(f['featuredata'][key]['zIndex_3']['p_0']['coordinates'][()][0])]))
        all_boundaries[i] = boundaries
        boundaries = None

    all_concat = pd.concat(list(all_boundaries.values()))
    all_concat = all_concat[~all_concat.index.duplicated(keep='first')]  # the hdf5 files can contain duplicates with the same cell_id and position, remove those
    all_concat.rename_geometry('geometry_renamed', inplace=True)  # renaming to make it compatible with spatialdata-io
    all_concat["EntityID"] = all_concat.index  # adding to make it compatible with spatialdata-io
    all_concat['ZIndex'] = 0  # adding to make it compatible with spatialdata-io
    all_concat.to_parquet(folder + '/cell_boundaries.parquet')

    count_path = f"{folder}/cell_by_gene.csv"
    obs_path = f"{folder}/cell_metadata.csv"
    data = pd.read_csv(count_path, index_col=0)
    obs = pd.read_csv(obs_path, index_col=0)
    data.index = obs.index.astype(str)  # the data index in the old software is range(n_obs)
    data.index.name = "cell"  # renaming to make it compatible with spatialdata-io
    obs.index.name = 'EntityID'  # renaming to make it compatible with spatialdata-io
    data.to_csv(count_path)
    obs.to_csv(obs_path)
```
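For completeness, this is how I would expect the converted folder to be used afterwards; a minimal sketch, assuming the regular `merscope` reader from spatialdata-io and a hypothetical path:

```python
from spatialdata_io import merscope

folder = "/path/to/old_merscope_output"  # hypothetical path

# rewrite the boundaries/tables in place using the function above
read_boundary_hdf5(folder)

# the folder should then open with the standard MERSCOPE reader
sdata = merscope(folder)
print(sdata)
```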
Thanks a lot @canergen for sharing the code!
Thanks a lot for the code! I did have to adapt one line to make it work for me:
#boundaries = gpd.GeoDataFrame(index=[key], geometry=MultiPolygon([Polygon(f['featuredata'][key]['zIndex_3']['p_0']['coordinates'][()][0])]))
boundaries = gpd.GeoDataFrame(geometry=gpd.GeoSeries(MultiPolygon([Polygon(f['featuredata'][key]['zIndex_3']['p_0']['coordinates'][()][0])]), index=[key]))
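In case it helps others hitting the same error: my guess is that with shapely 2.x a MultiPolygon is no longer iterable, so recent geopandas versions can't expand a bare geometry passed to `geometry=`. Wrapping it in a `GeoSeries` (as above) or in a one-element list should both work; a sketch of the list variant (untested on my side):

```python
# untested alternative: pass a one-element list of geometries instead of a GeoSeries
boundaries = gpd.GeoDataFrame(
    index=[key],
    geometry=[MultiPolygon([Polygon(f['featuredata'][key]['zIndex_3']['p_0']['coordinates'][()][0])])],
)
```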