spatialdata-io icon indicating copy to clipboard operation
spatialdata-io copied to clipboard

How to load Merscope data missing cell_boundaries.parquet file?

Open ghar1821 opened this issue 2 years ago • 6 comments

Hi there,

Is there a way to load older version of merscope data which store the cell boundaries as several hdf5 files rather than a parquet file? I'm hoping to test the API on publicly available merscope data made available by Vizgen which stores the cell boundaries as hdf5 files.

Thanks!

ghar1821 avatar Jul 19 '23 09:07 ghar1821

Hi, unfortunately our readers don't support the old file format. We are currently working on adding the support for other techs (seqFISH, Stereo-seq, output of MCMICRO), we could maybe explore the old MERSCOPE format after that.

But meanwhile, if you would like to make a contribution on that it would be very welcomed, and we would be happy to provide feedback/support on a PR made on that.

LucaMarconato avatar Jul 20 '23 14:07 LucaMarconato

No worries @LucaMarconato. I'll be keen to make a PR for that. Let me have a poke around the codebase and see what I can do.

ghar1821 avatar Jul 20 '23 23:07 ghar1821

Thanks a lot! Feel free to reach out of any question, also via chat if more convenient https://imagesc.zulipchat.com/#narrow/stream/329057-scverse

LucaMarconato avatar Jul 21 '23 09:07 LucaMarconato

Hi, below is a function to convert old MERSCOPE data to new one (at least to open things in SpatialData. I have no new version in hand so it might be that things are slightly different. @LucaMarconato I don't think this should be in the io for MERSCOPE and I'm pretty sure it's hacky and not a performant way. You can share it wherever you want.

import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon
from shapely import MultiPolygon
import os
from tqdm import tqdm


def read_boundary_hdf5(folder):
    all_boundaries = {}
    boundaries = None
    for i in tqdm(os.listdir(folder + '/cell_boundaries/')):
        with h5py.File(folder + '/cell_boundaries/' + i, "r") as f:
            for key in f['featuredata'].keys():
                if boundaries is not None:
                    boundaries.loc[key] = MultiPolygon([Polygon(f['featuredata'][key]['zIndex_3']['p_0']['coordinates'][()][0])]) # doesn't matter which zIndex we use, MultiPolygon to work with read function in spatialdata-io
                else:
                    boundaries = gpd.GeoDataFrame(index=[key], geometry=MultiPolygon([Polygon(f['featuredata'][key]['zIndex_3']['p_0']['coordinates'][()][0])]))
            all_boundaries[i] = boundaries
            boundaries = None
    all_concat = pd.concat(list(all_boundaries.values()))
    all_concat = all_concat[~all_concat.index.duplicated(keep='first')] # hdf5 can contain duplicates with same cell_id and position, removing those
    all_concat.rename_geometry('geometry_renamed', inplace=True)  # renaming to make it compatible with spatialdata-io
    all_concat["EntityID"] = all_concat.index  # renaming to make it compatible with spatialdata-io
    all_concat['ZIndex'] = 0  # adding to make it compatible with spatialdata-io
    all_concat.to_parquet(folder + '/cell_boundaries.parquet')
    
    count_path = f"{folder}/cell_by_gene.csv"
    obs_path = f"{folder}/cell_metadata.csv"

    data = pd.read_csv(count_path, index_col=0)
    obs = pd.read_csv(obs_path, index_col=0)

    data.index = obs.index.astype(str) # data index in old software is range(n_obs)
    data.index.name = "cell" # renaming to make it compatible with spatialdata-io
    obs.index.name = 'EntityID' # renaming to make it compatible with spatialdata-io
    data.to_csv(count_path)
    obs.to_csv(obs_path)```

canergen avatar Oct 01 '23 07:10 canergen

Thanks a lot @canergen for sharing the code!

LucaMarconato avatar Oct 10 '23 10:10 LucaMarconato

Thanks a lot for the code, I did have to adapt one line to make it work for me:

#boundaries = gpd.GeoDataFrame(index=[key], geometry=MultiPolygon([Polygon(f['featuredata'][key]['zIndex_3']['p_0']['coordinates'][()][0])]))
 boundaries = gpd.GeoDataFrame(geometry=gpd.GeoSeries(MultiPolygon([Polygon(f['featuredata'][key]['zIndex_3']['p_0']['coordinates'][()][0])]),index=[key]))

lopollar avatar Nov 21 '23 17:11 lopollar