proseg MERSCOPE Default Config

Hey @dcjones, thank you for this great tool. It works quite well on my MERSCOPE samples but I wanted to tweak it a bit for further optimisation. My first segmentation round was using the default --merscope preset. Would you mind listing the default options/arguments in the merscope preset so I know what settings to tweak?

Thank you!

Luc

Sep 17 '24 06:09 marsdenl

Hi Luc,

There's really no attempt at this point to tweak default parameters based on the platform. That may change in the future with more testing, but all --merscope does now is tell proseg what the input file looks like.

There are various things you can tweak, which I should try to document better. Probably the most impactful are:

Transcript repositioning parameters:
- --diffusion-probability: larger allows more transcripts to be repositioned
- --diffusion-sigma-far: larger allows transcripts to be moved further
Voxel size and sampling schedule:
- --initial-voxel-size: voxel size in micrometers
- --schedule: number of iterations, or a comma separated list of counts where voxel size is halved between rounds
Prior segmentation:
- --nuclear-reassignment-prob: lower to enforce conforming to the initial nuclear segmentation
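To make it easier to experiment with the options listed above, here is a minimal Python sketch that assembles a proseg command line. The flag names are the ones from the list; the helper name `build_proseg_command`, the argument order, and every numeric value are illustrative assumptions, not recommended settings or proseg defaults.

```python
import shlex


def build_proseg_command(transcripts_csv, diffusion_probability=None,
                         diffusion_sigma_far=None, initial_voxel_size=None,
                         schedule=None, nuclear_reassignment_prob=None):
    """Return an argv list for proseg; flags left as None are omitted."""
    # Assumption: the transcript table is passed as a positional argument.
    cmd = ["proseg", "--merscope", transcripts_csv]
    options = {
        "--diffusion-probability": diffusion_probability,
        "--diffusion-sigma-far": diffusion_sigma_far,
        "--initial-voxel-size": initial_voxel_size,
        "--schedule": schedule,
        "--nuclear-reassignment-prob": nuclear_reassignment_prob,
    }
    for flag, value in options.items():
        if value is None:
            continue
        # --schedule may be a comma-separated list of iteration counts.
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        cmd += [flag, str(value)]
    return cmd


# Placeholder values only, chosen to show the formatting.
cmd = build_proseg_command(
    "detected_transcripts.csv",
    diffusion_probability=0.3,
    schedule=[150, 150, 250],
    nuclear_reassignment_prob=0.1,
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to `subprocess.run(cmd, check=True)`. Changing one option at a time makes it easier to see which one affects the masks.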

Sep 17 '24 20:09 dcjones

That's already super useful to know, thanks for the information :)

Sep 18 '24 00:09 marsdenl

Hey @dcjones. Thanks again for listing the settings above. Do you expect them to change the size of the masks generated? I've been having the issue that a single cell (based on transcript density and DAPI stain) seems to be receiving multiple small masks instead of a single large one covering the entire diameter (see the 2 examples below):

[image: combined]

Do you have any recommendations for adjusting input settings to correct for that?

I was also wondering, for the perimeter bound, what value range is expected? I.e. how high can we go?

Thanks for your help :)

Luc

Sep 25 '24 04:09 marsdenl

Hi Luc,

These polygons look a little peculiar to me. I'm not sure what you're using to plot these, but it's doing some sort of simplification or modification of the proseg polygons. Proseg is voxel based, so unmodified polygons should only have straight edges perpendicular to the axes. It's possible the polygons you are getting are already better than they appear here.

Also make sure you are plotting the consensus 2d polygons that proseg outputs, not a particular voxel layer.
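One way to test whether a plotting tool has altered the polygons is to check that every edge is parallel to an axis. The sketch below does that on a GeoJSON FeatureCollection that has already been loaded into a Python dict. The `cell` property name is taken from the plotting code elsewhere in this thread. The function names and the tolerance are assumptions.

```python
def is_rectilinear(ring, tol=1e-9):
    """True if every edge of a closed coordinate ring is axis-parallel."""
    for p, q in zip(ring, ring[1:]):
        dx = abs(p[0] - q[0])
        dy = abs(p[1] - q[1])
        # A diagonal edge changes both x and y at once.
        if dx > tol and dy > tol:
            return False
    return True


def cells_with_diagonal_edges(feature_collection):
    """Return the `cell` IDs whose geometry has at least one diagonal edge."""
    flagged = []
    for feature in feature_collection["features"]:
        geom = feature["geometry"]
        if geom["type"] == "MultiPolygon":
            polygons = geom["coordinates"]
        else:  # treat anything else as a single Polygon
            polygons = [geom["coordinates"]]
        rings = [ring for polygon in polygons for ring in polygon]
        if not all(is_rectilinear(ring) for ring in rings):
            flagged.append(feature["properties"].get("cell"))
    return flagged


# Tiny hand-made example: one voxel-like square and one triangle.
example = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"cell": 0},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [2, 0], [2, 1], [0, 1], [0, 0]]]}},
        {"type": "Feature", "properties": {"cell": 1},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [1, 0], [0.5, 1], [0, 0]]]}},
    ],
}
print(cells_with_diagonal_edges(example))
```

If the file written by proseg passes this check but the converted file does not, the conversion step is the likely source of the modification.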

Sep 25 '24 20:09 dcjones

I used the Vizgen Visualiser, after converting the proseg .geojson to the .parquet file Vizgen requires with the Vizgen VPT post-processing tool.

Perhaps some modification occurs during these steps, so I also plotted the polygons in R:

library(sf)
library(ggplot2)

# Read the proseg outputs and drop the CRS so coordinates stay in microns
geojson <- st_read("/Users/lucmarsden/PROSEG/Region0/cell-polygons.geojson")
transcript_met <- read.csv("/Users/lucmarsden/PROSEG/Region0/transcript-metadata.csv")
cell_meta <- read.csv("/Users/lucmarsden/PROSEG/Region0/cell-metadata.csv")
st_crs(geojson) <- NA

# Keep only the cells and transcripts from FOV 60
cell_by_fov <- split(cell_meta$cell, cell_meta$fov)
cells <- unlist(cell_by_fov['60'])

filt_polygons <- subset(geojson, cell %in% cells)
filt_transcripts <- subset(transcript_met, fov == 60)

ggplot() +
  geom_sf(data = filt_polygons, color = "red") +
  geom_point(data = filt_transcripts, aes(x = x, y = y), size = 0.0001, alpha = 0.1) +
  theme_minimal()

[image: combined]

And in Python:

import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd

# Read the proseg polygons and drop the CRS
geojson_file = "/Users/lucmarsden/PROSEG/Region0/cell-polygons.geojson"
gdf = gpd.read_file(geojson_file)
gdf = gdf.set_crs(None, allow_override=True)

transcript_file = "/Users/lucmarsden/PROSEG/Region0/transcript-metadata.csv"
transcripts = pd.read_csv(transcript_file)

cell_meta_file = "/Users/lucmarsden/PROSEG/Region0/cell-metadata.csv"
cell_meta = pd.read_csv(cell_meta_file)

target_fov = 60

# Keep only the cells and transcripts from the target FOV
cells_by_fov = cell_meta[cell_meta['fov'] == target_fov]['cell']
filt_polygons = gdf[gdf['cell'].isin(cells_by_fov)]
filt_transcripts = transcripts[transcripts['fov'] == target_fov]

fig, ax = plt.subplots(figsize=(10, 10))
filt_polygons.plot(ax=ax, color='lightblue', edgecolor='black')
ax.scatter(filt_transcripts['x'], filt_transcripts['y'], color='red', s=1, alpha=0.5)
ax.set_aspect('auto')
plt.title(f"Polygons and Transcripts for FOV {target_fov}")
plt.show()

Do they look peculiar still? I still get multiple small polygons per single cell...

Cheers

Sep 26 '24 02:09 marsdenl

Ok, these look much more like what I'd expect to see.

Especially in cells with a relatively low number of transcripts, you might expect to see peculiar voxel "appendages", but proseg shouldn't be able to produce fully disconnected cell pieces. I can't tell if that's happening here or if those are just boundaries of tiny cells (which may also be a symptom of low transcript count). If it's definitely producing disconnected pieces frequently, I'll try to investigate and figure out why.
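A rough way to tell disconnected pieces apart from genuinely tiny cells is to count how many separate polygons each cell's geometry contains. This is only a sketch: it assumes each cell is one GeoJSON feature with a `cell` property (as in the plotting code earlier in the thread) and that a cell with several pieces is stored as a MultiPolygon. The function names are hypothetical.

```python
def count_pieces(feature):
    """Number of separate polygons stored in one GeoJSON feature."""
    geom = feature["geometry"]
    if geom["type"] == "MultiPolygon":
        return len(geom["coordinates"])
    if geom["type"] == "Polygon":
        return 1
    return 0


def fragmented_cells(feature_collection):
    """Map `cell` ID -> piece count, for cells stored as more than one piece."""
    result = {}
    for feature in feature_collection["features"]:
        pieces = count_pieces(feature)
        if pieces > 1:
            result[feature["properties"].get("cell")] = pieces
    return result


# Hand-made example: cell 7 is one square, cell 8 is two separate squares.
square_a = [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]
square_b = [[[5, 5], [6, 5], [6, 6], [5, 6], [5, 5]]]
example = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"cell": 7},
         "geometry": {"type": "MultiPolygon", "coordinates": [square_a]}},
        {"type": "Feature", "properties": {"cell": 8},
         "geometry": {"type": "MultiPolygon", "coordinates": [square_a, square_b]}},
    ],
}
print(fragmented_cells(example))
```

A piece count above one is only a hint, since parts that touch at a single corner may also be stored separately. An empty result would suggest the small outlines are separate tiny cells, not fragments of one cell.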

Sep 26 '24 23:09 dcjones

I think the disconnected cell pieces are still happening. I tried to find the same ROIs shown in an R Shiny app and those shown in Vizgen's visualiser. Although there are some good cells (see black/white circle), I'd say the majority are disconnected (red).

On R shiny: [image: Rshiny]

On merscope visualiser:

Is that something that can be changed through the initial settings, or is it a bug, do you reckon?

Cheers

Luc

Sep 27 '24 01:09 marsdenl

My suspicion is that the nuclear segmentation that proseg is initialized with oversegmented and split these into multiple cells. Currently proseg is somewhat anchored to this input. It can't merge or split cells, it only tries to infer better borders, so if it's initialized with too many cells it's not able to correct that and has to make do.

If you can, check the input segmentation and see if it has these same errors. If that's the case, the solution is probably to manually run cellpose on the nuclear stain, which unfortunately is a bit of a hassle, but I can probably share some code to help.
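One quick screen for oversegmentation is to list nuclei whose area is implausibly small. The sketch below assumes the input segmentation has been exported to a GeoJSON FeatureCollection with a `cell` property on each feature, which may not match how your MERSCOPE output is stored. The function names and the `min_area` threshold are hypothetical and would need tuning to the tissue.

```python
def ring_area(ring):
    """Unsigned area of a closed coordinate ring (shoelace formula)."""
    twice_area = sum(p[0] * q[1] - q[0] * p[1] for p, q in zip(ring, ring[1:]))
    return abs(twice_area) / 2.0


def feature_area(feature):
    """Total area of a Polygon or MultiPolygon feature, holes subtracted."""
    geom = feature["geometry"]
    if geom["type"] == "MultiPolygon":
        polygons = geom["coordinates"]
    else:  # treat anything else as a single Polygon
        polygons = [geom["coordinates"]]
    total = 0.0
    for polygon in polygons:
        exterior, holes = polygon[0], polygon[1:]
        total += ring_area(exterior) - sum(ring_area(h) for h in holes)
    return total


def small_cells(feature_collection, min_area):
    """Return `cell` IDs whose area is below min_area (same units squared)."""
    return [f["properties"].get("cell")
            for f in feature_collection["features"]
            if feature_area(f) < min_area]


# Hand-made example: a 2x2 square and a 0.5x0.5 square.
example = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"cell": 0},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [2, 0], [2, 2], [0, 2], [0, 0]]]}},
        {"type": "Feature", "properties": {"cell": 1},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [0.5, 0], [0.5, 0.5], [0, 0.5], [0, 0]]]}},
    ],
}
print(small_cells(example, min_area=1.0))
```

Many flagged nuclei clustered where the DAPI stain shows a single nucleus would be consistent with the oversegmentation described above.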

Sep 27 '24 16:09 dcjones

I see, that makes a lot of sense. I just checked and you're absolutely right. The initial input segmentation has the same errors. The black/white cell in the FOV above was the only cell with a good initial nuclear segmentation.

Re-doing cellpose might be the way to go! If you have some code to share, I'll gladly take you up on that offer but only if it's not too much hassle :)

Thanks

Luc

Sep 29 '24 02:09 marsdenl

Hi Luc,

Would you mind sharing how you updated the .vzg file using the VPT tool after running proseg? Thank you!

Kun

Nov 24 '24 14:11 KunHHE

Hey @KunHHE,

Sure! Vizgen's VPT convert-geometry function is compatible with .geojson files. It struggles a bit on Windows, but on a Mac I can simply run this on the CLI:

vpt --verbose convert-geometry \
--input-boundaries /path_to_input_proseg.geojson \
--output-boundaries /path_to_output_proseg.parquet

Once you have the .parquet file, run partition-transcripts to derive your cell_gene matrix:

vpt --verbose partition-transcripts \
--input-boundaries proseg.parquet \
--input-transcripts detected_transcripts.csv \
--output-entity-by-gene cell_by_gene.csv

And finally, update your .vzg file using your proseg.parquet file and your cell_gene matrix:

vpt --verbose update-vzg \
--input-vzg old.vzg \
--input-boundaries proseg.parquet \
--input-entity-by-gene cell_by_gene.csv \
--output-vzg new.vzg
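If you want to script the three steps, here is a small Python sketch that chains them. The subcommands and flags are exactly the ones shown in the commands; the helper name `vpt_commands` and the file names are placeholders.

```python
import shlex


def vpt_commands(geojson, parquet, transcripts, cell_by_gene, old_vzg, new_vzg):
    """Return the three VPT invocations, in the order they must run."""
    return [
        ["vpt", "--verbose", "convert-geometry",
         "--input-boundaries", geojson,
         "--output-boundaries", parquet],
        ["vpt", "--verbose", "partition-transcripts",
         "--input-boundaries", parquet,
         "--input-transcripts", transcripts,
         "--output-entity-by-gene", cell_by_gene],
        ["vpt", "--verbose", "update-vzg",
         "--input-vzg", old_vzg,
         "--input-boundaries", parquet,
         "--input-entity-by-gene", cell_by_gene,
         "--output-vzg", new_vzg],
    ]


commands = vpt_commands(
    geojson="cell-polygons.geojson",
    parquet="proseg.parquet",
    transcripts="detected_transcripts.csv",
    cell_by_gene="cell_by_gene.csv",
    old_vzg="old.vzg",
    new_vzg="new.vzg",
)
for cmd in commands:
    # Print only; swap in subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```

Running with `check=True` stops the chain at the first failing step, so update-vzg never runs against a missing .parquet file.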

It's quite important to note, however, that this is purely for visualisation purposes. Proseg does not keep the IDs from the initial MERSCOPE segmentation (new cell IDs are simply row numbers for each sample). But when converting geometries, it seems VPT assigns new IDs to the polygons. So when you look at your cell_gene matrix, or at the number of transcripts per cell when hovering over a cell in the MERSCOPE Visualiser, the numbers displayed are completely wrong. I've not found a way around this... Let me know if you do :)

Hope this makes sense. Let me know if not!

Nov 26 '24 00:11 marsdenl

Thanks very much! I did not know that VPT has convert-geometry. My laptop runs Windows, and I updated the .vzg file successfully with VPT.

Nov 26 '24 02:11 KunHHE

Fantastic :)


Nov 26 '24 04:11 marsdenl

Hi @marsdenl, do you mind if I keep asking some naive questions? I am new to running this pipeline. We generate .csv.gz and .geojson files from Proseg, then generate cell_by_gene.csv in VPT, so the matrix is pretty much ready for downstream analysis.

I have used the Seurat pipeline for MERSCOPE data, so before we create the Seurat project, which files have to go in? The pipeline I ran with the original Vizgen outputs used: (1) "cell_boundaries.parquet", (2) "cell_by_gene.csv", (3) "cell_metadata.csv", (4) "partitioned_transcripts.csv". Since cell_by_gene is updated with the new geojson coordinates from proseg, should we also get a new cell_metadata.csv and detected_transcripts.csv so we can run it in Seurat?

Sorry if these are too naive, but I want to do it correctly. Best!

Nov 26 '24 19:11 KunHHE

No worries at all. I've done my fair share of asking questions on GitHub too :)

Like I said earlier, there is a cell ID mismatch between Proseg and VPT, so any subsequent output from VPT, such as VPT's cell_metadata and cell_gene matrix, is wrong (at least in my hands, but do verify this yourself). So if using the Proseg segmentation, I would not use the cell_metadata.csv and cell_gene.csv files from VPT. I would use the cell_gene matrix from Proseg directly (I think it's called expected_counts.csv, or maxpost_counts.csv for integer values) and the cell_metadata.csv from Proseg as well. In both of those files the cell ID is the row number, so row 1 in cell_metadata and in the cell_gene matrix is cell #1.

I have not used Seurat for downstream analyses, but it seems Proseg's cell_metadata plus expected_counts.csv or maxpost_counts.csv might be enough. I am not sure what partitioned transcripts is. Is it detected transcripts? If so, I think it is optional, but it depends on why Seurat uses detected transcripts in the first place.

It's worth noting that you might have to do some formatting to read the Proseg outputs into Seurat, depending on what Seurat expects (column names, data types, etc.). Hope this helps!
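To illustrate the row-number pairing described above, here is a minimal sketch using only the Python standard library. It assumes the count matrix has one row per cell and one column per gene, and that the metadata has a `cell` column (as in the plotting code earlier in the thread). The function name and the toy data are made up, so check the real file names and columns in your own output.

```python
import csv
import io


def load_counts(counts_file, metadata_file):
    """Pair row i of the count matrix with the cell ID in row i of the metadata."""
    counts_rows = list(csv.DictReader(counts_file))
    meta_rows = list(csv.DictReader(metadata_file))
    if len(counts_rows) != len(meta_rows):
        raise ValueError("count matrix and metadata have different row counts")
    # Taking the ID from the metadata avoids guessing 0-based vs 1-based numbering.
    return {
        meta["cell"]: {gene: float(value) for gene, value in row.items()}
        for meta, row in zip(meta_rows, counts_rows)
    }


# Toy stand-ins for the two files; real files can be opened with
# gzip.open(path, "rt", newline="") if they are .csv.gz.
toy_counts = io.StringIO("GeneA,GeneB\n1.5,0\n0,2\n")
toy_metadata = io.StringIO("cell,fov\n0,60\n1,60\n")
counts_by_cell = load_counts(toy_counts, toy_metadata)
print(counts_by_cell)
```

The resulting dictionary is keyed by Proseg's own cell IDs, so nothing depends on the IDs VPT assigns during geometry conversion.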

Nov 26 '24 21:11 marsdenl