
Clarification on voxel layers setting and geojson for visualization of cell segmentation

aubreyhouser27 opened this issue 4 months ago

I've been testing Proseg on a few of my MERFISH data sets and have seen encouraging improvements in the results. While I've been able to use the outputs to generate objects for downstream clustering and analysis, I'm still struggling to re-format the cell boundary outputs to plug back into the MERSCOPE VPT command line and update the visualization file so I can browse the data that way. Can you please clarify two things:

  1. I have 8 z-planes of data in each experiment. When running Proseg, should I adjust --voxel-layers to 8 rather than the default of 4? Can you clarify what --voxel-layers specifies and whether it's recommended to adjust this parameter? I have run the data with both the default of 4 and with 8, and my cell-polygon-layers.geojson files contain 4 and 32 layers, respectively. I originally re-ran with 8 because I believed this would also make the layers in my final geojson 8, representing the polygon boundary coordinates for each cell across the 8 z-stacks. I'm confused why it's 32 and what this setting should be. I'm also unsure whether 'layers' in this file corresponds to the layer in the z-stack (which is what I was assuming).

  2. With the above-mentioned cell-polygon-layers.geojson, I then generate a parquet file to read in for the final step of making the visualization file. I've done this previously with Baysor using 3D segmentation, with the z-layers specified in the parquet. Is this method also okay for visualizing the Proseg results? I'm unsure whether I should use the 3D cell-polygon-layers.geojson or whether it's better to use the 2D cell-polygons.geojson file.

Thank you!

aubreyhouser27 avatar Aug 06 '25 19:08 aubreyhouser27

  1. This was a source of confusion because --voxel-layers sets the initial number of layers, but the resolution increases during sampling to arrive at 4 times that number (which is why starting with 8 layers produced 32 output layers). I just released proseg 3, where --voxel-layers does what's expected and corresponds to the number of output layers.
  2. I think this should be fine. The cell polygon layers file is essentially the true cell boundaries that proseg infers. The flattened cell polygons (--cell-polygons) are non-overlapping consensus 2D polygons that are useful mainly for visualization.
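To make the layered-vs-flattened distinction concrete, here is a toy base-R sketch. The `cell` and `layer` field names are illustrative assumptions, not proseg's exact geojson schema: the layered output contributes one polygon per (cell, z-layer) pair, while the flattened output has one consensus polygon per cell.

```r
# Toy stand-in for cell-polygon-layers.geojson features: one polygon
# per (cell, layer) pair. Field names are assumptions, not proseg's schema.
layered <- list(
  list(cell = 0, layer = 0), list(cell = 0, layer = 1),
  list(cell = 0, layer = 2), list(cell = 1, layer = 0),
  list(cell = 1, layer = 1), list(cell = 1, layer = 2)
)

# Toy stand-in for the flattened cell-polygons.geojson: one consensus
# polygon per cell.
flattened <- list(list(cell = 0), list(cell = 1))

n_layers <- length(unique(vapply(layered, function(f) f$layer, numeric(1))))
cat("distinct z-layers in layered output:", n_layers, "\n")    # 3
cat("polygons in flattened output:", length(flattened), "\n")  # 2
```

So with 8 voxel layers in proseg 3, one would expect 8 distinct layer values per cell in the layered file, and a single polygon per cell in the flattened one.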

Let me know if I can clarify anything else.

dcjones avatar Aug 07 '25 17:08 dcjones

Thank you! I've re-run some of my samples using proseg 3 and see the updated change with the voxel layers. I do have a couple of follow-up questions, though, based on the changes to the outputs generated in proseg 3:

For comparing across segmentation methods, I've been calculating the percentage of assigned transcripts (similar to what is done in Figure 2A of Jones et al.). With the proseg 2 outputs I was filtering cells with >10 transcripts and volume >100, then calculating the percentage of assigned transcripts using 'population' in the cell metadata and the number of rows in the transcript metadata file (tx metadata file = 'detected_tx').

```r
cell_metadata$population <- as.numeric(cell_metadata$population)
cell_metadata$volume <- as.numeric(cell_metadata$volume)

cell_metadata_filt <- cell_metadata[cell_metadata$population > 10 & cell_metadata$volume > 100, ]

pct.tx.within.cell <- sum(cell_metadata_filt$population) / nrow(detected_tx)
```

In proseg 3 there is no 'population' transcript-count assignment to cells in the cell metadata. I've tried the following as an alternative for calculating the percentage of transcripts, using the cell × gene matrix and the cell and gene metadata files:

```r
library(data.table)
library(Matrix)

# ---- paths ----
mtx_path  <- "path/expected-counts.mtx"
cell_meta <- "path/cell-metadata.csv"
gene_meta <- "path/gene-metadata.csv"

# ---- read metadata ----
cells <- fread(cell_meta)
genes <- fread(gene_meta)

n_cells <- nrow(cells)
n_genes <- nrow(genes)

# ---- read sparse matrix ----
M <- readMM(mtx_path)  # sparse dgTMatrix

# Detect orientation
if (nrow(M) == n_genes && ncol(M) == n_cells) {
  assigned_per_cell <- Matrix::colSums(M)  # genes x cells
  expected_by_gene  <- Matrix::rowSums(M)
} else if (nrow(M) == n_cells && ncol(M) == n_genes) {
  assigned_per_cell <- Matrix::rowSums(M)  # cells x genes
  expected_by_gene  <- Matrix::colSums(M)
} else {
  stop(sprintf("Matrix dims %d x %d don't match cells=%d, genes=%d",
               nrow(M), ncol(M), n_cells, n_genes))
}

## added some checks:
# 1) Matrix total equals sum of per-cell equals sum of per-gene
stopifnot(isTRUE(all.equal(sum(M), sum(assigned_per_cell))),
          isTRUE(all.equal(sum(M), sum(expected_by_gene))))

# 2) Per-gene totals match the gene metadata, if present
if ("expected_assigned_count" %in% names(genes)) {
  ok <- all.equal(as.numeric(expected_by_gene),
                  as.numeric(genes$expected_assigned_count),
                  tolerance = 1e-6)
  print(ok)  # should be TRUE (tiny float diffs ok)
}

# 3) Attach to cells
cells[, expected_assigned := as.numeric(assigned_per_cell)]

# Find the volume column
vol_col <- grep("^vol|volume$", names(cells), ignore.case = TRUE, value = TRUE)
if (length(vol_col) == 0) stop("Couldn't find a volume column in cell metadata.")
setnames(cells, vol_col[1], "volume")

# ---- filter cells: keep >=10 transcripts AND volume >=100 ----
cells_filt <- cells[expected_assigned >= 10 & volume >= 100]

# ---- numerator: transcripts assigned to KEPT cells ----
num_assigned_kept <- sum(cells_filt$expected_assigned)

# ---- denominator: total detected transcripts in the dataset ----
# (cell + background) from gene metadata
total_detected <- sum(genes$total_count)

# ---- percentage assigned to a cell after filtering ----
pct_assigned_after_filter <- 100 * num_assigned_kept / total_detected

# Baseline before filtering
num_assigned_all <- sum(cells$expected_assigned)
pct_assigned_before_filter <- 100 * num_assigned_all / total_detected

list(
  total_detected         = total_detected,
  assigned_before_filter = num_assigned_all,
  pct_before_filter      = pct_assigned_before_filter,
  assigned_after_filter  = num_assigned_kept,
  pct_after_filter       = pct_assigned_after_filter,
  n_cells_kept           = nrow(cells_filt),
  n_cells_total          = n_cells
)
```

I see a large discrepancy in the results I'm getting on the same sample after running proseg 2 versus proseg 3. For two of my samples the percentages dropped from the 90s to the mid-30s and the 70s when switching to proseg 3. As such, my questions are the following:

1. Do you have a recommended way to accurately calculate the percentage of assigned transcripts within cells after filtering? I'm questioning if I'm grabbing the best metadata columns for calculating this metric.

2. Is such a large difference in assigned transcripts between proseg 2 and 3 expected? My expectation was that the switch would improve accuracy, so I'm surprised by this large decrease in assigned transcripts.

Thank you for the help!

aubreyhouser27 avatar Aug 12 '25 19:08 aubreyhouser27

  1. I would do something like this:

```r
library(Matrix)

# You can also read the whole table with `read.csv`, but this is faster
# if you just need the number of transcripts
ntranscripts <- length(count.fields("transcript-metadata.csv.gz", skip = 1))

X <- readMM("expected-count.mtx.gz")
cells <- read.csv("cells.csv.gz")

# filter output
mask <- (rowSums(X) >= 10) & (cells$volume > 100)
cells <- cells[mask, ]
X <- X[mask, ]

proportion_assigned <- sum(X) / ntranscripts
```
  2. This discrepancy might be explained by a change in how volume is calculated. In proseg 3 the z coordinate is always normalized to [0, 1], which may make the volume numbers smaller. Do you see the same discrepancy without the volume filter?
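One quick way to test that suggestion is to compute the assigned proportion with and without the volume filter side by side. A minimal sketch with synthetic data standing in for the real outputs (swap in `readMM("expected-count.mtx.gz")` and the real cell metadata; the matrix dimensions, volumes, and background count here are made up):

```r
library(Matrix)

set.seed(1)
# Synthetic stand-ins: 200 cells x 50 genes, plus hypothetical volumes
X <- Matrix(rpois(200 * 50, 2), nrow = 200, sparse = TRUE)
volume <- runif(200, 50, 500)
ntranscripts <- sum(X) + 1000  # pretend 1000 transcripts are background

count_mask  <- rowSums(X) >= 10
volume_mask <- volume > 100

with_volume    <- sum(X[count_mask & volume_mask, ]) / ntranscripts
without_volume <- sum(X[count_mask, ]) / ntranscripts

# A big gap between these two numbers would point at the volume rescaling
# (z normalized to [0, 1] in proseg 3) rather than at the segmentation itself.
cat("with volume filter:   ", with_volume, "\n")
cat("without volume filter:", without_volume, "\n")
```

If the gap largely disappears without the volume filter, re-calibrating the volume threshold for proseg 3's rescaled units should bring the proseg 2 and proseg 3 numbers back into line.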

dcjones avatar Aug 13 '25 13:08 dcjones