proseg Question regarding returned cell size distribution

Hi,

First of all, thank you for making this tool. From my initial usage, it seems to return great results.

I have a question regarding the returned cells. I have noticed that in the distribution of returned cell sizes there is a small portion of cells that are unrealistically bigger than the other cells, often encompassing several cells and their surrounding area.

I can filter these based on size, which removes those background cells to the point where I believe the segmentation is now usable.
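For reference, a minimal sketch of one way to do this kind of size filtering without hand-picking a cutoff: flag cells whose log-area sits far above the median, measured in scaled median absolute deviations (MAD). The function name, the `k=5.0` default, and the example areas are made up for illustration; none of this comes from proseg itself.

```python
import math
import statistics

def flag_oversized(areas, k=5.0):
    """Return one boolean per cell, True where the area is an upper outlier.

    Works on log-areas and flags values more than k scaled MADs above
    the median. k is an illustrative default, not a recommended value.
    """
    logs = [math.log(a) for a in areas]
    med = statistics.median(logs)
    mad = statistics.median([abs(x - med) for x in logs])
    scale = 1.4826 * mad  # scaled MAD approximates a standard deviation for normal data
    if scale == 0:
        return [False] * len(areas)
    return [(x - med) / scale > k for x in logs]

# Hypothetical cell areas: seven plausible cells and one very large one.
areas = [80.0, 95.0, 100.0, 110.0, 120.0, 105.0, 90.0, 2500.0]
mask = flag_oversized(areas)
kept = [a for a, big in zip(areas, mask) if not big]
```

Only the upper tail is flagged here, so small cells are left alone; whether that is appropriate depends on the tissue.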

However, I have some remaining questions:

Why would the algorithm return such big cells in the background of the image?
Is there a more precise way to filter out these cells? (I am currently using a minimum threshold)
Does the existence of these large background cells affect the returned segmentation?

I am running proseg with enforce-connectivity, as well as max-transcript-nucleus-distance=20.

Thank you for your help, Amit.

Jul 15 '25 21:07 a3klein

Hi Amit,

Because proseg tries to construct cell boundaries to explain the observed transcripts, when I see implausibly large cells there are usually a few possible causes:

(1) There should be multiple cells, but some were missed by the prior segmentation, so proseg expands the borders to account for the transcripts from the missing cells.
(2) Some quantity of transcripts have leaked from the cell, so proseg expands the border to account for these.
(3) Proseg may lack the resolution to resolve a very complex morphology (e.g. a neuron), and just infers something much larger to encompass the cell.

From the point of view of analyzing gene expression, (2) or (3) isn't a big deal, because the counts will be relatively accurate even if the morphology is questionable. (1) is probably a bigger concern, and if it looks like that's the issue, it might be worth using a different prior segmentation (e.g. cellpose with a less stringent threshold).

Jul 15 '25 23:07 dcjones

I agree with your reasons (2) and (3), and I do believe that the neuronal morphologies are probably causing some of the issues here. However, I don't believe that the prior segmentation missed so many cells; there are simply too many of those 'artifact' cells, and they cover ~>90% of properly segmented cells. It seems to me that those artifact cells encompass multiple smaller cells, which will lead to a 'non-biological' cell identity.

I tried two things: one was loosening the max-transcript-nucleus-distance parameter, and the second was lowering the number of components in the model without changing the distance parameter. Both of those changes led to the removal of those large artifact cells in the results. If I had to guess which result is more biologically representative (without doing any quantitative comparisons), I would think that lowering the ncomponents parameter is a better solution than allowing the cell boundaries to expand without limitation.

Jul 16 '25 17:07 a3klein

@a3klein How did you calculate cell sizes? On 2D polygons?
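In case it is useful for comparison, a minimal sketch of computing a cell's size from a 2D polygon with the shoelace formula. It assumes each boundary is available as an ordered list of (x, y) vertices for a simple (non-self-intersecting) polygon; how the polygons are loaded is left out, and the function name is made up for illustration.

```python
def polygon_area(vertices):
    """Area of a simple polygon given as ordered (x, y) vertices (shoelace formula)."""
    n = len(vertices)
    twice_area = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x0 * y1 - x1 * y0
    # The sign depends on vertex orientation, so take the absolute value.
    return abs(twice_area) / 2.0

# Hypothetical 10 x 10 square cell outline.
square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
area = polygon_area(square)
```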

Aug 17 '25 20:08 amnahsiddiqa

I ran into the same/similar issue with my Xenium data: neighboring cells that were not detected by the prior segmentation were swallowed up by the proseg segmentation, sometimes leading to merges of excitatory/inhibitory neurons etc. Reducing --max-transcript-nucleus-distance dramatically, from the default of 60 to 10, seems to largely solve this issue, with some exceptions in cells that are tightly packed. Reducing ncomponents did nothing with my data. In mouse brain tissue, I was not able to improve much on the prior segmentation with cellpose, relative to the native Xenium segmentation.

Some example segmentations: the smaller shape is the Xenium segmentation, the larger shape is proseg. Napari didn't want me to be able to change colors, sorry... See particularly the two cells in the center, which are merged except with ndist 10 (--max-transcript-nucleus-distance 10).

ncomponents 1:

ncomponents 5:

ndist 10:

ndist 20:

Aug 21 '25 14:08 Sverreg

That's an interesting example. Unfortunately, I don't yet have a great plan for how to improve this situation. Proseg does its best to explain the observed transcripts with the cells it's given. So if two cells are adjacent and only one is detected in the prior segmentation, it will generally expand the detected cell to explain the other cell's transcripts.

Excluding distant transcripts like you did is one way to improve things. I should also add better options to control cell size, which would help avoid some more extreme cases.

Aug 21 '25 16:08 dcjones

Hi,

I've noticed similar things. I think that my original issue stemmed from some initial cells in my cellpose segmentation either being wrongly segmented, or being cells for which there were DAPI stains but which the probe set did not specifically target (like white matter tracts in some brain tissues). I believe this caused some of the learned cell representations to correspond to the background signal in my image. A way around this that I have found is to perform a relaxed filtering on the number of transcripts detected in the initial segmentation, which gets rid of those "empty" cells and helps the proseg algorithm learn representations for the cell types you are interested in.
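A minimal sketch of that kind of relaxed filtering, assuming the prior segmentation is available as one cell label per transcript. The background label `0`, the function name, and the threshold of 10 are all assumptions for illustration; the actual label and a sensible threshold depend on the platform and probe set.

```python
from collections import Counter

BACKGROUND = 0  # assumed label for transcripts outside any prior cell

def drop_sparse_cells(cell_ids, min_transcripts=10):
    """Reassign transcripts of low-count prior cells to background.

    cell_ids holds one prior cell label per transcript. Cells with fewer
    than min_transcripts transcripts are treated as "empty" and removed.
    """
    counts = Counter(c for c in cell_ids if c != BACKGROUND)
    keep = {c for c, n in counts.items() if n >= min_transcripts}
    return [c if c in keep else BACKGROUND for c in cell_ids]

# Hypothetical labels: cell 1 has 12 transcripts, cell 2 has 3, cell 3 has 10.
prior = [1] * 12 + [2] * 3 + [0] * 5 + [3] * 10
filtered = drop_sparse_cells(prior, min_transcripts=10)
```

The filtered labels would then be used as the prior segmentation in place of the original ones.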

I also noticed that changing the ncomponents parameter had minimal effects on those artifacts, and that max-transcript-nucleus-distance was more effective at removing them. However, with proseg 3.0.4 I have been noticing that setting the max-transcript-nucleus-distance parameter to 10 results in ~50-60% of my transcripts being called as background transcripts. At the same time I noticed that proseg 3 always returns the same number of cells as the initial segmentation (I am using --prior-seg-reassignment-prob - 0.05). I am wondering whether this is an added constraint in the current version of proseg, and whether, instead of removing cells, it pseudo-removes them by setting all of the transcripts in a cell it deems unreliable to background. In all of my trials where the background transcripts are 50-60%, roughly half of all my cells get gene counts of 0.
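For anyone who wants to check the same two numbers on their own output, a minimal sketch of the diagnostic. It assumes the per-transcript assignments have already been loaded as a list of cell indices, with `None` marking background; that representation and the function name are assumptions for illustration, not proseg's output format.

```python
def summarize_assignment(assignments, n_cells):
    """Return (fraction of background transcripts, fraction of cells with 0 counts).

    assignments holds one entry per transcript: a cell index in
    range(n_cells), or None for a transcript called as background.
    """
    totals = [0] * n_cells
    n_background = 0
    for cell in assignments:
        if cell is None:
            n_background += 1
        else:
            totals[cell] += 1
    background_frac = n_background / len(assignments)
    empty_frac = sum(1 for t in totals if t == 0) / n_cells
    return background_frac, empty_frac

# Hypothetical run: 10 transcripts, 4 cells, two of which receive nothing.
assignments = [0, 0, 1, None, None, None, 1, None, 0, None]
bg, empty = summarize_assignment(assignments, n_cells=4)
```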

Having a parameter to control cell size (like the maximum dilation an initial segmentation can experience) while still retaining all the transcripts in an image might solve this issue, as it would allow the model to learn the background representations better and not overlap them with real cells.

Aug 21 '25 17:08 a3klein

I'd say the data is not perfect and neither can the segmentation be, but I'm very happy with my final outcome with ndist 10. Without improving my prior segmentation, I don't think it would be reasonable to expect a better result. That said, in most of these cases the boundary stretches out in one direction to capture a neighboring cell, so maybe there could be some constraint on circularity around the nucleus. Such a parameter might already exist for all I know! And this would not solve anything for two tightly packed nuclei where only one is detected in the prior segmentation.
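Until something like that exists, the same idea can be approximated after the fact. A minimal sketch, assuming each cell boundary is an ordered list of (x, y) vertices: compute the isoperimetric circularity 4πA/P², which is 1 for a circle and drops toward 0 for stretched shapes, and flag cells below some cutoff. This is a post-hoc metric, not a proseg option, and the function name and example shapes are made up for illustration.

```python
import math

def circularity(vertices):
    """Isoperimetric circularity 4*pi*A / P**2 of a simple polygon."""
    n = len(vertices)
    twice_area = 0.0
    perimeter = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x0 * y1 - x1 * y0          # shoelace term
        perimeter += math.hypot(x1 - x0, y1 - y0)  # edge length
    area = abs(twice_area) / 2.0
    return 4.0 * math.pi * area / perimeter ** 2

# Hypothetical outlines with equal area: a compact cell and a stretched one.
compact = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
stretched = [(0.0, 0.0), (100.0, 0.0), (100.0, 1.0), (0.0, 1.0)]
scores = [circularity(compact), circularity(stretched)]
```

A real neuron outline may be legitimately non-circular, so a low score is a reason to look at a cell, not proof that it swallowed a neighbor.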

Examples in the top left and bottom right of this cutout:

Aug 22 '25 11:08 Sverreg