proseg cell filtering performed during proseg-to-baysor?

i have an issue that seems related to #73 but is probably just me not understanding the output.

as far as i can tell from the proseg output (cell-metadata, cell-polygons, expected-counts, etc.), which was run with "--min-qv 20", my sample has 419759 cells. however, after running proseg-to-baysor, baysor-cell-polygons.geojson contains only 418062 geometries (although the highest-indexed "cell" in the json file is in fact 419758, 0-based). so 1697 cells are not making it into the baysor version of the data.

thus when I attempt to add the segmentation to Xenium for visualization, I obviously have a mismatch with my data (since I loaded from expected-counts). this prevents me e.g. from uploading a new cell annotation into the Xenium folder so my users can visualize it.

one solution is for me to identify the cell IDs that have been removed and also remove them from my analysis.

another possible solution is to load from the new Xenium outs I created from the new segmentation with cellranger, but strangely when i run dimension reduction on the data from the new Xenium outs, it looks very different from running it directly from expected-counts (lots of spurious-looking loop shapes on edges of UMAP).

but in general i'd like to understand why this is happening. are these examples of multi polygons or something?

thanks for your work on this tool!

Jun 27 '25 13:06 dpschreiner

proseg-to-baysor will lose a small number of cells because it filters out cells that have zero assigned transcripts. This really wasn't by choice, but because including them caused errors with xeniumranger. So it's a somewhat unfortunate workaround.

I've been told that the next xeniumranger version will have a way to keep track of the imported cell ids, which should make it much easier to match things up after importing.
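
Until imported cell ids are tracked, one way to line things up is to diff the cell indices on both sides. This is a minimal sketch, not part of proseg. It assumes the Baysor-style GeoJSON is a GeometryCollection whose `geometries` entries each carry an integer `cell` field (as described in the question) and that proseg cells are indexed 0 to n-1. The function names are hypothetical.

```python
import json  # only needed when loading a real file, see the note below


def kept_cell_ids(geojson: dict) -> set:
    """Collect the "cell" value of every geometry in the collection."""
    return {int(g["cell"]) for g in geojson["geometries"]}


def dropped_cell_ids(n_cells: int, geojson: dict) -> list:
    """Return the 0-based cell indices in range(n_cells) that have no geometry."""
    return sorted(set(range(n_cells)) - kept_cell_ids(geojson))


# Toy input standing in for baysor-cell-polygons.geojson (made-up data).
toy = {
    "type": "GeometryCollection",
    "geometries": [
        {"type": "Polygon", "coordinates": [], "cell": 0},
        {"type": "Polygon", "coordinates": [], "cell": 2},
        {"type": "Polygon", "coordinates": [], "cell": 3},
    ],
}

missing = dropped_cell_ids(4, toy)
print(missing)  # [1]
```

For a real run, `geojson` would come from `json.load` on baysor-cell-polygons.geojson and `n_cells` from the number of rows in the cell metadata. Whether the row order of expected-counts matches the `cell` index is an assumption worth checking before dropping rows.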

The UMAP will look different because the expected counts are not something that can be imported by xeniumranger, and won't agree exactly with the integer point estimate that it uses. This is something that will be better in proseg 3, which I'll hopefully release soon. That version will report a better integer point estimate that is consistent with all the metadata and can be imported losslessly.
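
To make the mismatch concrete, here is a toy illustration with made-up numbers. Plain rounding is used only as a stand-in for an integer estimate; it is an assumption for the example, not what proseg or xeniumranger actually does.

```python
# Toy expected counts: rows are cells, columns are genes (made-up values).
expected = [
    [0.4, 0.4, 1.7],
    [2.6, 0.0, 0.3],
]

# Stand-in integer estimate: round each entry (assumption, see above).
integer = [[round(x) for x in row] for row in expected]

# Per-cell totals under each representation.
expected_totals = [sum(row) for row in expected]
integer_totals = [sum(row) for row in integer]

print(integer)         # [[0, 0, 2], [3, 0, 0]]
print(integer_totals)  # [2, 3]
```

The first cell has an expected total of about 2.5 but an integer total of 2, and two of its genes drop to zero. Small per-cell shifts like this feed into normalization and dimension reduction, so the two matrices need not give the same embedding.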

So sorry, the interop with 10X's software is a little annoying right now, but it should be getting better soon.

Jul 02 '25 00:07 dcjones

Dear Dr. Jones,

Thank you for your outstanding work on Proseg. I have a related question. In the attached image, the left UMAP shows integrated data from 7 colorectal cancer patients (Xenium v1 hMulti panel) built directly from Proseg v2.0.4 outputs, followed by anndata creation. The right UMAP was generated using the Proseg v2.0.4 → proseg-to-baysor → xeniumranger pipeline. Interestingly, cluster separation is clearly better with the direct Proseg output.

Is this discrepancy something that’s addressed in the development version of Proseg 3? Would you recommend switching to the dev version now, or would it be better to wait for the stable release?

Jul 19 '25 09:07 victormanna

This may be better in proseg 3. The UMAPs will look slightly different after importing because proseg outputs expected counts, which can be non-integers, and that's not something xenium can import. I'm trying to use a different strategy in proseg 3 to estimate integer count matrices that should work better with the xenium ranger/explorer pipeline.

Proseg 3 is usable now, but there are some changes to the arguments and outputs which I still need to document, and I need to test it more, so I wouldn't necessarily recommend switching yet unless highly motivated to.

Jul 21 '25 16:07 dcjones

Hi @dcjones, many thanks for your reply. Yes, I had previously come across your explanations on the expected non-integer counts in another thread. I’ll wait for Proseg 3, and I noticed that the current documentation already includes new flags like --voxel-layers 4 for 3D segmentation.

Once Proseg 3 is out, I’ll open a new thread to ask for your advice on optimal use of those flags.

In the meantime, here’s my current setup.

For Xenium v1:

proseg transcripts.csv.gz \
  --xenium \
  --nthreads 32 \
  --min-qv 20 \
  --prior-seg-reassignment-prob 0.3 \
  --no-z-layer-doubling \
  --nbglayers 1 \
  --output-expected-counts expected_counts.csv.gz \
  --output-cell-metadata cell_metadata.csv.gz \
  --output-transcript-metadata transcript_metadata.csv.gz \
  --output-cell-polygons cell_polygons.geojson.gz

And for Xenium Prime:

proseg transcripts.parquet \
  --xenium \
  --nthreads 32 \
  --min-qv 20 \
  --no-z-layer-doubling \
  --nbglayers 1 \
  --prior-seg-reassignment-prob 0.3 \
  --output-expected-counts expected_counts.parquet \
  --output-cell-metadata cell_metadata.parquet \
  --output-transcript-metadata transcript_metadata.parquet \
  --output-cell-polygons cell_polygons.geojson.gz

Best Regards, S Manna

Jul 22 '25 12:07 victormanna

This seems fine, but you've disabled any 3d segmentation here with --nbglayers 1 and --no-z-layer-doubling. Allowing some 3d segmentation usually improves the accuracy of transcript assignments in my testing on xenium.

Jul 22 '25 16:07 dcjones