Questions Regarding Proseg Outputs

Open jaspreetishar opened this issue 6 months ago • 1 comments

Hello,

Thank you for creating Proseg - it is a remarkably sophisticated and impressive method!

I have a few questions regarding the outputs generated by Proseg, as well as some clarifications related to the Nature paper:

Does Proseg create new cells or retain only those provided in the prior segmentations? According to the paper, Proseg can generate cells using only nuclear stains. However, in this GitHub issue (https://github.com/dcjones/proseg/issues/59), it was mentioned that Proseg does not have the capacity to introduce new cells. Could you please clarify?
In the output file transcripts-metadata.csv.gz (generated by Proseg for a public 10x Xenium dataset; https://www.10xgenomics.com/datasets/ffpe-human-pancreas-with-xenium-multimodal-cell-segmentation-1-standard), the assignment column includes a value of 4294967295, which does not appear in cell-metadata.csv.gz.

For reference, the command I used to run Proseg was:

 proseg Xenium_V1_human_Pancreas_FFPE_outs/transcripts.csv.gz --xenium

The corresponding background values in transcripts-metadata.csv.gz include both 1 and 0 for this assignment number. Should I treat these transcripts as background noise?

The total number of transcripts in transcripts-metadata.csv.gz (7,166,842) is lower than the number reported by Xenium (8,073,840). Is there a subsampling strategy or filtering step being applied by Proseg that accounts for this difference?
How would you recommend computing statistics such as the proportion of Proseg-assigned transcripts?
There appears to be a discrepancy in assigned transcript counts between transcripts-metadata.csv.gz and cell-metadata.csv.gz, as described in this issue (https://github.com/dcjones/proseg/issues/16). Has this issue been addressed?

Thank you very much in advance for your time and assistance!

Best, Jaspreet

Jun 02 '25 21:06 jaspreetishar

Hi Jaspreet,

Thanks for trying out proseg!

Proseg optimizes the boundaries of a fixed number of cells that are provided by the prior segmentation, which usually comes from segmenting a nuclear stain. So if your prior initialization has n cells, proseg will never output more than n, and will generally output slightly fewer (since often not every prior cell can be represented in the voxelization).
That's a special value (it's the maximum value of an unsigned 32-bit integer, 2^32 - 1) that's used internally to represent "unassigned". I do that for efficiency, but I realize it's confusing, so in the future I'll try to use NA.
There isn't any subsampling, but there is a filtering step that removes transcripts that are far away from any initializing cell center. In the latest version, on Xenium, transcripts with quality values below 20 are also filtered out.
The way I calculate proportion assigned is to just sum the count matrix. The likely confusing part about the assignment column is that it just records which cell a transcript overlaps, but the transcript is not counted when the background or confusion columns are 1.
It's not addressed yet, so the two files will likely disagree somewhat. I am working on simplifying the output, so it is going to be more consistent and easier to understand in the next major release.
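
As a rough illustration of the points above, here is a minimal Python sketch that tallies transcripts from transcripts-metadata.csv.gz. It is not part of proseg. It assumes the file has columns named assignment, background and confusion (the names used in this thread), that the flags are stored as 0 or 1, and that 4294967295 marks an unassigned transcript. The helper names summarize_assignments and summarize_file are hypothetical. Because the count matrix is the reference for the proportion assigned, the numbers from this sketch may differ somewhat from it.

```python
import csv
import gzip

# Sentinel for "unassigned": the maximum unsigned 32-bit integer.
UNASSIGNED = 2**32 - 1  # 4294967295


def summarize_assignments(rows):
    """Tally transcripts from an iterable of row dicts (hypothetical helper).

    Assumed columns: "assignment", "background", "confusion".
    """
    total = overlapping = counted = 0
    for row in rows:
        total += 1
        if int(row["assignment"]) == UNASSIGNED:
            continue  # transcript does not overlap any cell
        overlapping += 1
        # Assumption: a transcript contributes to a cell's counts only
        # when neither the background nor the confusion flag is set.
        if int(row["background"]) == 0 and int(row["confusion"]) == 0:
            counted += 1
    return {
        "total": total,
        "overlapping": overlapping,
        "counted": counted,
        "proportion_counted": counted / total if total else 0.0,
    }


def summarize_file(path):
    """Read a gzipped CSV such as transcripts-metadata.csv.gz."""
    with gzip.open(path, "rt", newline="") as handle:
        return summarize_assignments(csv.DictReader(handle))


# Tiny made-up rows, for illustration only (not real data).
demo_rows = [
    {"assignment": "0", "background": "0", "confusion": "0"},
    {"assignment": "0", "background": "1", "confusion": "0"},
    {"assignment": "7", "background": "0", "confusion": "1"},
    {"assignment": "4294967295", "background": "1", "confusion": "0"},
]
demo = summarize_assignments(demo_rows)

# Transcripts removed by filtering, using the totals quoted in the question.
xenium_total = 8_073_840
proseg_total = 7_166_842
removed = xenium_total - proseg_total
removed_fraction = removed / xenium_total
```

To run it on real output, call summarize_file("transcripts-metadata.csv.gz"). Note that the denominator here is the number of transcripts that survived filtering, not the total reported by Xenium, so decide which one fits the statistic you want to report.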

Let me know if you have other questions!

Jun 03 '25 18:06 dcjones