
Converting proseg output into Xenium Ranger compatible format distorts data

Open diala-ar opened this issue 1 year ago • 7 comments

Thanks so much for the amazing package! We tried proseg and we can see a clear improvement in the cell segmentation of our samples.

After running proseg, I successfully converted the proseg output into a Baysor-compatible format and then imported it into Xenium Ranger as explained in the proseg vignette. However, when visualizing the proseg data in Xenium Explorer, many cells appear to have numerous reads of a specific gene, yet the cell-feature matrix shows a count of 0 for that gene in those cells; note that those transcripts have a qv > 20. Any idea why I am getting this discrepancy?

[screenshot attached]

I used xenium ranger v2 (xeniumranger xenium-2.0.0.12) and proseg v1.0.3.

Thanks.

diala-ar avatar May 24 '24 19:05 diala-ar

This is odd, and I do see the same thing in xenium explorer with the counts not necessarily corresponding to what's shown.

I'll investigate this some more, but I should also note that I don't use, and don't necessarily trust, the counts generated by converting to a Xenium bundle; I mainly use the bundle for visualization purposes.

dcjones avatar May 25 '24 00:05 dcjones

Thanks @dcjones so much for your prompt response! I assume we can read the proseg output with an R or Python package other than Seurat. Could you please point me to a package I could use to read the proseg output and perform downstream spatial analyses (I am pretty new to spatial data analysis)? Do you think the problem is at the level of the Baysor conversion or the Xenium Ranger import? Thanks for your precious help!

diala-ar avatar May 27 '24 15:05 diala-ar

Yes, proseg itself will output a number of tables, most importantly:

  • expected-counts.csv.gz: Gives the cell-by-gene count matrix.
  • cell-metadata.csv.gz: Spatial cell positions and other information.

I don't have tutorials for analyzing the output on a specific platform, but you can easily read these tables into R or Python using the standard tools for reading CSV files (e.g. read.csv in R, pandas.read_csv in Python). I'll probably put up some more detailed tutorials at some point.
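
For example, a minimal sketch (assuming the default output file names, and assuming the rows of expected-counts.csv.gz follow the same cell order as cell-metadata.csv.gz):

counts    <- read.csv("expected-counts.csv.gz", check.names = FALSE)  # cells x genes, gene names as headers
cell_meta <- read.csv("cell-metadata.csv.gz")                         # cell, centroid_x, centroid_y, ...

# label the rows with proseg's 0-based cell IDs (relies on the same-order assumption above)
rownames(counts) <- paste0("cell-", cell_meta$cell)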

dcjones avatar May 28 '24 15:05 dcjones

Thanks again @dcjones for your help! I was able to read the proseg data into an R Seurat object; I will share the code at the end of this message. Below are some comments and questions concerning the proseg results:

  1. I noticed some differences between the proseg output and the Xenium bundle generated from the proseg data. Cell boundaries taken from the proseg output are very similar to those taken from the Xenium output, and sometimes identical. In other cases, as in the examples below, the proseg cell boundaries show some extra regions (encircled in red) compared to the Xenium cell boundaries. What is the source of this difference, and should we use the proseg output or the Xenium bundle?

[screenshot attached]

  2. What is the difference between the x, y, z columns and the observed_x, observed_y, observed_z columns in the transcript-metadata.csv.gz file of the proseg output, and which of these columns should we use?

  3. What does confusion mean in the transcript-metadata.csv.gz file? Is it related to the diffusion part? How should we filter transcripts in this file so that the number of reads in a given cell matches the number of reads in the expected-counts.csv.gz file? For instance, I tried the following filtering criteria and the numbers still do not match: background==0 & confusion==0 & probability>0.6 & gene does not contain 'Unassigned' or 'Control'. With these criteria I ended up with 886 reads, while this cell is expected to have 1014.662 reads! Do these criteria make sense?

Below is the code I used to read in proseg output into Seurat:

# requires Seurat, sf, sfheaders, and data.table
library(Seurat)
library(sf)

LoadXenium2 = function(proseg_dir, fov = "fov", assay = "Xenium") {
  data <- ReadXenium2(proseg_dir = proseg_dir)
  segmentations.data <- list(centroids = CreateCentroids(data$centroids), 
                             segmentation = CreateSegmentation(data$segmentations))
  coords <- CreateFOV(coords = segmentations.data, type = c("segmentation", "centroids"), 
                      molecules = data$microns, assay = assay)
  xenium.obj <- CreateSeuratObject(counts = data$matrix[['GeneExpression']], 
                                   assay = assay)
  xenium.obj[["BlankCodeword"]] <- CreateAssayObject(counts = data$matrix[["UnassignedCodeword"]])
  xenium.obj[["ControlCodeword"]] <- CreateAssayObject(counts = data$matrix[["ControlCodeword"]])
  xenium.obj[["ControlProbe"]] <- CreateAssayObject(counts = data$matrix[["ControlProbe"]])
  xenium.obj[[fov]] <- coords
  return(xenium.obj)
}

ReadXenium2 = function(proseg_dir) {
  
  expected_counts = read.csv(file.path(proseg_dir, 'expected-counts.csv.gz'))
  expected_counts = t(expected_counts)
  colnames(expected_counts) = paste0('cell-', 0:(ncol(expected_counts)-1))
  matrix = list(GeneExpression = expected_counts[-grep('Blank|Unassigned|Control', row.names(expected_counts)), ],
                ControlCodeword = expected_counts[grep('ControlCodeword', row.names(expected_counts)), ],
                ControlProbe = expected_counts[grep('ControlProbe', row.names(expected_counts)), ],
                UnassignedCodeword = expected_counts[grep('UnassignedCodeword', row.names(expected_counts)), ])
  
  cell_meta = read.csv(file.path(proseg_dir, "cell-metadata.csv.gz"))
  centroids = data.frame(x = cell_meta$centroid_x, 
                         y = cell_meta$centroid_y, 
                         cell = paste0('cell-', cell_meta$cell))
  
  cell_boundaries_sf = read_sf(file.path(proseg_dir, 'cell-polygons.geojson'))
  st_crs(cell_boundaries_sf) = NA
  cell_boundaries_df = sfheaders::sf_to_df(cell_boundaries_sf)
  cell_boundaries_df = data.frame(cell = paste0('cell-', cell_boundaries_df$sfg_id-1), 
                                  x = cell_boundaries_df$x,
                                  y = cell_boundaries_df$y)

  tx_dt <- as.data.frame(data.table::fread(file.path(proseg_dir, "transcript-metadata.csv.gz")))
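  # keep transcripts assigned to a cell (assignment == 4294967295 looks like an 'unassigned' sentinel) and not flagged as noise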
  df <- tx_dt[tx_dt$background==0 & tx_dt$confusion==0 & tx_dt$assignment!=4294967295 & tx_dt$probability>0.6, ]
  df <- data.frame(x = df$observed_x, 
                   y = df$observed_y, 
                   gene = df$gene)
  df = df[-grep('Unassigned|Control', df$gene), ]
  
  list(matrix=matrix, centroids=centroids, segmentations=cell_boundaries_df, microns=df)
}
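
For anyone reusing this, a call would look something like the following (the path is just a placeholder for your proseg output directory):

xenium.obj <- LoadXenium2("path/to/proseg_output")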

Thanks.

diala-ar avatar May 31 '24 20:05 diala-ar

  1. I think this is due to xenium explorer not supporting cells being multi-polygons, which is the case when you have these voxels that are only connected along the diagonal.
  2. By default, proseg will do transcript repositioning. This is described in the paper, but the idea is to try to slightly adjust positions of some transcripts. So the "observed" columns will correspond to the input data, and the x/y/z columns will correspond to where the transcript was repositioned by proseg.
  3. "confusion" is poorly named, but it's part of the noise model in proseg: either background or confusion being non-zero means the transcript is thought to be noise. There isn't a way to filter this table to exactly match the expected counts, but you can probably get close by weighting each transcript by the probability column (see the sketch after this list). I find that the expected counts work well, but there could certainly be a rationale for filtering the transcript table and making your own point estimate. For example, proseg does not know which genes/features are negative controls, so it does not automatically filter these out when counting.
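
A minimal sketch of that probability weighting (not proseg's own computation; it assumes the transcript-metadata.csv.gz columns used in the code shared above):

tx <- as.data.frame(data.table::fread("transcript-metadata.csv.gz"))

# keep transcripts assigned to a cell and not flagged as noise
tx <- tx[tx$background == 0 & tx$confusion == 0 & tx$assignment != 4294967295, ]
tx <- tx[!grepl("Unassigned|Control", tx$gene), ]  # drop negative controls manually

# how far each transcript was repositioned relative to its observed location
tx$shift <- sqrt((tx$x - tx$observed_x)^2 + (tx$y - tx$observed_y)^2)

# sum assignment probabilities per gene and cell rather than counting transcripts;
# this should land close to, but not exactly on, expected-counts.csv.gz
approx_counts <- xtabs(probability ~ gene + assignment, data = tx)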

dcjones avatar Jun 04 '24 17:06 dcjones

Hi @dcjones, thanks for developing such a cool tool! I have two questions for you, and they fit well in this thread.

  1. I was wondering how the proseg_to_baysor function works. Not the nitty-gritty, but the overall concept. I see that the output of proseg is a MultiPolygon file, where we can have "voxels that are only connected along the diagonal", as you described. By manually comparing the cell geometry, it seems that a cell with 3 polygons in proseg becomes a single polygon in baysor, and it looks like the conversion is "simply" taking the largest area for cells with multiple polygons. Is this true?
  2. Any specific reason why you don't trust the count data from xenium ranger after import-segmentation? I was planning to use it for both visualization and downstream analysis, QC, and so on. The problem is that importing the proseg data into Seurat in R has been a bit challenging: it complains about the raw data not being integers, which is inevitable given how proseg works (a possible rounding workaround is sketched below). So re-processing via Xenium Ranger would give me the integers I am looking for. @diala-ar, have you ever had any problems with the integer format when importing proseg to Seurat?
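
A possible workaround sketch for the non-integer complaint: round proseg's fractional expected counts before creating the object. Rounding discards the fractional information, so treat it as a stopgap:

library(Seurat)

# genes x cells matrix of fractional expected counts, as in the loader shared above
counts <- t(read.csv("expected-counts.csv.gz", check.names = FALSE))
colnames(counts) <- paste0("cell-", 0:(ncol(counts) - 1))

xenium.obj <- CreateSeuratObject(counts = round(counts), assay = "Xenium")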

luizalmeida93 avatar Sep 06 '25 01:09 luizalmeida93

  1. That's correct. Proseg produces MultiPolygons, which are usually a single polygon but can sometimes have disconnected parts. Baysor (and thus xeniumranger) only supports single polygons, so the rule used in the conversion is to just take the biggest piece (a rough sketch of this is included after this list).
  2. With proseg 2, this was complicated by proseg generating fractional counts, so Xenium Ranger had to work off different integer estimates. With proseg 3, I've tried to make this much more straightforward: proseg produces reasonable integer estimates by default, which should exactly match the counts you'd get after import-segmentation with xeniumranger.
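
For reference, a rough sketch of the largest-piece rule from point 1 (not the actual proseg_to_baysor code), using the cell-polygons.geojson output and the sf package:

library(sf)

cells <- read_sf("cell-polygons.geojson")

# split each MULTIPOLYGON into its component polygons and keep only the largest piece
keep_largest <- function(geom) {
  parts <- suppressWarnings(st_cast(st_sfc(geom), "POLYGON"))
  parts[[which.max(st_area(parts))]]
}

st_geometry(cells) <- st_sfc(lapply(st_geometry(cells), keep_largest), crs = st_crs(cells))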

dcjones avatar Sep 08 '25 15:09 dcjones