v3 segmentation results differ from v2—how to replicate v2 behavior?
Thank you for developing this great method! I’ve been using Proseg version 2 (v2.0.2) extensively and have been very satisfied with its results. Proseg 3.0.3 is significantly faster, but its segmentation differs noticeably from v2 on the public Xenium BC dataset when run with the default command proseg --xenium transcripts.parquet (BC_public_transcripts.parquet.zip, see image below). I tried changing the cell-compactness parameters but was not able to replicate the previous results.
Could you advise how I can tweak parameters to restore v2-like behavior?
And one more question: how do you compute counts from transcript assignments? The code below does not reproduce your counts.
import pandas as pd

df = pd.read_csv(proseg_path / "transcript-metadata.csv.gz", compression="gzip")
# estimated counts
X_est = pd.pivot_table(
df,
index="assignment",
columns="gene",
values="probability",
aggfunc="sum",
fill_value=0.0,
)
# integer counts
X = X_est.round().astype(int)
Proseg 3 has a different prior on morphology. I have seen it produce odd cells like these when transcript density is quite sparse, but nothing this bad. I'll see if I can reproduce the issue.
You should be able to reproduce the counts from the transcript metadata if you filter out transcripts where the background column is true. You can also just count transcripts, rather than summing probabilities. So something like this:
df = pd.read_csv(proseg_path / "transcript-metadata.csv.gz", compression="gzip")
df_foreground = df[~(df["assignment"].isnull() | df["background"])]
# transcript counts per cell and gene
X_est = pd.pivot_table(
df_foreground,
index="assignment",
columns="gene",
aggfunc="size",  # count rows per (assignment, gene) pair
fill_value=0,
)
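As a quick sanity check of the filtering described above, here is a minimal sketch on made-up data. Only the column names assignment, gene, and background come from the transcript metadata; the values are invented, and pd.crosstab is used here as an equivalent way to count rows rather than anything Proseg itself does.

```python
import pandas as pd

# Toy stand-in for transcript-metadata.csv.gz; the values are invented.
df = pd.DataFrame({
    "gene": ["A", "A", "B", "B", "A", "B"],
    "assignment": [0, 0, 0, 1, None, 1],
    "background": [False, False, True, False, False, False],
})

# Keep transcripts that are assigned to a cell and not flagged as background.
df_foreground = df[df["assignment"].notnull() & ~df["background"]]

# Count transcripts per (cell, gene) pair.
cells = df_foreground["assignment"].astype(int)
X = pd.crosstab(cells, df_foreground["gene"])
print(X)
```

Here the background transcript and the unassigned transcript are both dropped, so four of the six rows contribute to the counts.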
Actually, could you tell me how the transcripts subset you shared here was constructed? I see the coordinates are not the same as what was in the original file.
Thank you for looking into this!
I subset the Xenium data based on a binary mask (np.ndarray) and shift to origin like this:
import numpy as np

rows = np.any(binary_mask, axis=1)
cols = np.any(binary_mask, axis=0)
row_min, _ = np.where(rows)[0][[0, -1]]
col_min, _ = np.where(cols)[0][[0, -1]]
x_px = np.rint(transcripts_df["x_location"].values).astype(int)
y_px = np.rint(transcripts_df["y_location"].values).astype(int)
# subset to transcripts within binary mask
keep = binary_mask[y_px, x_px] > 0
transcripts_in_polygon_df = transcripts_df[keep].copy()
# Shift to origin
shift_x = col_min
shift_y = row_min
transcripts_in_polygon_df["x_location"] -= shift_x
transcripts_in_polygon_df["y_location"] -= shift_y
Your suggestion for computing counts works well in Proseg v3. In Proseg v2, however, I can’t seem to get the same number of cells — maybe I’m mishandling the conversion from estimated to integer counts?
df = pd.read_csv(proseg_path / "transcript-metadata.csv.gz", compression="gzip")
counts = pd.read_csv(proseg_path / "expected-counts.csv.gz", compression="gzip")
df_foreground = df[~((df["assignment"].isnull()) | (df["background"]==1))]
# counts via transcripts
X_tx = pd.pivot_table(
df_foreground,
index="assignment",
columns="gene",
aggfunc="size",
fill_value=0
)
# round to integer counts
counts_int = counts.round(0).astype(int)
n_nonzero_cells_in_counts = (counts_int.sum(axis=1) != 0).sum()
n_cells_tx = X_tx.shape[0]
if n_nonzero_cells_in_counts != n_cells_tx:
    print("Number of cells do not match")
n_cells_tx is 1710 and n_nonzero_cells_in_counts is 1636.
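One thing that might contribute, though this is an untested guess and not something confirmed for Proseg v2: rounding each entry of the expected counts before summing can turn a cell with a small positive total into an all-zero row. A minimal sketch with invented numbers (the values and the gene names A, B, C are hypothetical):

```python
import pandas as pd

# Invented expected counts for three cells (rows) and three genes (columns).
counts = pd.DataFrame(
    {"A": [0.4, 2.6, 0.0], "B": [0.4, 0.2, 0.0], "C": [0.4, 0.0, 0.0]}
)

# Same check as above: round every entry, then look for nonzero rows.
counts_int = counts.round(0).astype(int)
nonzero_after_rounding = (counts_int.sum(axis=1) != 0).sum()

# Alternative: test the unrounded row totals instead.
nonzero_before_rounding = (counts.sum(axis=1) > 0).sum()

print(nonzero_after_rounding, nonzero_before_rounding)
```

The first cell has an expected total of 1.2, but every entry rounds to 0, so it counts as empty after rounding while still being nonzero before rounding.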