Use with high axial sampling 3D data?
Congrats on the paper being out! The comparison to existing methods is interesting and we certainly see an inflation of cell numbers at times using Baysor on our data.
We generate 3D spatial transcriptomics data using custom microscope(s) and decoding package. We are comfortable either changing our output or figuring out what flags to set to import our transcript information into proseg.
That said, our data is densely 3D sampled (Nyquist sampling). For example, we have zyx voxel sizes of [0.23, 0.115, 0.115] microns from one instrument, over about 30 microns. That yields on the order of 130 z-planes in a FOV. It is not immediately clear to us from the documentation how best to set up a run for such data. The plexity of our panels is about 300 genes. Any advice or guidance on where to look in the codebase is welcome!
Thanks! This is an interesting use case!
As far as input format goes, it's pretty general, but you minimally need a CSV file that has the following columns (where the column name is given with the corresponding argument in parentheses):
- x (--x-column)
- y (--y-column)
- z (--z-column)
- gene (--gene-column)
- prior cell assignment (--cell-id-column)
You'll also need to use --cell-id-unassigned to specify what value in the cell id column represents unassigned. Prior cell assignments can be pretty approximate.
Coordinates are assumed to be in microns by default (there is a --coordinate-scale parameter to scale them to microns).
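For concreteness, here's a minimal sketch of writing transcripts in that shape. The column names are arbitrary examples of my own, not required names; they just need to match whatever you pass to the --x-column/--y-column/--z-column/--gene-column/--cell-id-column flags, with -1 as the unassigned sentinel:

```python
import csv

# Minimal example input: one row per transcript, with coordinates in
# microns, a gene label, and an approximate prior cell id
# (-1 meaning unassigned, to match --cell-id-unassigned -1).
transcripts = [
    (12.4, 33.1, 5.2, "Gad1", 17),
    (12.9, 33.0, 5.4, "Slc17a7", 17),
    (80.2, 10.5, 2.1, "Aqp4", -1),  # transcript outside any prior cell
]

with open("transcripts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["global_x", "global_y", "global_z", "gene_id", "cell_id"])
    writer.writerows(transcripts)
```

A run would then pass --x-column global_x --y-column global_y --z-column global_z --gene-column gene_id --cell-id-column cell_id --cell-id-unassigned -1.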
There is probably some necessary tinkering to get good results on proper 3D data, mainly in figuring out the right number of voxel layers to use. Proseg uses a strategy of starting at a lower resolution and increasing resolution as it goes. There are a few arguments to control how that works:
- --voxel-layers: number of z-axis layers to initialize with
- --initial-voxel-size: size (x/y side length in microns) to initialize with
- --schedule: a comma-separated list of iteration counts; between each count, resolution is increased by halving the voxel size
- --no-z-layer-doubling: don't double resolution on the z-axis
I might start off by trying a relatively large number of z-axis layers and disabling z-layer doubling. But I'm not sure what would work best.
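To make the coarse-to-fine arithmetic concrete, here's a small illustration of my own (not proseg internals) of how the x/y voxel size and z-layer count would evolve across schedule stages, assuming the size halves between stages and the z-layer count doubles unless --no-z-layer-doubling is set:

```python
def schedule_stages(initial_voxel_size, schedule, voxel_layers, z_doubling=True):
    """Illustrative only: per-stage (iterations, x/y voxel size in microns,
    z-layer count), halving the x/y voxel size between stages and, unless
    z_doubling is disabled, doubling the z-layer count as well."""
    stages = []
    size, layers = initial_voxel_size, voxel_layers
    for iterations in schedule:
        stages.append((iterations, size, layers))
        size /= 2.0               # resolution doubles between stages
        if z_doubling:
            layers *= 2           # z resolution doubles too, by default
    return stages

# e.g. --initial-voxel-size 4 --schedule 100,100,200 --voxel-layers 130 --no-z-layer-doubling
for it, size, layers in schedule_stages(4.0, [100, 100, 200], 130, z_doubling=False):
    print(f"{it} iterations at {size:g} um x/y voxels, {layers} z layers")
```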
You may want to disable transcript repositioning on a first attempt (--no-diffusion). The prior used here allows a lot of movement on the z-axis. Much to my shame, that's hard coded in the current version, here: https://github.com/dcjones/proseg/blob/2400a2b6a57eb900fc2d4c571cbbb89c1c2aec64/src/main.rs#L775.
One potential issue is that the current release of proseg is not super memory efficient, particularly with a lot of voxel layers. The new version I'm working on should be much better in this regard. That's in the proseg3 branch. It's usable, but I'm still working on it. I can help you run that if performance/memory becomes an issue.
Let me know how it goes, and if I can help troubleshoot!
Thanks! We segment cells (polyDT marker), not nuclei, for our initial assignment. It looks like we should use the --use-cell-initialization flag?
We have a high-memory server, so we can give it a try on that if we run into issues on our standard workstation.
That shouldn't be needed. That flag tells it to ignore the cellular compartment column, but if your input doesn't have that column, it'll work just the same.
We are coming back around to trying this with the newest updates you've made.
We have a 16-plex smFISH experiment in cell culture that we performed using our 3D approach. The genes are fairly highly expressed, so we have a lot of transcripts. The data voxel size in zyx order is [0.315, 0.098, 0.098] microns. We were able to load it into proseg no problem using the following command:
```shell
~/Documents/github/proseg/target/release/proseg \
    --gene-column gene_id -x global_x -y global_y -z global_z \
    --diffusion-probability 0.0 \
    --voxel-layers 15 \
    --cell-id-column cell_id --cell-id-unassigned -1 \
    -t 32 \
    --density-bins 2 \
    --enforce-connectivity \
    --output-path /media/dps/data/qi2lab/20251028_bartelle_smFISH_mm_microglia_newbuffers/qi2labdatastore/all_tiles_filtered_decoded_features/ \
    --output-spatialdata proseg.zarr \
    --output-cell-polygons cell-polygons.geojson.gz \
    --output-counts counts.mtx.gz \
    /media/dps/data/qi2lab/20251028_bartelle_smFISH_mm_microglia_newbuffers/qi2labdatastore/all_tiles_filtered_decoded_features/decoded_features.csv.gz
```
This gives the following output:
```
Using 32 threads
Finished reading input
Read dataset:
1975993 transcripts
730 cells
16 genes
1 fovs
00:00:00 ############################################################ | log-likelihood: -1938634.3 | assigned: 1940371 / 1975993 (98.20%) | non-background: (89.21%)
```
The .mtx.gz file is generated pretty much right away. We've let the code sit for a few hours now and it has not created the spatial-data structure.
This is on a server with 32 cores and 1 TB of RAM. Proseg appears to be using about 30-40 GB of RAM; I didn't do a full profile.
Is this expected behavior? If so, any hints to speed up the process? We can sacrifice output resolution to get to a final data structure faster.
Thanks!
Ah - I think we figured it out. There were some erroneous cell_id entries for transcripts outside of cells, which was probably throwing off the algorithm.
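In case it helps anyone else hitting this, a pre-flight check along these lines (my own sketch, with hypothetical column names matching the command above) can catch stray cell_id values before a run:

```python
def sanitize_cell_ids(rows, valid_cell_ids, unassigned="-1"):
    """Replace any cell_id that isn't in the known segmentation with the
    unassigned sentinel, so proseg never sees stale/erroneous ids."""
    cleaned = []
    for row in rows:
        if row["cell_id"] not in valid_cell_ids:
            row = {**row, "cell_id": unassigned}
        cleaned.append(row)
    return cleaned

# Example: cell 99 doesn't exist in the segmentation, so it is reset to -1.
rows = [
    {"gene_id": "Gad1", "cell_id": "17"},
    {"gene_id": "Aqp4", "cell_id": "99"},
]
print(sanitize_cell_ids(rows, valid_cell_ids={"17"}))
```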