
contrib: Alternative distributed_segmentation, automated merge/split tools, ellipsoid shape filter

GFleishman opened this issue 2 years ago · 6 comments

All contributions are in their own modules in the contrib folder - no cellpose files were modified.

There is an existing distributed_segmentation in contrib, well written by the knowledgeable @chrisroat. However, for our own work we have favored an alternative implementation which relies less on dask.array, is more permissive about overlap sizes, and provides tools for the user to set up a cluster object on which to run the distributed computation. This implementation is thoroughly tested in our own environment and already integrated into several workflows. We would now like to make it available to external users (within our institute and some abroad as well), but we'd like to do that by wrapping the primary cellpose repository instead of my fork. So now I'm back to see about getting these tools merged.
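To give a sense of the approach without reproducing the module here: the pattern is to submit overlapping blocks to workers through a dask distributed Client rather than building a dask.array graph. A minimal sketch, assuming the cellpose 2.x API and a zarr-backed image; the crop list, paths, and parameters are illustrative:

```python
# A minimal sketch of the block-wise pattern, not the contrib code itself.
# Assumes the cellpose 2.x API and a zarr-backed image; crops, paths, and
# parameters below are illustrative.
import zarr
from dask.distributed import Client
from cellpose import models

def segment_block(zarr_path, crop):
    # Each worker reads its own overlapping block and builds its own model
    image = zarr.open(zarr_path, mode='r')[crop]
    model = models.Cellpose(model_type='nuclei', gpu=True)
    masks, _, _, _ = model.eval(image, diameter=30, do_3D=True)
    return masks

client = Client()  # LocalCluster here; a cluster object built elsewhere works the same way
crops = [(slice(0, 300), slice(0, 300), slice(0, 300)),    # overlapping blocks,
         (slice(250, 550), slice(0, 300), slice(0, 300))]  # 50-voxel overlap
futures = [client.submit(segment_block, '/path/to/image.zarr', c) for c in crops]
block_masks = client.gather(futures)
# Relabeling and stitching the overlapping blocks into one consistent label
# volume is the substantive part of the contrib module; it is omitted here.
```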

Some additional things that have been helpful for us are automated merge and split functions. Our samples typically have ~200K cells, so we cannot QC them by hand. We use size and shape to determine which segments have underperformed and then merge or split them as necessary. These tools are not very sophisticated, but they help more than they hurt.
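Roughly the kind of heuristic we mean, sketched below (the threshold and the watershed-based split here are illustrative, not the contrib functions themselves):

```python
# Illustrative size-based QC, not the contrib implementation; the threshold
# and the watershed split are assumptions.
import numpy as np
from scipy import ndimage
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def remove_small_segments(labels, min_voxels=50):
    """Zero out label ids whose voxel count falls below min_voxels."""
    ids, counts = np.unique(labels, return_counts=True)
    small = ids[(counts < min_voxels) & (ids != 0)]
    out = labels.copy()
    out[np.isin(out, small)] = 0
    return out

def split_segment(labels, label_id, min_distance=10):
    """Split one oversized segment with a distance-transform watershed."""
    mask = labels == label_id
    distance = ndimage.distance_transform_edt(mask)
    peaks = peak_local_max(distance, min_distance=min_distance, num_peaks=2)
    markers = np.zeros(labels.shape, dtype=int)
    for i, peak in enumerate(peaks, start=1):
        markers[tuple(peak)] = i
    return watershed(-distance, markers, mask=mask)
```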

Finally, we are typically segmenting nuclei and like to have some measure of how well the cellpose segments match an elliptical shape, so tools for fitting ellipsoids to a large number of cellpose segments are also included.
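The basic idea is moment-based fitting: eigen-decompose the covariance of each segment's voxel coordinates (a sketch of the idea below; the contrib module may formulate it differently):

```python
# A sketch of moment-based ellipsoid fitting; the contrib module may use a
# different formulation. The shape score at the end is an assumption.
import numpy as np

def fit_ellipsoid(labels, label_id):
    """Center, semi-axis lengths, and axis directions of the best-fit ellipsoid."""
    coords = np.argwhere(labels == label_id).astype(float)
    center = coords.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov((coords - center).T))
    # For a uniform solid ellipsoid, the second moment along a semi-axis of
    # length a is a**2 / 5, so a = sqrt(5 * eigenvalue).
    radii = np.sqrt(5.0 * evals)
    # A simple shape score: segment volume relative to the fitted ellipsoid's
    score = coords.shape[0] / (4.0 / 3.0 * np.pi * np.prod(radii))
    return center, radii, evecs, score
```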

GFleishman avatar Oct 13 '23 20:10 GFleishman

Maybe the only issue to discuss is that I added my own dependencies to the setuptools files - like dask, and my little package for building clusters: ClusterWrap. I've left these in for now, but I'm happy to remove them if that's preferable for merging. I don't really know the right way to handle a contrib module with unique dependencies; I guess they should be optional dependencies, but I don't know how to set that up. If you prefer them formatted that way, I can learn how to do it.
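Something like setuptools extras is what I have in mind, e.g. (a sketch; the grouping and names are just illustrative, not the current cellpose setup.py):

```python
# Sketch of optional dependencies via setuptools extras; the grouping and
# names are illustrative.
from setuptools import setup

setup(
    name="cellpose",
    install_requires=["numpy", "torch"],  # core deps stay required
    extras_require={
        # installed only with: pip install cellpose[distributed]
        "distributed": ["dask", "distributed", "ClusterWrap"],
    },
)
```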

GFleishman avatar Oct 13 '23 20:10 GFleishman

Hi Greg, thanks for the PR. However, we'll be moving all contrib files into a separate repo and we will let you know where that is so you can pull your code into it directly.

marius10p avatar Oct 14 '23 02:10 marius10p

Hi Marius - sounds good, I'll wait until the contrib repo is set up. Is it too much to ask for you to notify me here when that is complete, or is there another way that I can find out when you're done setting up that repo without bothering you?

GFleishman avatar Oct 14 '23 15:10 GFleishman

We'll definitely notify you, thanks Greg.

marius10p avatar Oct 14 '23 15:10 marius10p

@carsen-stringer @marius10p I've done a major refactor of my distributed Cellpose implementation. The current state is far more readable and easier to learn, and it is also functionally superior to my previous implementation in several key ways.

This code is all thoroughly tested on both the cluster and a workstation. Going between the two just requires changing a few parameters; it's very simple. I have a Jupyter notebook with several use-case examples that I'm happy to share if you want to evaluate it yourself.
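For example, the swap looks roughly like this (a sketch assuming dask-jobqueue; the resource values are placeholders):

```python
# Sketch of swapping a LocalCluster for a scheduler-backed cluster; assumes
# dask-jobqueue, with placeholder resource values.
from dask.distributed import Client, LocalCluster

on_cluster = False  # flip this (plus a few resource parameters) to move to HPC
if on_cluster:
    from dask_jobqueue import LSFCluster  # or SGECluster, SLURMCluster, ...
    cluster = LSFCluster(cores=2, memory="16GB", processes=1)
    cluster.scale(jobs=10)
else:
    cluster = LocalCluster(n_workers=4, threads_per_worker=1)

client = Client(cluster)
```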

There are a few desirable things which I intend to add but have not yet included:

  • LocalClusters (workstation runs) currently do not support heterogeneous workers - that is, one or more workers with GPUs alongside others without. This is an essential improvement and I'll get to it eventually.
  • Passing in an already instantiated model instead of creating one on each worker. Currently, it is possible to select from the models already available in Cellpose using the model_type keyword argument, but if the user has their own model stored on disk, say a Cellpose 2.0 model, it's not yet possible to pass that in (see the sketch after this list).
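To make that last point concrete (a sketch assuming the cellpose 2.x model API; the custom model path is a placeholder):

```python
# Assumes the cellpose 2.x model API; the custom model path is a placeholder.
from cellpose import models

# Works today: the distributed wrapper forwards model_type to each worker
model = models.Cellpose(model_type='cyto2')

# Not yet supported through the wrapper: a user-trained model loaded from disk
custom = models.CellposeModel(pretrained_model='/path/to/my_trained_model')
```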

GFleishman avatar May 22 '24 22:05 GFleishman

Hi Greg, I am working on running cellpose on some large (TB-scale) lightsheet data and came across your dask implementation here. I'm working with an SGE cluster and have tried modifying the script to handle this, but I'm wondering if you'd be willing to share your Jupyter notebook / use-case example code? I'm running into some issues that I believe are related to the dask client/scheduler setup, but I want to rule out other parts of my implementation causing them. Thanks!
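For context, what I'm attempting follows the standard dask-jobqueue pattern, roughly like the sketch below (queue and resource values are placeholders, not my exact script):

```python
# Rough shape of an SGE client/scheduler setup with dask-jobqueue; the queue
# name and resource values are placeholders.
from dask.distributed import Client
from dask_jobqueue import SGECluster

cluster = SGECluster(
    queue="all.q",        # placeholder queue name
    cores=2,
    memory="16GB",
    walltime="02:00:00",
)
cluster.scale(jobs=20)
client = Client(cluster)
print(client.dashboard_link)  # check that workers actually register
```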

vbrow29 avatar Aug 23 '24 20:08 vbrow29