contrib: Alternative distributed_segmentation, automated merge/split tools, ellipsoid shape filter
All contributions are in their own modules in the contrib folder - no cellpose files were modified.
There is an existing distributed_segmentation module in contrib, well written by the knowledgeable @chrisroat. However, for our own work we have favored an alternative implementation which relies less on dask.array, is more permissive about overlap sizes, and provides tools for the user to set up a cluster object on which to run the distributed computation. This implementation is thoroughly tested in our own environment and already integrated into several workflows. We would now like to make it available to external users (within our institute and some abroad as well), but we'd like to do that by wrapping the primary cellpose repository instead of my fork. So I'm back to see about getting these tools merged.
Some additional things that have been helpful for us are automated merge and split functions. Our samples typically have ~200K cells so we cannot QC them by hand. We use size and shape to determine which segments have underperformed and then merge or split them as necessary. These tools are not very sophisticated but they do help more than they hurt.
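The actual merge/split heuristics live in the contrib module; as a rough sketch of the size-based flagging idea only (the function name and thresholds below are made up for this example and are not the contrib API):

```python
# Hypothetical sketch: flag labels whose voxel counts fall outside a
# plausible nucleus size range, as candidates for merging or splitting.
import numpy as np

def flag_by_size(segmentation, min_voxels=500, max_voxels=50000):
    """Return label ids that look suspiciously small or large."""
    labels, counts = np.unique(segmentation, return_counts=True)
    mask = labels != 0                      # drop the background label
    labels, counts = labels[mask], counts[mask]
    too_small = labels[counts < min_voxels]  # merge candidates
    too_large = labels[counts > max_voxels]  # split candidates
    return too_small, too_large
```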
Finally, we are typically segmenting nuclei and like to have some measure of how well the cellpose segments match an ellipsoidal shape - so some tools for fitting ellipsoids to a large number of cellpose segments are also included.
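Moment matching is one natural way to do such a fit; here is a minimal sketch of that idea for a single label (the function name and details are illustrative, not the contrib code):

```python
# Sketch: fit an ellipsoid to one segment by matching second moments.
import numpy as np

def fit_ellipsoid(segmentation, label):
    """Return center, semi-axis lengths, and axes of the moment-matched ellipsoid."""
    coords = np.argwhere(segmentation == label).astype(float)  # (N, 3) voxel coords
    center = coords.mean(axis=0)
    cov = np.cov((coords - center).T)                          # 3x3 covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    # For a solid uniform ellipsoid, variance along a principal axis is a**2 / 5.
    semi_axes = np.sqrt(5.0 * eigvals)
    return center, semi_axes, eigvecs
```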
Maybe the only issue to discuss is that I added my own dependencies to the setuptools files - like dask, and my little package for building clusters: ClusterWrap. I've left these in for now, but I'm happy to remove them if that's preferable for merging. I don't really know the right way to handle dependencies that are unique to a contrib module. I guess they should be optional dependencies, but I don't know how to set that up. If you prefer them formatted that way, I can learn how to do it.
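For reference, setuptools can mark such contrib-only requirements as optional via extras_require; a sketch, assuming an extra named "distributed" (the name and exact package list are just illustrative):

```python
# setup.py sketch: contrib-only dependencies grouped under an optional extra,
# installed with e.g. `pip install cellpose[distributed]`
from setuptools import setup

setup(
    # ... existing cellpose metadata and install_requires ...
    extras_require={
        "distributed": ["dask", "dask_jobqueue", "zarr", "ClusterWrap"],
    },
)
```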
Hi Greg, thanks for the PR. However, we'll be moving all contrib files into a separate repo and we will let you know where that is so you can pull your code into it directly.
Hi Marius - sounds good, I'll wait until the contrib repo is set up. Is it too much to ask for you to notify me here when that is complete, or is there another way that I can find out when you're done setting up that repo without bothering you?
We'll definitely notify you, thanks Greg.
@carsen-stringer @marius10p I've done a major refactor of my distributed Cellpose implementation. The current state is far more readable and easy to learn. It is also functionally superior to my previous implementation in several key ways:
- Dask cluster objects are now defined inside the distributed module (see the cluster-setup sketch after this list). This is better for long-term maintenance and removes a dependency.
- New dependencies are very minimal: yaml is so stable it has been considered for inclusion in the stdlib; zarr is an essential big-data tool that you can't really get away without; and dask_jobqueue enables easy interfacing with various cluster managers (such as the LSF cluster at Janelia).
- The main functions have been refactored and are much more digestible. If you are learning the code, the first place to look should be the function which runs on each block, as it will be the most familiar. This function can be run independently of the distributed function, so you can see/test what will happen to each block before you distribute over a large dataset.
- Multithreading is now supported even for LocalCluster (workstation) runs.
- All functions have thorough docstrings. The most important ones are the function that runs on each block and the function you would call to distribute Cellpose over a large image.
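For context, a minimal sketch of constructing the two kinds of dask cluster objects mentioned above (parameter values are arbitrary examples, not the module's defaults):

```python
# Workstation: a LocalCluster with multithreaded workers
from distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=2)
client = Client(cluster)

# HPC: an LSF cluster via dask_jobqueue (e.g. the Janelia cluster)
from dask_jobqueue import LSFCluster

cluster = LSFCluster(cores=2, memory="16GB", walltime="1:00")
cluster.scale(jobs=32)  # request 32 worker jobs from the scheduler
client = Client(cluster)
```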
This code is all thoroughly tested on both the cluster and a workstation. Going between the cluster and a workstation, one just needs to change a few parameters; it's very simple. I have a Jupyter notebook with several use case examples that I'm happy to share if you want to evaluate it yourself.
There are a few desirable things which I intend to add but have not yet included:
- LocalClusters (workstation runs) currently do not support adding heterogeneous workers - that is, one or more workers which include GPUs and other workers which do not. This is an essential improvement and I'll get to it eventually.
- Passing in an already instantiated model instead of creating one on each worker. Currently, it is possible to select from the models already available in Cellpose using the model_type keyword argument, but if the user has their own model stored on disk, say a Cellpose 2.0 model, it's not yet possible to pass that in (see the sketch after this list).
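If I have the cellpose API right, the distinction maps roughly onto the two entry points below (the model path is a placeholder):

```python
from cellpose import models

# Built-in model selected by name -- what the distributed module
# currently supports through its model_type keyword argument.
model = models.Cellpose(gpu=True, model_type="nuclei")

# User-trained model loaded from disk (e.g. a Cellpose 2.0 model) --
# not yet something the distributed wrapper can accept.
custom = models.CellposeModel(gpu=True, pretrained_model="/path/to/my_model")
```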
Hi Greg, I am working on running cellpose on some large (TB-scale) lightsheet data and came across your dask implementation here. I'm working with an SGE cluster and have tried modifying the script to handle this, but I'm wondering if you'd be willing to share your Jupyter notebook / use case example code? I'm running into some issues that I believe are related to the dask client/scheduler setup, but I want to rule out any other part of my implementation causing the issue. Thanks!
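A minimal dask_jobqueue client/scheduler setup for an SGE queue looks roughly like the sketch below; the queue name and resource values are placeholders and depend entirely on the local scheduler configuration.

```python
# Sketch of a dask client/scheduler setup on SGE via dask_jobqueue;
# queue, cores, memory, and walltime are placeholders for local settings.
from dask_jobqueue import SGECluster
from distributed import Client

cluster = SGECluster(queue="all.q", cores=2, memory="16GB", walltime="01:00:00")
cluster.scale(jobs=8)   # request 8 worker jobs from the scheduler
client = Client(cluster)
```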