Proof-of-concept: Cellpose distributed on Slurm cluster with AMD GPUs

Open erjel opened this issue 3 months ago • 2 comments

Hi,

For a project of mine, I needed to scale cellpose on a SLURM cluster. To make things a little more interesting, the cluster I have access to has only AMD GPUs. The documentation on distributed cellpose gives hints on how to run on LSF clusters, and there is already some documentation on how to run cellpose on AMD GPUs.

The first contribution of this PR is a conda environment file (environment-rocm.yaml) that works for inference on AMD GPUs. I am happy to update the install documentation accordingly.
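For anyone trying the ROCm environment, a quick sanity check along these lines may help confirm that PyTorch actually sees the AMD GPU before launching cellpose. This is a minimal sketch and not part of the PR: the helper name `describe_torch_backend` is hypothetical, and it relies on ROCm builds of PyTorch exposing AMD devices through the `torch.cuda` API and setting `torch.version.hip`.

```python
def describe_torch_backend():
    """Return a short description of the GPU backend PyTorch can use.

    Hypothetical helper for illustration; it is not part of cellpose.
    """
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"

    # On ROCm builds torch.version.hip is a version string; otherwise it is None.
    hip_version = getattr(torch.version, "hip", None)

    # ROCm builds reuse the torch.cuda namespace for AMD GPUs.
    if not torch.cuda.is_available():
        return "PyTorch is installed, but no GPU is visible"

    name = torch.cuda.get_device_name(0)
    if hip_version:
        backend = f"ROCm/HIP {hip_version}"
    else:
        backend = f"CUDA {torch.version.cuda}"
    return f"{backend}: {name}"


if __name__ == "__main__":
    print(describe_torch_backend())
```

If the message reports a CUDA build or no visible GPU on an AMD node, the environment most likely pulled in a non-ROCm PyTorch wheel.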

The second contribution is a medium-sized test case for a SLURM cluster (~~cellpose/contrib/test_slurm.py~~ cellpose/contrib/cluster_script.py). ~~The example data is not special by any means, and it does not work particularly well with cellposeSAM. If someone has a hint on a nice (1024 x 1024 x 1024 px) dataset which is worth highlighting in the cellpose distributed documentation, I am open to suggestions.~~ My hope is that the test can serve as a reference for checking distributed cellpose on different clusters before users try to run cellpose with their own data.

Lastly, I modified cellpose/contrib/distributed_segmentation.py so that it now works for my setup. Note that there are two things left to be done:

1. ~~The code still needs some clean-up after my initial tests with cropping/transposing.~~
2. In its current form, the PR will break the functionality of the janeliaLSFCluster class because distributed_eval lacks an abstraction over the mem, cores, and ncpus arguments.
3. Scaling the cluster to 0 workers, changing the worker config, and rescaling did not work for me. I am happy to run further tests, but I would need some assistance with dask debugging.
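One possible shape for the missing abstraction in item 2 is a small translation layer that turns scheduler-neutral resource requests into keyword arguments for each cluster backend. The sketch below is only an illustration under assumptions: `worker_kwargs` is a hypothetical helper that does not exist in cellpose, and the keyword names follow my understanding of `dask_jobqueue`, where both `SLURMCluster` and `LSFCluster` accept `cores`, `memory`, and `processes`, and `LSFCluster` additionally accepts `ncpus` and `mem` (in bytes).

```python
def worker_kwargs(scheduler, cores, memory_gb):
    """Translate generic per-worker resources into backend-specific kwargs.

    Hypothetical helper for illustration; not part of cellpose or dask_jobqueue.
    """
    # Arguments shared by both backends (assumed from dask_jobqueue).
    common = {
        "cores": cores,
        "memory": f"{memory_gb}GB",
        "processes": 1,
    }
    if scheduler == "slurm":
        return common
    if scheduler == "lsf":
        # LSF-only arguments: explicit CPU count and memory in bytes.
        return {**common, "ncpus": cores, "mem": memory_gb * 10**9}
    raise ValueError(f"unsupported scheduler: {scheduler!r}")


# Intended usage (not executed here, since it needs dask_jobqueue and a cluster):
#   from dask_jobqueue import SLURMCluster
#   cluster = SLURMCluster(**worker_kwargs("slurm", cores=4, memory_gb=16))
if __name__ == "__main__":
    print(worker_kwargs("slurm", cores=4, memory_gb=16))
    print(worker_kwargs("lsf", cores=4, memory_gb=16))
```

Keeping the scheduler-specific keys in one place would let distributed_eval stay agnostic about which cluster class it receives, so the LSF path would not need to change when SLURM support is added.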

I am happy to polish the code and documentation over the next few days. Since I am not really a dask expert, I am very curious to hear feedback on my dask usage.

Best wishes, Eric

fixes #1111

erjel avatar Sep 29 '25 15:09 erjel

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 42.29%. Comparing base (bf958cb) to head (3af4df3). :warning: Report is 32 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1334      +/-   ##
==========================================
+ Coverage   42.19%   42.29%   +0.09%     
==========================================
  Files          16       16              
  Lines        3773     3783      +10     
==========================================
+ Hits         1592     1600       +8     
- Misses       2181     2183       +2     

codecov[bot] avatar Sep 30 '25 18:09 codecov[bot]

In its current form, python cellpose/contrib/cluster_script.py runs end-to-end (including test data download, segmentation, merging, and saving) in approximately 8 minutes, provided the requested compute resources are immediately available.

Looking forward to feedback!

erjel avatar Oct 01 '25 11:10 erjel