Proof-of-concept: Cellpose distributed on Slurm cluster with AMD GPUs
Hi,
for a project of mine I needed to scale cellpose on a SLURM cluster. To make things a little more interesting, the cluster I have at hand has only AMD GPUs. The documentation on distributed cellpose gives hints on how to run on LSF clusters, and there is already some documentation on how to run cellpose on AMD GPUs.
The first contribution of this PR is a conda environment file (environment-rocm.yaml) that works for inference on AMD GPUs. I am happy to update the install documentation accordingly.
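For anyone who wants to verify such an environment before running cellpose, a minimal sketch of a backend check could look like the following. The function name is my own and not part of this PR. It relies only on `torch.version.hip` (set on ROCm builds of PyTorch, `None` otherwise) and on ROCm builds exposing AMD GPUs through the `torch.cuda` namespace.

```python
def describe_torch_gpu_backend() -> str:
    """Return a one-line summary of the GPU backend PyTorch was built for.

    Hypothetical helper for illustration; not part of cellpose or this PR.
    """
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"

    # torch.version.hip is a version string on ROCm builds and None otherwise.
    hip_version = getattr(torch.version, "hip", None)
    backend = f"ROCm/HIP {hip_version}" if hip_version else "non-ROCm build"

    # ROCm builds expose AMD GPUs through the torch.cuda namespace.
    if torch.cuda.is_available():
        return f"{backend}, GPU visible: {torch.cuda.get_device_name(0)}"
    return f"{backend}, no GPU visible"


if __name__ == "__main__":
    print(describe_torch_gpu_backend())
```

Running this on a compute node (not the login node) is the more informative check, since login nodes often have no GPU attached.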
The second contribution is a medium-sized test case for a SLURM cluster (~~cellpose/contrib/test_slurm.py~~cellpose/contrib/cluster_script.py). ~~The example data is not special by any means - and not working particularly well with cellposeSAM, if someone has a hint on a nice (1024 x 1024 x 1024 px ) dataset which is worth highlighting in the cellpose distributed documentation I am open for suggestions.~~ My hope is that the test can serve as a reference for checking distributed cellpose on different clusters before users try to run cellpose on their own data.
Lastly, I modified cellpose/contrib/distributed_segmentation.py so that it now works in my setup. Note that there are two things left to be done:
~~1. the code still needs some clean-up after my initial tests with cropping/ transposing~~
2. In its current form, the PR will break the functionality of the janeliaLSFCluster class due to a missing abstraction in distributed_eval with respect to mem, cores, and ncpus.
3. Scaling the cluster to 0 workers, changing the worker config, and rescaling did not work for me. I am happy to run further tests, but I would need some assistance with dask debugging.
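To make the missing abstraction in item 2 more concrete, here is a rough sketch of one possible direction, not what the PR currently implements. `WorkerResources` and `cluster_kwargs` are hypothetical names. The keyword names follow my understanding of dask-jobqueue, where both `SLURMCluster` and `LSFCluster` accept `cores` and `memory`, and `LSFCluster` additionally accepts `ncpus` and `mem` (in bytes). Please correct me if the cellpose cluster classes expect something different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerResources:
    """Scheduler-agnostic description of one worker job (hypothetical)."""

    cores: int
    memory_gb: int


def cluster_kwargs(resources: WorkerResources, scheduler: str) -> dict:
    """Translate generic resources into scheduler-specific keyword arguments.

    Assumption: the returned dict would be passed on to the matching
    dask-jobqueue cluster class by the caller.
    """
    common = {"cores": resources.cores, "memory": f"{resources.memory_gb}GB"}
    if scheduler == "slurm":
        return common
    if scheduler == "lsf":
        # LSF-specific request fields, derived from the same generic values.
        return {
            **common,
            "ncpus": resources.cores,
            "mem": resources.memory_gb * 10**9,  # bytes
        }
    raise ValueError(f"unsupported scheduler: {scheduler!r}")


if __name__ == "__main__":
    res = WorkerResources(cores=4, memory_gb=16)
    print(cluster_kwargs(res, "slurm"))
    print(cluster_kwargs(res, "lsf"))
```

With something like this, distributed_eval would only need to know about the generic resource description, and each cluster class could pick the fields it understands.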
I am happy to polish the code and documentation over the next few days. Since I am not really a dask expert, I am very curious to get feedback on my dask usage.
Best wishes, Eric
fixes #1111
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 42.29%. Comparing base (bf958cb) to head (3af4df3).
:warning: Report is 32 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #1334 +/- ##
==========================================
+ Coverage 42.19% 42.29% +0.09%
==========================================
Files 16 16
Lines 3773 3783 +10
==========================================
+ Hits 1592 1600 +8
- Misses 2181 2183 +2
In its current form, python cellpose/contrib/cluster_script.py runs end-to-end (including test data download, segmentation, merging, and saving) in approximately 8 minutes, provided the requested compute resources are immediately available.
Looking forward to feedback!