
How to specify device on A100 that is MIGed [INSTALL]

Open · derekthirstrup opened this issue 1 year ago · 1 comment

Install problem: I am getting an error when trying to run a script on our A100 Linux nodes, whose GPUs have been partitioned with MIG. Before executing the script I ran

    export CUDA_VISIBLE_DEVICES='MIG-849e08aa-b1bd-5744-babc-e89e41b926b4'

to select a GPU instance, but Cellpose still does not find a CUDA device. What is the recommended way to specify the CUDA device ID so that the GPU is used? The same script works fine on a Windows workstation with an RTX 4090, so the issue seems specific to the A100 MIG configuration.
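One quick sanity check (a sketch I'd run first, not from the original report): confirm the environment variable is actually reaching the Python process and that PyTorch enumerates at least one device behind it. Note that CUDA_VISIBLE_DEVICES must be set before the CUDA runtime initializes, i.e. before the first CUDA call in the process.

```python
import os

import torch

# Print what the process actually sees. Under MIG, the single exposed
# compute instance should appear as one visible device (cuda:0).
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("torch.cuda.device_count() =", torch.cuda.device_count())
```

If device_count() prints 0 here, the problem is upstream of Cellpose (driver, MIG setup, or the PyTorch/CUDA build), which matches the traceback below.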

Environment info

packages in environment at /home/derekt/miniconda3/envs/cellpose:

Name Version Build Channel

_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
aicsimageio 4.14.0 pypi_0 pypi
aiobotocore 2.5.4 pypi_0 pypi
aiohttp 3.9.3 pypi_0 pypi
aioitertools 0.11.0 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
annotated-types 0.6.0 pypi_0 pypi
asciitree 0.3.3 pypi_0 pypi
async-timeout 4.0.3 pypi_0 pypi
attrs 23.2.0 pypi_0 pypi
blas 1.0 mkl
botocore 1.31.17 pypi_0 pypi
bzip2 1.0.8 h5eee18b_5
ca-certificates 2024.3.11 h06a4308_0
cellpose 3.0.7 pypi_0 pypi
certifi 2024.2.2 pypi_0 pypi
cffi 1.16.0 py310h5eee18b_0
charset-normalizer 3.3.2 pypi_0 pypi
click 8.1.7 pypi_0 pypi
cloudpickle 3.0.0 pypi_0 pypi
cuda 11.6.1 0 nvidia
cuda-cccl 11.6.55 hf6102b2_0 nvidia
cuda-command-line-tools 11.6.2 0 nvidia
cuda-compiler 11.6.2 0 nvidia
cuda-cudart 11.6.55 he381448_0 nvidia
cuda-cudart-dev 11.6.55 h42ad0f4_0 nvidia
cuda-cuobjdump 11.6.124 h2eeebcb_0 nvidia
cuda-cupti 11.6.124 h86345e5_0 nvidia
cuda-cuxxfilt 11.6.124 hecbf4f6_0 nvidia
cuda-driver-dev 11.6.55 0 nvidia
cuda-gdb 12.4.127 0 nvidia
cuda-libraries 11.6.1 0 nvidia
cuda-libraries-dev 11.6.1 0 nvidia
cuda-memcheck 11.8.86 0 nvidia
cuda-nsight 12.4.127 0 nvidia
cuda-nsight-compute 12.4.1 0 nvidia
cuda-nvcc 11.6.124 hbba6d2d_0 nvidia
cuda-nvdisasm 12.4.127 0 nvidia
cuda-nvml-dev 11.6.55 haa9ef22_0 nvidia
cuda-nvprof 12.4.127 0 nvidia
cuda-nvprune 11.6.124 he22ec0a_0 nvidia
cuda-nvrtc 11.6.124 h020bade_0 nvidia
cuda-nvrtc-dev 11.6.124 h249d397_0 nvidia
cuda-nvtx 11.6.124 h0630a44_0 nvidia
cuda-nvvp 12.4.127 0 nvidia
cuda-opencl 12.4.99 0 nvidia
cuda-runtime 11.6.1 0 nvidia
cuda-samples 11.6.101 h8efea70_0 nvidia
cuda-sanitizer-api 12.4.127 0 nvidia
cuda-toolkit 11.6.1 0 nvidia
cuda-tools 11.6.1 0 nvidia
cuda-visual-tools 11.6.1 0 nvidia
cudatoolkit 11.6.0 habf752d_9 nvidia
dask 2024.4.1 pypi_0 pypi
distributed 2024.4.1 pypi_0 pypi
elementpath 4.4.0 pypi_0 pypi
expat 2.5.0 h6a678d5_0
fasteners 0.19 pypi_0 pypi
fastremap 1.14.1 pypi_0 pypi
filelock 3.13.1 py310h06a4308_0
fire 0.6.0 pypi_0 pypi
frozenlist 1.4.1 pypi_0 pypi
fsspec 2023.6.0 pypi_0 pypi
gds-tools 1.9.0.20 0 nvidia
gmp 6.2.1 h295c915_3
gmpy2 2.1.2 py310heeb90bb_0
idna 3.6 pypi_0 pypi
imagecodecs 2024.1.1 pypi_0 pypi
imageio 2.34.0 pypi_0 pypi
importlib-metadata 7.1.0 pypi_0 pypi
intel-openmp 2023.1.0 hdb19cb5_46306
jinja2 3.1.3 py310h06a4308_0
jmespath 1.0.1 pypi_0 pypi
lazy-loader 0.4 pypi_0 pypi
ld_impl_linux-64 2.38 h1181459_1
libcublas 11.9.2.110 h5e84587_0 nvidia
libcublas-dev 11.9.2.110 h5c901ab_0 nvidia
libcufft 10.7.1.112 hf425ae0_0 nvidia
libcufft-dev 10.7.1.112 ha5ce4c0_0 nvidia
libcufile 1.9.0.20 0 nvidia
libcufile-dev 1.9.0.20 0 nvidia
libcurand 10.3.5.119 0 nvidia
libcurand-dev 10.3.5.119 0 nvidia
libcusolver 11.3.4.124 h33c3c4e_0 nvidia
libcusparse 11.7.2.124 h7538f96_0 nvidia
libcusparse-dev 11.7.2.124 hbbe9722_0 nvidia
libffi 3.4.4 h6a678d5_0
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libnpp 11.6.3.124 hd2722f0_0 nvidia
libnpp-dev 11.6.3.124 h3c42840_0 nvidia
libnvjitlink 12.1.105 0 nvidia
libnvjpeg 11.6.2.124 hd473ad6_0 nvidia
libnvjpeg-dev 11.6.2.124 hb5906b9_0 nvidia
libprotobuf 3.20.3 he621ea3_0
libstdcxx-ng 11.2.0 h1234567_1
libuuid 1.41.5 h5eee18b_0
llvm-openmp 14.0.6 h9e868ea_0
llvmlite 0.42.0 pypi_0 pypi
locket 1.0.0 pypi_0 pypi
lxml 4.9.4 pypi_0 pypi
markupsafe 2.1.3 py310h5eee18b_0
mkl 2023.1.0 h213fc3f_46344
mkl-service 2.4.0 py310h5eee18b_1
mkl_fft 1.3.8 py310h5eee18b_0
mkl_random 1.2.4 py310hdb19cb5_0
mpc 1.1.0 h10f8cd9_1
mpfr 4.0.2 hb69a4c5_1
mpmath 1.3.0 py310h06a4308_0
msgpack 1.0.8 pypi_0 pypi
multidict 6.0.5 pypi_0 pypi
natsort 8.4.0 pypi_0 pypi
ncurses 6.4 h6a678d5_0
networkx 3.1 py310h06a4308_0
ninja 1.10.2 h06a4308_5
ninja-base 1.10.2 hd09550d_5
nsight-compute 2024.1.1.4 0 nvidia
numba 0.59.1 pypi_0 pypi
numcodecs 0.12.1 pypi_0 pypi
numpy 1.26.4 py310h5f9d8c6_0
numpy-base 1.26.4 py310hb5e798b_0
ome-types 0.5.1.post1 pypi_0 pypi
ome-zarr 0.8.3 pypi_0 pypi
opencv-python-headless 4.9.0.80 pypi_0 pypi
openssl 3.0.13 h7f8727e_0
packaging 24.0 pypi_0 pypi
pandas 2.2.1 pypi_0 pypi
partd 1.4.1 pypi_0 pypi
pillow 10.3.0 pypi_0 pypi
pip 23.3.1 py310h06a4308_0
psutil 5.9.8 pypi_0 pypi
pycparser 2.21 pyhd3eb1b0_0
pydantic 2.6.4 pypi_0 pypi
pydantic-compat 0.1.2 pypi_0 pypi
pydantic-core 2.16.3 pypi_0 pypi
python 3.10.14 h955ad1f_0
python-dateutil 2.9.0.post0 pypi_0 pypi
pytorch 1.13.0 py3.10_cuda11.6_cudnn8.3.2_0 pytorch
pytorch-cuda 11.6 h867d48c_1 pytorch
pytorch-mutex 1.0 cuda pytorch
pytz 2024.1 pypi_0 pypi
pyyaml 6.0.1 py310h5eee18b_0
readline 8.2 h5eee18b_0
requests 2.31.0 pypi_0 pypi
resource-backed-dask-array 0.1.0 pypi_0 pypi
roifile 2024.3.20 pypi_0 pypi
s3fs 2023.6.0 pypi_0 pypi
scikit-image 0.22.0 pypi_0 pypi
scipy 1.13.0 pypi_0 pypi
setuptools 68.2.2 py310h06a4308_0
six 1.16.0 pypi_0 pypi
sortedcontainers 2.4.0 pypi_0 pypi
sqlite 3.41.2 h5eee18b_0
sympy 1.12 py310h06a4308_0
tbb 2021.8.0 hdb19cb5_0
tblib 3.0.0 pypi_0 pypi
termcolor 2.4.0 pypi_0 pypi
tifffile 2023.2.28 pypi_0 pypi
tk 8.6.12 h1ccaba5_0
toolz 0.12.1 pypi_0 pypi
tornado 6.4 pypi_0 pypi
tqdm 4.66.2 pypi_0 pypi
typing_extensions 4.9.0 py310h06a4308_1
tzdata 2024.1 pypi_0 pypi
urllib3 1.26.18 pypi_0 pypi
wheel 0.41.2 py310h06a4308_0
wrapt 1.16.0 pypi_0 pypi
xarray 2024.3.0 pypi_0 pypi
xmlschema 3.2.0 pypi_0 pypi
xsdata 24.3.1 pypi_0 pypi
xz 5.4.6 h5eee18b_0
yaml 0.2.5 h7b6447c_0
yarl 1.9.4 pypi_0 pypi
zarr 2.15.0 pypi_0 pypi
zict 3.0.0 pypi_0 pypi
zipp 3.18.1 pypi_0 pypi
zlib 1.2.13 h5eee18b_0

Traceback (most recent call last):
  File "/allen/aics/microscopy/ClusterOutput/ProcessingScripts/denoise_3D_timelapse.py", line 152, in <module>
    fire.Fire(main)
  File "/home/derekt/miniconda3/envs/cellpose/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/derekt/miniconda3/envs/cellpose/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/derekt/miniconda3/envs/cellpose/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/allen/aics/microscopy/ClusterOutput/ProcessingScripts/denoise_3D_timelapse.py", line 149, in main
    denoise_images_from_csv(csv_file_path, output_path, model_params, eval_params, channel_to_process, max_workers)
  File "/allen/aics/microscopy/ClusterOutput/ProcessingScripts/denoise_3D_timelapse.py", line 124, in denoise_images_from_csv
    denoise_model = DenoiseModel(**model_params)
  File "/home/derekt/miniconda3/envs/cellpose/lib/python3.10/site-packages/cellpose/denoise.py", line 643, in __init__
    self.net.load_model(self.pretrained_model, device=self.device)
  File "/home/derekt/miniconda3/envs/cellpose/lib/python3.10/site-packages/cellpose/resnet_torch.py", line 294, in load_model
    state_dict = torch.load(filename, map_location=device)
  File "/home/derekt/miniconda3/envs/cellpose/lib/python3.10/site-packages/torch/serialization.py", line 789, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/home/derekt/miniconda3/envs/cellpose/lib/python3.10/site-packages/torch/serialization.py", line 1131, in _load
    result = unpickler.load()
  File "/home/derekt/miniconda3/envs/cellpose/lib/python3.10/site-packages/torch/serialization.py", line 1101, in persistent_load
    load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/home/derekt/miniconda3/envs/cellpose/lib/python3.10/site-packages/torch/serialization.py", line 1083, in load_tensor
    wrap_storage=restore_location(storage, location),
  File "/home/derekt/miniconda3/envs/cellpose/lib/python3.10/site-packages/torch/serialization.py", line 1055, in restore_location
    return default_restore_location(storage, str(map_location))
  File "/home/derekt/miniconda3/envs/cellpose/lib/python3.10/site-packages/torch/serialization.py", line 215, in default_restore_location
    result = fn(storage, location)
  File "/home/derekt/miniconda3/envs/cellpose/lib/python3.10/site-packages/torch/serialization.py", line 182, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/home/derekt/miniconda3/envs/cellpose/lib/python3.10/site-packages/torch/serialization.py", line 173, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on CUDA device '
RuntimeError: Attempting to deserialize object on CUDA device 0 but torch.cuda.device_count() is 0. Please use torch.load with map_location to map your storages to an existing device.
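For reference, the workaround the error message itself suggests can be sketched like this (using a throwaway checkpoint as a stand-in for a real Cellpose model file):

```python
import os
import tempfile

import torch

# The error occurs because the checkpoint's tensors were saved on a CUDA
# device, but torch.cuda.device_count() is 0 in this process. Passing
# map_location remaps the saved storages onto a device that exists.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
torch.save({"w": torch.ones(3)}, path)  # stand-in for a model file

state = torch.load(path, map_location=device)
print(state["w"].device.type)
```

Note that the traceback shows Cellpose already passes device= through to torch.load, so the real question is why the device Cellpose selected does not exist in this process, i.e. why device_count() is 0.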

derekthirstrup avatar Apr 05 '24 18:04 derekthirstrup

Cellpose uses PyTorch's defaults to find available GPUs. If you can successfully run a script similar to:

import torch

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(torch.cuda.current_device())
    print(f"GPU available: {gpu_name}")
else:
    print("No GPU available.")

then Cellpose should work as well. If the script above doesn't work, the issue could be with your PyTorch/CUDA installation.

mrariden avatar Apr 17 '24 20:04 mrariden

Also, I think you want to specify your device manually via the API rather than the CLI:

model = models.CellposeModel(device=torch.device("..."), model_type="...", gpu=True)

Closing for now, but let us know if this doesn't solve it.

carsen-stringer avatar Sep 11 '24 07:09 carsen-stringer