TensorFlow run from CaImAn fails to find a CUDA device, yet TensorFlow from the CLI finds it.

Open pbl007 opened this issue 1 year ago • 16 comments

For better support, please use the template below to submit your issue. When your issue gets resolved please remember to close it.

Sometimes errors while running CNMF occur during parallel processing, which prevents the log from providing a meaningful error message. Please reproduce your error with dview=None.

If you need to upgrade CaImAn follow the instructions given in the documentation.

  • Tell us a bit about your setup:
  1. Operating system (Linux/macOS/Windows):

  2. Python version (3.x): Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0] on linux

  3. Working environment (Python IDE/Jupyter Notebook/other): Python

  4. Which of the demo scripts you're using for your analysis (if applicable): demo_caiman_basic.py

  5. CaImAn version*:

  6. CaImAn installation process (pip install ./pip install -e ./conda): mamba, not from source

*You can get the CaImAn version by creating a params object and then typing params.data['caiman_version']. (If the field doesn't exist, type N/A and consider upgrading.)

Not sure how to do this... yet.
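For readers who are also unsure how to do this, here is a minimal sketch of the lookup the template describes. The import path `caiman.source_extraction.cnmf.params.CNMFParams` and the metadata fallback are assumptions about the installed release, not something stated in this thread, and `get_caiman_version` is a hypothetical helper name.

```python
from importlib import metadata


def get_caiman_version() -> str:
    """Return the installed CaImAn version, or "N/A" if it cannot be determined."""
    try:
        # Route described in the issue template: build a params object and
        # read the version field from its `data` group.
        # (Import path is an assumption; it may differ between releases.)
        from caiman.source_extraction.cnmf.params import CNMFParams
        return str(CNMFParams().data["caiman_version"])
    except Exception:
        # CaImAn is not importable here, or the field is absent in this release.
        pass
    try:
        # Fallback: ask the package metadata for the installed version.
        return metadata.version("caiman")
    except metadata.PackageNotFoundError:
        return "N/A"


if __name__ == "__main__":
    print(get_caiman_version())
```

Run it inside the same conda environment used for the demo; if it prints N/A, the template's advice to consider upgrading applies.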

  • Describe the issue that you are experiencing

  • Copy error log below

pb@claustrum ~/c/d/general> python demo_caiman_basic.py (caiman)
6993 [params.py: check_consistency():919][230492] Changing rf from 10 to 26 because the constraint rf > gSiz was not satisfied.
USING MODEL:/home/pb/caiman_data/model/cnn_model.json
2022-07-06 23:33:12.908407: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-07-06 23:33:12.908450: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: claustrum
2022-07-06 23:33:12.908456: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: claustrum
2022-07-06 23:33:12.908607: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 510.54.0
2022-07-06 23:33:12.908629: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 510.54.0
2022-07-06 23:33:12.908635: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 510.54.0
2022-07-06 23:33:12.913900: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2/2 [==============================] - 1s 14ms/step
Component:0

pb@claustrum ~/c/d/general> python (caiman)
Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

import tensorflow as tf
print(tf.test.gpu_device_name())
2022-07-06 23:36:46.687058: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-06 23:36:47.596321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /device:GPU:0 with 46689 MB memory: -> device: 0, name: NVIDIA RTX A6000, pci bus id: 0000:a1:00.0, compute capability: 8.6
/device:GPU:0

  • If you're not reporting an error, type your message below

pbl007 avatar Jul 06 '22 20:07 pbl007

Hello, I don't entirely understand the commandline you're using; I can see that this is probably your prompt:

pb@claustrum ~/c/d/general>

But to run python in the caiman environment, are you typing python or python (caiman) ? Or is that a cut'n'paste oddity?

pgunn avatar Jul 07 '22 15:07 pgunn

Hi Thanks for the prompt reply!

Indeed an odd copy/paste

(caiman) is the conda env (displayed by the shell). The command was simply:

python demo_caiman_basic.py

which generated the said tensorflow error. In turn,

python

import tensorflow as tf
print(tf.test.gpu_device_name())

results in

2022-07-06 23:36:46.687058: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-06 23:36:47.596321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /device:GPU:0 with 46689 MB memory: -> device: 0, name: NVIDIA RTX A6000, pci bus id: 0000:a1:00.0, compute capability: 8.6
/device:GPU:0

from the same conda environment.
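To compare the two situations side by side, a small diagnostic like the following can be called from both the interactive session and the end of the demo script. It is only a sketch: `gpu_visibility_report` is a hypothetical helper, and it assumes nothing about CaImAn itself. It reports the `CUDA_VISIBLE_DEVICES` value seen by the current process and, if TensorFlow is installed, the GPUs TensorFlow can enumerate.

```python
import importlib.util
import os


def gpu_visibility_report() -> dict:
    """Collect what the current process is allowed to see in terms of GPUs."""
    report = {
        # A value of "-1" hides every CUDA device from this process.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"),
        # Stays None when TensorFlow is not installed in this environment.
        "tensorflow_gpus": None,
    }
    if importlib.util.find_spec("tensorflow") is not None:
        import tensorflow as tf
        # Ask TensorFlow which GPUs it can enumerate in this process.
        report["tensorflow_gpus"] = [
            device.name for device in tf.config.list_physical_devices("GPU")
        ]
    return report


if __name__ == "__main__":
    for key, value in gpu_visibility_report().items():
        print(f"{key}: {value}")
```

If the variable is unset in the interactive session but reads -1 inside the demo run, something in the demo's code path has masked the GPU for that process.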

Best Pablo

Pablo Blinder, PhD http://pblab.tau.ac.il http://pblab.tau.ac.il/en Neurobiology, Biochemistry and Biophysics School, George S. Wise Faculty of Life Sciences, and Sagol School for Neuroscience, Tel Aviv University

pbl007 avatar Jul 07 '22 15:07 pbl007

Hello, I think this is probably harmless; it's possible for Caiman to use a GPU, but some of our test code (and demos) don't do so by default because trying to use a GPU might mean getting an ancient one that's not capable of running the network.

We have some safeguards that (internally) set CUDA_VISIBLE_DEVICES to -1 to try not to use a GPU unless the user specifically requests it (see caiman/components_evaluation.py:evaluate_components_CNN() for an example); if you're bumping into that (or one of the few other instances of similar efforts) then it would not be surprising to see your exact output.

--Pat
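The kind of safeguard described above can be sketched as follows. This is an illustration of the general pattern only, not CaImAn's actual code: `hide_gpus_unless_requested` and its `use_gpu` flag are hypothetical names. The key point is that the variable has to be set before TensorFlow first initialises CUDA in the process, because it is read at that moment.

```python
import os


def hide_gpus_unless_requested(use_gpu: bool = False) -> None:
    """Mask all CUDA devices for this process unless the caller opts in.

    Hypothetical helper illustrating the pattern; it must run before any
    library in the process initialises CUDA.
    """
    if not use_gpu:
        # "-1" matches no device index, so CUDA reports that no device exists.
        os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


if __name__ == "__main__":
    hide_gpus_unless_requested()
    print(os.environ["CUDA_VISIBLE_DEVICES"])
```

With the devices masked this way, TensorFlow logs the same CUDA_ERROR_NO_DEVICE message shown in the report and carries on without a GPU.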

pgunn avatar Jul 07 '22 16:07 pgunn

Thanks Pat! I’ll keep an eye open as we start setting up in a new server. Our pipeline was running previously on a cluster and I think we didn’t see the logs/outputs as it was not run under an interactive session but rather as a job. Best Pablo

pbl007 avatar Jul 07 '22 16:07 pbl007

Closing this for now

pgunn avatar Jul 07 '22 18:07 pgunn

Hi Pat

Sorry to bother you with this again. This is showing up again in demo_pipe_as_func.py, which we use as the working example for all the analysis done in the lab so far with CaImAn. Could we find a way to fix it? Thanks in advance!

Saving files to /storage/pblab_shared_data/CaImAn/David/soma/fov2_soma_gcamp_x20_mag_2_512px_30hz_00001-1_CHANNEL_1_memmap__d1_512_d2_512_d3_1_order_C_frames_627_.mmap
349813 [components_evaluation.py:classify_components_ep():243][676294] Component 4 is only active jointly with neighboring components. Space correlation calculation might be unreliable.
349829 [components_evaluation.py:classify_components_ep():243][676291] Component 10 is only active jointly with neighboring components. Space correlation calculation might be unreliable.
349846 [components_evaluation.py:classify_components_ep():243][676294] Component 22 is only active jointly with neighboring components. Space correlation calculation might be unreliable.
349848 [components_evaluation.py:classify_components_ep():243][676295] Component 27 is only active jointly with neighboring components. Space correlation calculation might be unreliable.
349859 [components_evaluation.py:classify_components_ep():243][676288] Component 16 is only active jointly with neighboring components. Space correlation calculation might be unreliable.
349865 [components_evaluation.py:classify_components_ep():243][676288] Component 19 is only active jointly with neighboring components. Space correlation calculation might be unreliable.
349866 [components_evaluation.py:classify_components_ep():243][676292] Component 13 is only active jointly with neighboring components. Space correlation calculation might be unreliable.
349931 [components_evaluation.py:classify_components_ep():243][676290] Component 35 is only active jointly with neighboring components. Space correlation calculation might be unreliable.
349965 [components_evaluation.py:classify_components_ep():243][676289] Component 45 is only active jointly with neighboring components. Space correlation calculation might be unreliable.
USING MODEL:/home/pb/caiman_data/model/cnn_model.json
2022-07-11 11:33:07.287020: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-07-11 11:33:07.287062: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: claustrum
2022-07-11 11:33:07.287068: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: claustrum
2022-07-11 11:33:07.287222: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 510.54.0
2022-07-11 11:33:07.287244: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 510.54.0
2022-07-11 11:33:07.287250: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 510.54.0
2022-07-11 11:33:07.287872: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
12/12 [==============================] - 0s 16ms/step


pbl007 avatar Jul 11 '22 08:07 pbl007

Hello, I think demo_pipe_as_func.py is probably something you made yourself, perhaps modified off of an existing notebook? If you'd like to pastebin it, I can take a look.

pgunn avatar Jul 11 '22 18:07 pgunn

Hi Pat,

thanks again.

The original demo code is most likely not ours, but I am picking up work done by the student who set up the pipeline; he recently graduated.

We have a pipeline for running in "batch" mode where we generate a TOML config file (like the one attached here), then we run run_caiman_with_config.py, which calls demo_pipeline_func_local.py.

I can also upload / point to the test data I am using.

Best

Pablo


pbl007 avatar Jul 11 '22 20:07 pbl007

Hello, I don't see any pastebin URL or attachment, although if you used the email interface to github it may not parse it in.

pgunn avatar Jul 11 '22 20:07 pgunn

Attaching code.zip here.

pbl007 avatar Jul 11 '22 20:07 pbl007

Looking at the code paths, right now I believe you're hitting evaluate_components_CNN() through one of the longer call paths, and there's not currently a way to signal intent to use the GPU through that call path.

My understanding is that the code should work anyhow, falling back to running the network on the CPU, so I don't think you're actually hurt by this, but it would be nice for the software to have the option of using a GPU if you have one.

I think, for something this deep in the call stack and for something relatively external like this, I'm likely to modify that code to use an environment variable to express intent to use a GPU; the next version of caiman will support this. Or if you're feeling brave, I have committed it to the dev branch. The env var will be called CAIMAN_ALLOW_GPU

pgunn avatar Jul 12 '22 21:07 pgunn

OK. Thanks for reopening this issue and putting it on the dev branch. I am certainly not brave enough (yet) when it comes to heavy-lifting python coding.

What call path will allow me to signal GPU use?

In the meantime, I can exploit some 200 CPU cores; I just need to pass n_processes as a parameter. I'll focus on this for the time being.


pbl007 avatar Jul 14 '22 08:07 pbl007

When this makes it into a release you won't need to change your code, you'll just be able to set the environment variable (there are several ways to do this, from doing it before you launch jupyter to assigning into os.environ) before you import the caiman libraries and you'll be good to go.
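A minimal sketch of the os.environ route, assuming only what is said above: the variable is named CAIMAN_ALLOW_GPU and must be set before the caiman libraries are imported. The value "1" and the helper name `allow_caiman_gpu` are assumptions for illustration; the thread does not say which value, if any, the library expects.

```python
import os
import sys


def allow_caiman_gpu(value: str = "1") -> None:
    """Set CAIMAN_ALLOW_GPU for this process; call before `import caiman`.

    The value "1" is an assumption; the thread only names the variable.
    """
    if "caiman" in sys.modules:
        # Too late: the library may already have decided to mask the GPU.
        raise RuntimeError("caiman is already imported; set CAIMAN_ALLOW_GPU earlier")
    os.environ["CAIMAN_ALLOW_GPU"] = value


allow_caiman_gpu()
# import caiman  # import only after the variable has been set
```

The alternative mentioned above, setting the variable in the shell before launching Python or Jupyter, has the same effect without touching the script.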

Cheers, Pat

pgunn avatar Jul 14 '22 12:07 pgunn

Great! Thanks again! Any idea when the next release will be?

pbl007 avatar Jul 14 '22 13:07 pbl007

It's pretty much whenever I think there's enough content to make it worth going through the release process. On average it's 2-3 weeks, I think.

pgunn avatar Jul 14 '22 14:07 pgunn

Fantastic! Very much looking forward to it! Thanks again

pbl007 avatar Jul 15 '22 05:07 pbl007