BUG: GPU calling failure when starting first clustering

Open Alchemist-Y opened this issue 1 year ago • 14 comments

Describe the issue:

Spike extraction runs normally, but when sorting reaches the first clustering step, Kilosort appears to switch to the CPU for computation instead of the GPU.

Reproduce the bug:

GPU calling fails during the first clustering step.

Error message:

No response

Version information:

kilosort: latest version; CUDA toolkit: 11.8; NVIDIA driver: Windows Server 2022 Standard; GPU: NVIDIA RTX A2000 12GB

Alchemist-Y avatar Jul 23 '24 10:07 Alchemist-Y

I'm sorry, I'm having a hard time understanding the issue. What do you mean it "automatically chooses the cpu?" How are you determining that? Are you getting an error during sorting? What version of Kilosort4 is this?

jacobpennington avatar Jul 23 '24 20:07 jacobpennington

@jacobpennington Hi, I'm sorry about my unclear description. The Kilosort version is 4.0.13. What I mean is that I select the GPU as the PyTorch device in the GUI, but during the clustering step the GPU shows no activity at all in Task Manager, while the CPU runs at 100%.

Alchemist-Y avatar Jul 24 '24 00:07 Alchemist-Y

The task manager is not an accurate way to check GPU usage for python processes. If you want to check usage while sorting is running, you can use the nvidia-smi command in a terminal or powershell.
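As a minimal sketch of that check, the snippet below polls `nvidia-smi` from Python and parses one reading per GPU. The helper names `parse_gpu_line` and `sample_gpu` are hypothetical, and the sketch assumes `nvidia-smi` is on the `PATH`; it is not part of Kilosort.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY = "utilization.gpu,memory.used,memory.total"


def _to_int(text):
    """Convert a numeric field to int; return None for values like '[N/A]'."""
    text = text.strip()
    return int(text) if text.isdigit() else None


def parse_gpu_line(line):
    """Parse one CSV line such as '87, 3120, 12282' into a dict."""
    util, used, total = line.split(",")
    return {
        "util_pct": _to_int(util),
        "mem_used_mib": _to_int(used),
        "mem_total_mib": _to_int(total),
    }


def sample_gpu():
    """Return one reading per GPU, or None when nvidia-smi is not on PATH."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_gpu_line(l) for l in result.stdout.splitlines() if l.strip()]
```

Calling `sample_gpu()` every few seconds while sorting runs gives a rough utilization trace for each step.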

jacobpennington avatar Jul 24 '24 01:07 jacobpennington

@jacobpennington Hi, here is the detailed GPU usage information for the spike extraction step and the clustering step. [image] [image]

Alchemist-Y avatar Jul 24 '24 06:07 Alchemist-Y

Okay. Can you please clarify if you get an error during sorting, or think it's taking too long, or some other problem? Not all steps of the sorting process use the GPU.

jacobpennington avatar Jul 25 '24 02:07 jacobpennington

Yes, the first clustering step takes too long. On my own computer, which has an NVIDIA 1650S 4GB GPU, the entire sorting finishes in about 3 hours. On this Windows server, the first clustering step alone runs for 2 hours without making any progress.

Alchemist-Y avatar Jul 25 '24 05:07 Alchemist-Y

Are you sure your pytorch installation is set up correctly? You said you're using CUDA toolkit version 11.8, but your screenshots show version 12.4. If you are using 12.4, I would recommend you try setting up a new environment using toolkit version 11.8 and then see if the issue persists. Some other users have reported difficulties using 12.4.
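One way to check this from inside the environment is to ask PyTorch directly. The sketch below uses a hypothetical helper, `describe_torch_cuda`, that reports the CUDA version the installed PyTorch build was compiled against (`torch.version.cuda`) and whether a GPU is usable. Note that the "CUDA Version" in the `nvidia-smi` header is the highest version the driver supports, not the installed toolkit.

```python
def describe_torch_cuda():
    """Summarize the PyTorch/CUDA setup of the current environment."""
    try:
        import torch
    except (ImportError, OSError):
        # PyTorch is missing or its native libraries failed to load.
        return {"torch_installed": False}

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against; None for CPU-only builds.
        "built_for_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_cuda().items():
        print(f"{key}: {value}")
```

If `built_for_cuda` is `None` or `cuda_available` is `False`, the environment has a CPU-only or misconfigured PyTorch install, which would be consistent with GPU steps running slowly.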

jacobpennington avatar Jul 27 '24 00:07 jacobpennington

@jacobpennington Hi, the figure above shows the highest CUDA version that my server supports. What I have actually installed is CUDA 11.8, as shown below. [image]

Alchemist-Y avatar Aug 09 '24 15:08 Alchemist-Y

Ah, I see. Unfortunately, I'm not sure what else I can try to debug for you if the sorting works fine on a different machine. The next thing I would try is the following, just to make sure nothing got messed up with the installation and there aren't any conflicts with other packages:

  1. Restart the machine.
  2. Create a new conda environment to use only for Kilosort (it looks like you're using the base environment right now).
  3. Follow the steps in the readme again to install Kilosort and pytorch.
  4. Retry the sorting.

If you still get the same error after trying that, please upload kilosort4.log from the results directory from the new sorting attempt. Screenshots of the Kilosort4 GUI with that recording loaded might also be helpful, if you're using the GUI.
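For a quick look at the log before uploading it, a small sketch like the following prints its last lines. The helper name `tail_log` and the example path are hypothetical; only the file name `kilosort4.log` comes from this thread.

```python
from pathlib import Path


def tail_log(results_dir, n_lines=30, name="kilosort4.log"):
    """Return the last n_lines of the log in results_dir, or [] if absent."""
    log_path = Path(results_dir) / name
    if not log_path.is_file():
        return []
    with log_path.open("r", encoding="utf-8", errors="replace") as fh:
        return fh.read().splitlines()[-n_lines:]


if __name__ == "__main__":
    # Hypothetical path; replace with the actual results directory.
    for line in tail_log("path/to/results"):
        print(line)
```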

jacobpennington avatar Aug 09 '24 17:08 jacobpennington

Hi Jacob, here is my log. The first clustering step was taking too long, so I interrupted it. kilosort4.log

Alchemist-Y avatar Aug 27 '24 03:08 Alchemist-Y

I have exactly the same problem: "first clustering" is extremely slow (>200 s/it), probably because the GPU is not being used, while the first 2 steps run at 2 it/s as expected. In Task Manager I can isolate CUDA usage, and it drops at the "first clustering" step. I am calling Kilosort directly, not via spikeinterface. I reinstalled the CUDA toolkit and pytorch, but saw no improvement.

RobertoDF avatar Aug 29 '24 00:08 RobertoDF

Here is a video of the issue: https://www.dropbox.com/scl/fi/vaipnzqoxsyrrvbkuy3o8/Untitled.m4v?rlkey=watrlnib7q6a30a3sbzqovpoc&dl=0

RobertoDF avatar Aug 29 '24 23:08 RobertoDF

I see the same behavior as in your video.

Alchemist-Y avatar Aug 30 '24 02:08 Alchemist-Y

I was able to solve this by using a totally fresh environment, not just uninstalling and reinstalling. However, now I get a CUDA out of memory error 😰

RobertoDF avatar Aug 31 '24 08:08 RobertoDF

Still unclear what's causing this since I can't reproduce it, but given that it appears to work for you on some machines and was resolved in at least one case by creating a new environment, I'm going to close the issue and recommend using environment.yml to create a new environment if you continue to have this problem. Instructions for that are at the bottom of the readme.

jacobpennington avatar Nov 06 '24 18:11 jacobpennington