KiloSort
Too large Nfilt breaks GPU
Some people in our lab have noticed that setting ops.Nfilt to too large a value gives an unknown CUDA error after template finding (when the fullMPMU function is called).
Since we are using anything between 750 and 1200 channels on our Neuroseeker probe we would sometimes set Nfilt to over 3k. That would break the GPU.
What works for us is around 2300 (we now usually set it between 3270 and 3272). We haven't tried intermediate values between 2304 (OK) and 3200 (breaks), so we can't tell you exactly which values break the code.
The GPUs we have been using have had either 12 or 8 GB of memory.
It certainly feels like a "GPU out of memory" problem, which could be addressed by having the code compare the required and available GPU memory at the start and notify the user before template finding even begins.
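A pre-flight check of that kind could be as simple as the sketch below. Note that `bytesPerFilter` is a made-up scaling factor, not a measured number: the true footprint depends on ops.NT and the internals of the mex files, so this only illustrates the shape of the check.

```matlab
% Sketch: warn before template matching if the GPU looks too small.
% bytesPerFilter is a HYPOTHETICAL placeholder; the real per-filter
% footprint would have to be derived from ops.NT, nt0, and mexMPmuFEAT.
g = gpuDevice;                        % currently selected GPU
bytesPerFilter = 4 * ops.NT * 3;      % ASSUMED rough scaling with Nfilt
required = ops.Nfilt * bytesPerFilter;
if required > g.AvailableMemory
    warning('KiloSort:gpuMemory', ...
        'Nfilt=%d may need ~%.1f GB, but only %.1f GB is free on the GPU.', ...
        ops.Nfilt, required/1e9, g.AvailableMemory/1e9);
end
```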
I see the same thing on a GTX 6800 (2 GB of memory) with 768 filters (512 works; I haven't tested intermediate values yet). It happens on both long (2 hour) and short (2 minute) datasets (128 channels).
In my case the error happens in https://github.com/cortex-lab/KiloSort/blob/9d9bc00a4b5eb66b9e81420961249abc80a695f1/fullMPMU.m#L159 and the actual error is
Warning: An unexpected error occurred during CUDA execution. The CUDA error was: CUDA_ERROR_ILLEGAL_ADDRESS In fullMPMU (line 159)
When querying the GPU device in MATLAB afterwards using parallel.gpu.GPUDevice.getDevice(1), it can no longer read AvailableMemory: the property is missing from the CUDADevice object, and trying to access it directly fails with
An unexpected error occurred during CUDA execution. The CUDA error was: an illegal memory access was encountered
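Once the context is in this state, the device can usually be recovered without restarting MATLAB by resetting it; note that this discards every gpuArray on the device.

```matlab
% After CUDA_ERROR_ILLEGAL_ADDRESS the CUDA context is corrupted.
% Resetting the selected device clears the error state, but destroys
% all gpuArray data currently held on that device.
reset(gpuDevice);
```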
It looks like an issue in mexMPmuFEAT.cu. When I save the Fourier-space templates during template fitting and then use cpuMPmuFEAT instead of mexMPmuFEAT, the GPU does not crash even with more filters (I keep using gpuArrays in fullMPMU and gather them when calling cpuMPmuFEAT).
(In the one case I've looked at more closely, the crash happens in the second batch (ibatch==2), where *MPmuFEAT returns st=298 for ibatch==1 and st=156 for ibatch==2. That may again mean the problem is not with accessing dataRAW in line 159, but with the low-level CUDA code, which I don't understand.)
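For anyone wanting to try the same workaround: it boils down to gathering the gpuArray inputs back to host memory before the CPU call. Schematically (the argument list below is a placeholder, not the real signature of *MPmuFEAT; match it to the actual call in your copy of fullMPMU.m):

```matlab
% Schematic workaround: fall back to the CPU implementation.
% gpuArgs and the output list are PLACEHOLDERS, not the real signature.
args = cellfun(@gather, gpuArgs, 'UniformOutput', false); % pull data off the GPU
[st, id] = cpuMPmuFEAT(args{:});                          % placeholder call
```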
I experience the same issue.
My installation is able to run the demo properly (ops.Nfilt=64), but not our data for which we set ops.Nfilt=384*4.
I use the current master branch of KiloSort, my OS is Ubuntu 16.04, my MATLAB version is 2018a, and my graphics card specs are:
description: VGA compatible controller
product: NVIDIA Corporation
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:02:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: irq:39 memory:f6000000-f6ffffff memory:e0000000-efffffff memory:f0000000-f1ffffff ioport:e000(size=128) memory:c0000-dffff
Here is the full log output:
Time 0s. Loading raw data...
Time 495s. Channel-whitening filters computed.
Time 495s. Loading raw data and applying filters...
Time 1892.28. Whitened data written to disk...
Time 1892.28. Preprocessing complete!
Time 1897s. Optimizing templates ...
Time 14103.05, batch 25981/25986, mu 17.79, neg-err 601850.749700, NTOT 12765985, n100 29843, n200 21013, n300 16060, n400 10880
Time 14148s. Running the final template matching pass...
Time 14150.18, batch 1/4331, NTOT 5204
Warning: An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_ILLEGAL_ADDRESS
> In fullMPMU (line 159)
In master_file_spikes_DK (line 21)
[warning above repeated 10 times in total]
Warning: An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_ILLEGAL_ADDRESS
> In fullMPMU (line 3)
In master_file_spikes_DK (line 21)
[warning above repeated 4 times in total]
Error using gpuArray/subsref
An unexpected error occurred during CUDA execution. The CUDA error was:
an illegal memory access was encountered
Error in fullMPMU (line 161)
datSp = dataRAW(inds(:), :);
Error in master_file_spikes_DK (line 21)
rez = fullMPMU(rez, DATA);% extract final spike times (overlapping extraction)
Thank you @marius10p.
I confirm that I do NOT get the error for ops.Nfilt=384*3. I tried reducing the batch size to compensate (down to ops.NT=32512+ops.ntbuff) and ran it with ops.Nfilt=384*4 as desired, but the error occurred again.
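For reference, this is the combination from the failed test above as a config fragment; ops.ntbuff is shown with an assumed typical value, so check it against your own config file.

```matlab
% Settings used in the failed test above.
ops.ntbuff = 64;                  % ASSUMED typical buffer value, not from the log
ops.NT     = 32512 + ops.ntbuff;  % reduced batch size (32512 = 32*1016)
ops.Nfilt  = 384*4;               % still triggers the illegal-address error
```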