pykilosort

Memory error with 200GB dataset
Hi Cyrille,
I am using the script below to run pyKilosort, edited from your README example: MB_pyKilosort.txt
The whitening matrix gets computed normally and the preprocessing step completes gracefully.
If I re-run pyKilosort, here is the output I get (it skips the preprocessing step, which was already done in a previous run):
>>> run(dat_path=dat_path, probe=probe, params=params, dir_path=dir_path)
12:21:17.055 [I] main:48              Loaded raw data with 385 channels, 279066528 samples.
Params used:
{'fs': 30000.0, 'fshigh': 150.0, 'fslow': None, 'minfr_goodchannels': 0, 'Th': [10, 3], 'lam': 10, 'AUCsplit': 0.9, 'minFR': 0.02, 'momentum': [20, 400], 'sigmaMask': 30, 'ThPre': 8, 'spkTh': -6, 'reorder': 1, 'nskip': 25, 'nfilt_factor': 4, 'ntbuff': 64, 'whiteningRange': 32, 'nSkipCov': 25, 'scaleproc': 200, 'nPCs': 3, 'nt0': 61, 'nup': 10, 'sig': 1, 'gain': 1, 'templateScaling': 20.0, 'loc_range': [5, 4], 'long_range': [30, 6], 'NT': 32832, 'NTbuff': 33088, 'nt0min': 20}
12:21:17.060 [D] utils:261            Loading Wrot.npy
12:21:17.075 [I] utils:322            Starting step reorder.
C:\Users\Maxime\Anaconda3\envs\pyKilosort\lib\site-packages\pykilosort\cluster.py:125: RuntimeWarning: overflow encountered in long_scalars
  offset = Nchan * batchstart[ibatch]
Clustering spikes:   2%|█▏                                                          | 170/8517 [00:47<38:01,  3.66it/s]C:\Users\Maxime\Anaconda3\envs\pyKilosort\lib\site-packages\pykilosort\cluster.py:324: RuntimeWarning: overflow encountered in long_scalars
  dat = proc.flat[offset:offset + NT * Nchan].reshape((-1, Nchan), order='F')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Maxime\Anaconda3\envs\pyKilosort\lib\site-packages\pykilosort\main.py", line 139, in run
    out = clusterSingleBatches(ctx)
  File "C:\Users\Maxime\Anaconda3\envs\pyKilosort\lib\site-packages\pykilosort\cluster.py", line 470, in clusterSingleBatches
    uproj, call = extractPCbatch2(
  File "C:\Users\Maxime\Anaconda3\envs\pyKilosort\lib\site-packages\pykilosort\cluster.py", line 324, in extractPCbatch2
    dat = proc.flat[offset:offset + NT * Nchan].reshape((-1, Nchan), order='F')
MemoryError: Unable to allocate 192. GiB for an array with shape (103095615488,) and data type int16
Thanks. Could you run the debugger in IPython ("%debug") and print the values of offset, NT, and Nchan?
Sorry for the delay, I struggled a bit (using %pdb is easier, apparently):
offset: 2143272960
NT: 32832
Nchan: 384
Ok, and what about those (still in the debugger)?
proc.flat.shape
proc.flat[offset:offset + NT * Nchan].shape
proc.flat[offset:offset + NT * Nchan].reshape((-1, Nchan)).shape
proc.flat[offset:offset + NT * Nchan].reshape((-1, Nchan), order='F').shape
len(proc.flat): 107377975296 (flat has no shape attribute)
proc.flat[offset:offset + NT * Nchan].shape: returns a memory error
proc.flat[offset:offset + NT * Nchan].reshape((-1, Nchan)).shape: returns a memory error
proc.flat[offset:offset + NT * Nchan].reshape((-1, Nchan), order='F').shape: returns a memory error
Do you understand what's going on? I don't. In theory, proc.flat[offset:offset + NT * Nchan] should represent an array with NT * Nchan elements, which is way smaller than 192 GB. Could you try:
proc.flat[0:NT * Nchan].shape
proc.flat[0:10].shape
proc.flat[offset:offset+10].shape
proc.flat[offset]
and other variants to get a sense of what is failing? There might be a deeper issue related to memmap on Windows.
It's a bit odd: offset is 2143272960, NT * Nchan is 12607488, but offset + NT * Nchan is -2139086848!
Looks like we need more than 32 bits per integer!!
If I convert them into int64, offset+NT*Nchan is positive!!
Then proc.flat[offset:offset + NT * Nchan].shape is (12607488,)
Looks like we will need to add a lot of .astype(np.int64) calls all over the place...
Ah! But which one is an int32? offset? NT? Nchan? If we were using native Python int types it should all be int64 if I'm not mistaken. Maybe one of these values comes from a NumPy array?
I'll need to rerun it to find out, I'll let you know.
This is Python 3 64-bit on Windows 64-bit, correct?
Python 3.8.1 64-bit, Windows 10 64-bit, yep.
It is offset, whose type() is numpy.int32; all the others are native Python integers.
int(offset) + NT * Nchan returns a positive number without overflow, as expected with 64-bit integers.
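A minimal sketch of that wrap-around with the values reported above (offset is constructed as an int32 by hand here; in the actual run it ended up as an int32 because of the win64 default integer type discussed below):

import numpy as np

# Values reported above.
NT, Nchan = 32832, 384
offset = np.int32(2143272960)      # offset came back as a numpy.int32 scalar

# 2143272960 + 12607488 exceeds 2**31 - 1, so 32-bit integer arithmetic wraps
# around to a negative value, which then turns the proc.flat slice into a
# request for a huge bogus array.
print(offset + NT * Nchan)         # -2139086848 (with a RuntimeWarning)

# Promoting to a 64-bit / native Python integer gives the expected result.
print(int(offset) + NT * Nchan)    # 2155880448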
Ok, we should understand why offset is int32. Looking at this line it should really be an int64...
I added .astype(np.int64) to both places where batchstart is computed:
l. 117: batchstart = np.arange(0, NT * Nbatch + 1, NT).astype(np.int64)
(cf. the error message above, there was another overflow here:
C:\Users\Maxime\Anaconda3\envs\pyKilosort\lib\site-packages\pykilosort\cluster.py:125: RuntimeWarning: overflow encountered in long_scalars
  offset = Nchan * batchstart[ibatch])
l. 321: batchstart = np.arange(0, NT * Nbatch + 1, NT).astype(np.int64)  # batches start at these timepoints
Fixed the problem, now it's running further. I'll let you know if it hits a bug at any later point!
excellent! Feel free to open a PR if you make it run to the end. Though I still don't understand why arange is not an int64 already (it is on my Linux machine)
This is the issue: https://stackoverflow.com/questions/36278590/numpy-array-dtype-is-coming-as-int32-by-default-in-a-windows-10-64-bit-machine
Default integer type np.int_ is C long:
http://docs.scipy.org/doc/numpy-1.10.1/user/basics.types.html
But C long is int32 in win64.
https://msdn.microsoft.com/en-us/library/9c3yd98k.aspx
This is kind of a weirdness of the win64 platform.
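A quick way to check this on a given machine (just a sketch; the first two prints give int32 on win64 with older NumPy releases and int64 on Linux):

import numpy as np

# np.int_ mirrors the platform's C long: 32 bits on 64-bit Windows
# (with older NumPy releases), 64 bits on 64-bit Linux.
print(np.dtype(np.int_), np.iinfo(np.int_).bits)

# arange() without an explicit dtype inherits that default width.
print(np.arange(5).dtype)

# Being explicit removes the platform dependence.
print(np.arange(5, dtype=np.int64).dtype)   # int64 everywhere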
Hail Unix!!
It means that we need to explicitly use the np.int64 type everywhere there is a risk of very large array sizes.
I learnt something!! Good to know, and we should indeed cast to np.int64 everywhere it's needed. Hopefully you should find most of the spots by running it on a large dataset.
There was a third spot - I've forked the repo and opened a PR with all the modifications I've made.
One of them also concerns the way params are handled: it is currently impossible to manually set the parameters defined in set_dependent_params() (importantly, the batch size params.NT!), because this function is run AFTER the user-defined params (passed through the params argument of run()) are taken into account. The workaround is simply to move set_dependent_params() BEFORE params.update(user_params).
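A simplified, hypothetical sketch of that ordering issue (this is not the actual pykilosort code; the default values and the NT formula below are placeholders, only the order of the two calls matters):

# Sketch only: set_dependent_params / params.update(user_params) are the calls
# discussed above, everything else here is made up for illustration.

def set_dependent_params(params):
    # Fills in parameters derived from other settings, e.g. the batch size NT.
    params['NT'] = 32 * 1024 + params['ntbuff']

def build_params(user_params):
    params = {'ntbuff': 64}                # stand-in for the library defaults

    # Problematic order: user overrides are applied first, then
    # set_dependent_params() overwrites them, so a user-supplied NT is lost.
    #   params.update(user_params)
    #   set_dependent_params(params)

    # Workaround described above: derive the dependent defaults first, then
    # apply the user's values so they take precedence.
    set_dependent_params(params)
    params.update(user_params)
    return params

print(build_params({'NT': 16000})['NT'])   # 16000: the user's value is kept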
It is still running, now at the 'learn:l.710' stage.
Hi Cyrille,
I confirm that the same memory error occurred during the final splitting step.
Isn't it odd that the memory error only occurs at this point? Isn't most of the work done by then? Is the GPU even necessary to split units?
Here is the full error log:
553 with 108511 spikes.
22:54:05.305 [D] postprocess:443      Splitting template 343/553 with 418620 spikes.
22:54:08.547 [D] postprocess:443      Splitting template 344/553 with 32475 spikes.
22:54:09.583 [D] postprocess:443      Splitting template 344/554 with 23293 spikes.
22:54:10.570 [D] postprocess:443      Splitting template 345/554 with 15461 spikes.
22:54:11.430 [D] postprocess:443      Splitting template 346/554 with 381613 spikes.
22:54:14.396 [D] postprocess:443      Splitting template 347/554 with 621990 spikes.
22:54:15.559 [D] cptools:157          Out of memory during convolve on x (621990, 96) and b (2001, 1), trying to split x.

---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
~\Desktop\pyKilosort\pykilosort\pykilosort\cptools.py in convolve(x, b, axis)
    154         with redirect_stderr(f):
--> 155             return _convolve(x, b, axis=axis)
    156     except Exception:

~\Desktop\pyKilosort\pykilosort\pykilosort\cptools.py in _convolve(x, b, axis)
    144     xbf = xf * bf
--> 145     y = cp.fft.irfft(xbf, axis=axis)
    146     y = y[y.shape[axis] - xshape[axis]:, :]

c:\users\maxime\anaconda3\envs\pykilosort\lib\site-packages\cupy\fft\fft.py in irfft(a, n, axis, norm)
    646     """
--> 647     return _fft(a, (n,), (axis,), norm, cufft.CUFFT_INVERSE, 'C2R')
    648

c:\users\maxime\anaconda3\envs\pykilosort\lib\site-packages\cupy\fft\fft.py in _fft(a, s, axes, norm, direction, value_type, overwrite_x, plan)
    172             out_size = s[-1]
--> 173         a = _exec_fft(a, direction, value_type, norm, axes[-1], overwrite_x,
    174                       out_size)

c:\users\maxime\anaconda3\envs\pykilosort\lib\site-packages\cupy\fft\fft.py in _exec_fft(a, direction, value_type, norm, axis, overwrite_x, out_size, out, plan)
    112     else:
--> 113         out = plan.get_output_array(a)
    114

cupy\cuda\cufft.pyx in cupy.cuda.cufft.Plan1d.get_output_array()

c:\users\maxime\anaconda3\envs\pykilosort\lib\site-packages\cupy\creation\basic.py in empty(shape, dtype, order)
     21     """
---> 22     return cupy.ndarray(shape, dtype, order=order)
     23

cupy\core\core.pyx in cupy.core.core.ndarray.__init__()
cupy\cuda\memory.pyx in cupy.cuda.memory.alloc()
cupy\cuda\memory.pyx in cupy.cuda.memory.MemoryPool.malloc()
cupy\cuda\memory.pyx in cupy.cuda.memory.MemoryPool.malloc()
cupy\cuda\memory.pyx in cupy.cuda.memory.SingleDeviceMemoryPool.malloc()
cupy\cuda\memory.pyx in cupy.cuda.memory.SingleDeviceMemoryPool._malloc()
cupy\cuda\memory.pyx in cupy.cuda.memory._try_malloc()

OutOfMemoryError: Out of memory allocating 478,456,320 bytes (allocated so far: 7,250,273,780,736 bytes).

During handling of the above exception, another exception occurred:

OutOfMemoryError                          Traceback (most recent call last)
<ipython-input-2-a0a5e2ad361c> in <module>
     89
     90
---> 91 run(dat_path=dat_path, probe=probe, params=params, dir_path=dir_path)

~\Desktop\pyKilosort\pykilosort\pykilosort\main.py in run(dat_path, raw_data, probe, params, dir_path, stop_after)
    211         # final splits by SVD
    212         with ctx.time('split_1'):
--> 213             out = splitAllClusters(ctx, True)
    214         # Use a different name for both splitting steps.
    215         out['st3_s1'] = out.pop('st3_s')

~\Desktop\pyKilosort\pykilosort\pykilosort\postprocess.py in splitAllClusters(ctx, flag)
    465
    466         # subtract a running average, because the projections are NOT drift corrected
--> 467         clpc = my_conv2(clp, 250, 0)
    468         clp -= clpc
    469

~\Desktop\pyKilosort\pykilosort\pykilosort\preprocess.py in my_conv2(S1, sig, varargin)
    171         # custom GPU lfilter implementation below.
    172         cNorm = convolve(cp.ones((dsnew2[0], 1)), gaus).ravel()[:, cp.newaxis]
--> 173         S1 = convolve(S1, gaus)
    174
    175         # Slow Custom GPU lfilter implementation:

~\Desktop\pyKilosort\pykilosort\pykilosort\cptools.py in convolve(x, b, axis)
    165         for j in range(n):
    166             ys.append(convolve(x[:, j:j + 1], b))
--> 167         y = cp.concatenate(ys, axis=1 - axis)
    168         assert y.shape == x.shape
    169         return y

c:\users\maxime\anaconda3\envs\pykilosort\lib\site-packages\cupy\manipulation\join.py in concatenate(tup, axis)
     52         tup = [m.ravel() for m in tup]
     53         axis = 0
---> 54     return core.concatenate_method(tup, axis)
     55
     56

cupy\core\_routines_manipulation.pyx in cupy.core._routines_manipulation.concatenate_method()
cupy\core\_routines_manipulation.pyx in cupy.core._routines_manipulation.concatenate_method()
cupy\core\_routines_manipulation.pyx in cupy.core._routines_manipulation._concatenate()
cupy\core\core.pyx in cupy.core.core.ndarray.__init__()
cupy\cuda\memory.pyx in cupy.cuda.memory.alloc()
cupy\cuda\memory.pyx in cupy.cuda.memory.MemoryPool.malloc()
cupy\cuda\memory.pyx in cupy.cuda.memory.MemoryPool.malloc()
cupy\cuda\memory.pyx in cupy.cuda.memory.SingleDeviceMemoryPool.malloc()
cupy\cuda\memory.pyx in cupy.cuda.memory.SingleDeviceMemoryPool._malloc()
cupy\cuda\memory.pyx in cupy.cuda.memory._try_malloc()

OutOfMemoryError: Out of memory allocating 477,688,320 bytes (allocated so far: 7,250,273,780,736 bytes).
> c:\users\maxime\desktop\pykilosort\pykilosort\cupy\cuda\memory.pyx(780)cupy.cuda.memory._try_malloc()
How much memory do you have on your GPU? I've had many problems with this line: the GPU FFT uses a lot of GPU memory, and other GPU or CPU implementations seem prohibitively slow...
That's the NVIDIA GeForce GTX 1080: https://www.nvidia.com/fr-fr/geforce/products/10series/geforce-gtx-1080/
So it's a pretty high-end card :/
Yes, but it "only" has 8 GB of RAM (like mine), and it seems that one would need 12 or 16 GB (e.g. a Quadro) on large datasets. But ideally we should make it work on 8 GB GPUs. This specific line is the main problem and we have to think about other solutions. If you look at the code you'll see that I already implemented a hacky workaround, which is to do the FFT one channel at a time when we're close to the GPU RAM limit. In your case this hack is not enough.
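For reference, a rough CPU sketch of that channel-at-a-time fallback (NumPy instead of CuPy, simplified 'same'-mode edge handling; the function name and the demo shapes are illustrative, not the library's actual API):

import numpy as np

def fft_smooth_by_column(x, b):
    """FFT-based convolution of each column of x with the kernel b, one column
    at a time so that only a single column's spectrum lives in memory at once.
    Rough sketch of the idea; the real code runs on the GPU with CuPy
    (cp.fft instead of np.fft) and handles edges differently."""
    nfull = x.shape[0] + b.shape[0] - 1          # full linear-convolution length
    bf = np.fft.rfft(b.ravel(), nfull)           # kernel spectrum, computed once
    start = (b.shape[0] - 1) // 2                # offset for a 'same'-sized output
    out = np.empty(x.shape, dtype=np.float64)
    for j in range(x.shape[1]):                  # one channel/column at a time
        xf = np.fft.rfft(x[:, j], nfull)
        y = np.fft.irfft(xf * bf, nfull)
        out[:, j] = y[start:start + x.shape[0]]
    return out

# The failing call above convolved x of shape (621990, 96) with a (2001, 1)
# Gaussian kernel; smaller shapes here just to keep the demo quick.
x = np.random.rand(100000, 8).astype(np.float32)
b = np.ones((2001, 1)) / 2001.0
print(fft_smooth_by_column(x, b).shape)          # (100000, 8)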
Ok, got it, thanks for your input Cyrille.
So I guess 2 questions remain:
- Are there any remaining hacks you can think of to make the FFT less memory-hungry?
- Is there a way to know ahead of time how much GPU RAM a given dataset needs, to know whether it's worth throwing £3k at a fancier GPU?
Yes, we'll be working on it in the next few days.
Awesome!