
Memory error with 200GB dataset

Open m-beau opened this issue 5 years ago • 26 comments

Hi Cyrille,

I am using the script below to run pyKilosort, adapted from your README example: MB_pyKilosort.txt

The whitening matrix gets computed normally and the preprocessing step completes gracefully.

If I re-run pyKilosort, here is the output I get (it skips the preprocessing step, which was done in a previous run):

>>> run(dat_path=dat_path, probe=probe, params=params, dir_path=dir_path)

12:21:17.055 [I] main:48              Loaded raw data with 385 channels, 279066528 samples.
Params used:
{'fs': 30000.0, 'fshigh': 150.0, 'fslow': None, 'minfr_goodchannels': 0, 'Th': [10, 3], 'lam': 10, 'AUCsplit': 0.9, 'minFR': 0.02, 'momentum': [20, 400], 'sigmaMask': 30, 'ThPre': 8, 'spkTh': -6, 'reorder': 1, 'nskip': 25, 'nfilt_factor': 4, 'ntbuff': 64, 'whiteningRange': 32, 'nSkipCov': 25, 'scaleproc': 200, 'nPCs': 3, 'nt0': 61, 'nup': 10, 'sig': 1, 'gain': 1, 'templateScaling': 20.0, 'loc_range': [5, 4], 'long_range': [30, 6], 'NT': 32832, 'NTbuff': 33088, 'nt0min': 20}
12:21:17.060 [D] utils:261            Loading Wrot.npy
12:21:17.075 [I] utils:322            Starting step reorder.
C:\Users\Maxime\Anaconda3\envs\pyKilosort\lib\site-packages\pykilosort\cluster.py:125: RuntimeWarning: overflow encountered in long_scalars
  offset = Nchan * batchstart[ibatch]
Clustering spikes:   2%|█▏                                                          | 170/8517 [00:47<38:01,  3.66it/s]C:\Users\Maxime\Anaconda3\envs\pyKilosort\lib\site-packages\pykilosort\cluster.py:324: RuntimeWarning: overflow encountered in long_scalars
  dat = proc.flat[offset:offset + NT * Nchan].reshape((-1, Nchan), order='F')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Maxime\Anaconda3\envs\pyKilosort\lib\site-packages\pykilosort\main.py", line 139, in run
    out = clusterSingleBatches(ctx)
  File "C:\Users\Maxime\Anaconda3\envs\pyKilosort\lib\site-packages\pykilosort\cluster.py", line 470, in clusterSingleBatches
    uproj, call = extractPCbatch2(
  File "C:\Users\Maxime\Anaconda3\envs\pyKilosort\lib\site-packages\pykilosort\cluster.py", line 324, in extractPCbatch2
    dat = proc.flat[offset:offset + NT * Nchan].reshape((-1, Nchan), order='F')
MemoryError: Unable to allocate 192. GiB for an array with shape (103095615488,) and data type int16

m-beau avatar Feb 21 '20 12:02 m-beau

Thanks. Could you run the debugger in ipython ("%debug") and print the values of offset, NT, Nchan?

rossant avatar Feb 21 '20 12:02 rossant

Sorry for the delay, I struggled a bit (using %pdb is easier, apparently):

offset: 2143272960
NT: 32832
Nchan: 384

m-beau avatar Feb 21 '20 12:02 m-beau

Ok, and what about those (still in the debugger)?

proc.flat.shape
proc.flat[offset:offset + NT * Nchan].shape
proc.flat[offset:offset + NT * Nchan].reshape((-1, Nchan)).shape
proc.flat[offset:offset + NT * Nchan].reshape((-1, Nchan), order='F').shape

rossant avatar Feb 21 '20 12:02 rossant

len(proc.flat): 107377975296 (flat has no shape attribute)
proc.flat[offset:offset + NT * Nchan].shape: returns memory error
proc.flat[offset:offset + NT * Nchan].reshape((-1, Nchan)).shape: returns memory error
proc.flat[offset:offset + NT * Nchan].reshape((-1, Nchan), order='F').shape: returns memory error

m-beau avatar Feb 21 '20 13:02 m-beau

Do you understand what's going on? I don't. In theory, proc.flat[offset:offset + NT * Nchan] should represent an array with NT * Nchan elements, which is far smaller than 192 GiB. Could you try:

proc.flat[0:NT * Nchan].shape
proc.flat[0:10].shape
proc.flat[offset:offset+10].shape
proc.flat[offset]

and other variants to get a sense of what is failing? There might be a deeper issue related to memmap on Windows.

rossant avatar Feb 21 '20 13:02 rossant

It's a bit odd:

offset is 2143272960
NT * Nchan is 12607488
offset + NT * Nchan is -2139086848 !

Looks like we need more than 32 bits per integer!!

If I convert them to int64, offset + NT * Nchan is positive!!

m-beau avatar Feb 21 '20 13:02 m-beau

Then proc.flat[offset:offset + NT * Nchan].shape is (12607488,)

m-beau avatar Feb 21 '20 13:02 m-beau

Looks like we will need to add a lot of .astype(np.int64) calls all over the place...

m-beau avatar Feb 21 '20 13:02 m-beau

Ah! But which one is an int32? offset? NT? Nchan? If we were using native Python int types there would be no overflow at all, since they have arbitrary precision. Maybe one of these values comes from a NumPy array?

rossant avatar Feb 21 '20 13:02 rossant

I'll need to rerun it to find out, I'll let you know.

m-beau avatar Feb 21 '20 13:02 m-beau

This is Python 3 64-bit on Windows 64-bit, correct?

rossant avatar Feb 21 '20 13:02 rossant

Python 3.8.1 64-bit, Windows 10 64-bit, yep.

m-beau avatar Feb 21 '20 13:02 m-beau

It is offset, whose type() is numpy.int32; all the others are native Python integers.

int(offset) + NT * Nchan returns a positive number without overflow, as expected once the arithmetic is no longer confined to 32 bits.

m-beau avatar Feb 21 '20 14:02 m-beau

OK, we should understand why offset is int32. Looking at this line, it should really be an int64...

rossant avatar Feb 21 '20 14:02 rossant

The fix, in cluster.py:

l. 117: batchstart = np.arange(0, NT * Nbatch + 1, NT).astype(np.int64)
(cf. the error message above; there was another overflow here: C:\Users\Maxime\Anaconda3\envs\pyKilosort\lib\site-packages\pykilosort\cluster.py:125: RuntimeWarning: overflow encountered in long_scalars, offset = Nchan * batchstart[ibatch])

l. 321: batchstart = np.arange(0, NT * Nbatch + 1, NT).astype(np.int64)  # batches start at these timepoints

That fixed the problem; it now runs further. I'll let you know if it fails at any later point!

m-beau avatar Feb 21 '20 14:02 m-beau

Excellent! Feel free to open a PR if you make it run to the end. Though I still don't understand why arange is not int64 already (it is on my Linux machine).

rossant avatar Feb 21 '20 14:02 rossant

This is the issue: https://stackoverflow.com/questions/36278590/numpy-array-dtype-is-coming-as-int32-by-default-in-a-windows-10-64-bit-machine

Default integer type np.int_ is C long:

http://docs.scipy.org/doc/numpy-1.10.1/user/basics.types.html

But a C long is 32 bits on Win64.

https://msdn.microsoft.com/en-us/library/9c3yd98k.aspx

This is kind of a weirdness of the win64 platform.

Hail Unix!!

It means that we need to make the np.int64 type explicit everywhere there is a risk of very large array sizes.
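A minimal sketch of the failure mode, assuming 64-bit Windows with a NumPy version that still defaults np.int_ to C long; the NT, Nbatch, and Nchan values are the ones from this thread:

```python
import numpy as np

# Values from this run: NT samples per batch, ~8517 batches, 384 channels.
NT, Nbatch, Nchan = 32832, 8517, 384

# On win64 (NumPy < 2.0), the default integer is C long = 32 bits, so this
# array comes out as int32 even though its *values* fit comfortably:
batchstart = np.arange(0, NT * Nbatch + 1, NT)
print(batchstart.dtype)  # int32 on win64, int64 on Linux/macOS

# Multiplying by Nchan then silently overflows 32 bits
# (RuntimeWarning: overflow encountered in long_scalars):
offset = Nchan * batchstart[-1]

# The fix applied in cluster.py: force 64-bit indices up front.
batchstart = np.arange(0, NT * Nbatch + 1, NT).astype(np.int64)
offset = Nchan * batchstart[-1]  # now a valid, positive flat index (~1.07e11)
```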

m-beau avatar Feb 21 '20 14:02 m-beau

I learnt something!! Good to know, and we should indeed cast to np.int64 everywhere it's needed. Hopefully you'll find most spots by running it on a large dataset.

rossant avatar Feb 21 '20 14:02 rossant

There was a third spot; I've forked the repo and opened a PR with all the modifications I've made.

One also concerns the way params are handled: it is currently impossible to manually set the parameters defined in set_dependent_params() (importantly, the batch size params.NT!), because that function runs AFTER the user-defined params (passed through the params argument of run()) are taken into account. The workaround is simply to move set_dependent_params() BEFORE params.update(user_params), as sketched below.
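A minimal, self-contained sketch of that ordering fix; build_params and the defaults dict are hypothetical stand-ins for illustration, only the set_dependent_params / params.update(user_params) ordering comes from the code under discussion:

```python
def set_dependent_params(params):
    # Stand-in for the real set_dependent_params(): fills in values
    # derived from other params (the formula here is illustrative only).
    params.setdefault('NT', 32 * 1024 + params['ntbuff'])

def build_params(user_params):
    # Hypothetical wrapper showing the corrected ordering.
    params = {'ntbuff': 64}        # illustrative defaults
    set_dependent_params(params)   # derived values are computed FIRST...
    params.update(user_params)     # ...user overrides are applied LAST, so they win
    return params

print(build_params({}))             # {'ntbuff': 64, 'NT': 32832}  (derived default)
print(build_params({'NT': 65536})) # user-set NT survives: {'ntbuff': 64, 'NT': 65536}
```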

It is still running, now at the 'learn:l.710' stage.

m-beau avatar Feb 21 '20 19:02 m-beau

Hi Cyrille,

I confirm that the same memory error occurred during the final splitting step.

Isn't it odd that the memory error only occurs at this point? Isn't most of the work done by then? Is the GPU even necessary to split units?

Here is the full error log:

…553 with 108511 spikes.

22:54:05.305 [D] postprocess:443      Splitting template 343/553 with 418620 spikes.
22:54:08.547 [D] postprocess:443      Splitting template 344/553 with 32475 spikes.
22:54:09.583 [D] postprocess:443      Splitting template 344/554 with 23293 spikes.
22:54:10.570 [D] postprocess:443      Splitting template 345/554 with 15461 spikes.
22:54:11.430 [D] postprocess:443      Splitting template 346/554 with 381613 spikes.
22:54:14.396 [D] postprocess:443      Splitting template 347/554 with 621990 spikes.
22:54:15.559 [D] cptools:157
Out of memory during convolve on x (621990, 96) and b (2001, 1), trying to split x.

---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)

~\Desktop\pyKilosort\pykilosort\pykilosort\cptools.py in convolve(x, b, axis)
    154         with redirect_stderr(f):
--> 155             return _convolve(x, b, axis=axis)
    156     except Exception:

~\Desktop\pyKilosort\pykilosort\pykilosort\cptools.py in _convolve(x, b, axis)
    144     xbf = xf * bf
--> 145     y = cp.fft.irfft(xbf, axis=axis)
    146     y = y[y.shape[axis] - xshape[axis]:, :]

c:\users\maxime\anaconda3\envs\pykilosort\lib\site-packages\cupy\fft\fft.py in irfft(a, n, axis, norm)
    646     """
--> 647     return _fft(a, (n,), (axis,), norm, cufft.CUFFT_INVERSE, 'C2R')
    648

c:\users\maxime\anaconda3\envs\pykilosort\lib\site-packages\cupy\fft\fft.py in _fft(a, s, axes, norm, direction, value_type, overwrite_x, plan)
    172             out_size = s[-1]
--> 173         a = _exec_fft(a, direction, value_type, norm, axes[-1], overwrite_x,
    174                       out_size)

c:\users\maxime\anaconda3\envs\pykilosort\lib\site-packages\cupy\fft\fft.py in _exec_fft(a, direction, value_type, norm, axis, overwrite_x, out_size, out, plan)
    112     else:
--> 113         out = plan.get_output_array(a)
    114

cupy\cuda\cufft.pyx in cupy.cuda.cufft.Plan1d.get_output_array()

c:\users\maxime\anaconda3\envs\pykilosort\lib\site-packages\cupy\creation\basic.py in empty(shape, dtype, order)
     21     """
---> 22     return cupy.ndarray(shape, dtype, order=order)
     23

cupy\core\core.pyx in cupy.core.core.ndarray.__init__()
cupy\cuda\memory.pyx in cupy.cuda.memory.alloc()
cupy\cuda\memory.pyx in cupy.cuda.memory.MemoryPool.malloc()
cupy\cuda\memory.pyx in cupy.cuda.memory.MemoryPool.malloc()
cupy\cuda\memory.pyx in cupy.cuda.memory.SingleDeviceMemoryPool.malloc()
cupy\cuda\memory.pyx in cupy.cuda.memory.SingleDeviceMemoryPool._malloc()
cupy\cuda\memory.pyx in cupy.cuda.memory._try_malloc()

OutOfMemoryError: Out of memory allocating 478,456,320 bytes (allocated so far: 7,250,273,780,736 bytes).

During handling of the above exception, another exception occurred:

OutOfMemoryError                          Traceback (most recent call last)

<ipython-input-2-a0a5e2ad361c> in <module>
     89
     90
---> 91 run(dat_path=dat_path, probe=probe, params=params, dir_path=dir_path)

~\Desktop\pyKilosort\pykilosort\pykilosort\main.py in run(dat_path, raw_data, probe, params, dir_path, stop_after)
    211         # final splits by SVD
    212         with ctx.time('split_1'):
--> 213             out = splitAllClusters(ctx, True)
    214         # Use a different name for both splitting steps.
    215         out['st3_s1'] = out.pop('st3_s')

~\Desktop\pyKilosort\pykilosort\pykilosort\postprocess.py in splitAllClusters(ctx, flag)
    465
    466         # subtract a running average, because the projections are NOT drift corrected
--> 467         clpc = my_conv2(clp, 250, 0)
    468         clp -= clpc
    469

~\Desktop\pyKilosort\pykilosort\pykilosort\preprocess.py in my_conv2(S1, sig, varargin)
    171         # custom GPU lfilter implementation below.
    172         cNorm = convolve(cp.ones((dsnew2[0], 1)), gaus).ravel()[:, cp.newaxis]
--> 173         S1 = convolve(S1, gaus)
    174
    175         # Slow Custom GPU lfilter implementation:

~\Desktop\pyKilosort\pykilosort\pykilosort\cptools.py in convolve(x, b, axis)
    165         for j in range(n):
    166             ys.append(convolve(x[:, j:j + 1], b))
--> 167         y = cp.concatenate(ys, axis=1 - axis)
    168         assert y.shape == x.shape
    169         return y

c:\users\maxime\anaconda3\envs\pykilosort\lib\site-packages\cupy\manipulation\join.py in concatenate(tup, axis)
     52         tup = [m.ravel() for m in tup]
     53         axis = 0
---> 54     return core.concatenate_method(tup, axis)
     55
     56

cupy\core\_routines_manipulation.pyx in cupy.core._routines_manipulation.concatenate_method()
cupy\core\_routines_manipulation.pyx in cupy.core._routines_manipulation.concatenate_method()
cupy\core\_routines_manipulation.pyx in cupy.core._routines_manipulation._concatenate()
cupy\core\core.pyx in cupy.core.core.ndarray.__init__()
cupy\cuda\memory.pyx in cupy.cuda.memory.alloc()
cupy\cuda\memory.pyx in cupy.cuda.memory.MemoryPool.malloc()
cupy\cuda\memory.pyx in cupy.cuda.memory.MemoryPool.malloc()
cupy\cuda\memory.pyx in cupy.cuda.memory.SingleDeviceMemoryPool.malloc()
cupy\cuda\memory.pyx in cupy.cuda.memory.SingleDeviceMemoryPool._malloc()
cupy\cuda\memory.pyx in cupy.cuda.memory._try_malloc()

OutOfMemoryError: Out of memory allocating 477,688,320 bytes (allocated so far: 7,250,273,780,736 bytes).
> c:\users\maxime\desktop\pykilosort\pykilosort\cupy\cuda\memory.pyx(780)cupy.cuda.memory._try_malloc()

m-beau avatar Feb 24 '20 20:02 m-beau

How much memory do you have on your GPU? I've had many problems with this line; the GPU FFT uses a lot of GPU memory, and other GPU or CPU implementations seem prohibitively slow...

rossant avatar Feb 24 '20 22:02 rossant

That's an NVIDIA GeForce GTX 1080: https://www.nvidia.com/fr-fr/geforce/products/10series/geforce-gtx-1080/

So it's a pretty high-end card :/

m-beau avatar Feb 24 '20 22:02 m-beau

Yes, but it "only" has 8 GB of RAM (like mine), and it seems that one would need 12 or 16 GB (e.g. a Quadro) on large datasets. Ideally, though, we should make it work on 8 GB GPUs. This specific line is the main problem and we have to think about other solutions. If you look at the code you'll see that I already implemented a hacky workaround, which is to do the FFT one channel at a time when we're close to the GPU RAM limit (sketched below). In your case this hack is not enough.
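For reference, a minimal sketch of that fallback. The structure follows the cptools.convolve frames visible in the traceback above, but _convolve here is a simplified stand-in for the repo's FFT convolution, and the memory-pool flush is an assumption rather than the exact code:

```python
import cupy as cp

def _convolve(x, b, axis=0):
    # Simplified stand-in for pykilosort's FFT-based cptools._convolve:
    # zero-pad, multiply spectra, inverse-transform, crop to 'same' length.
    n = x.shape[axis] + b.shape[0] - 1
    xf = cp.fft.rfft(x, n=n, axis=axis)
    bf = cp.fft.rfft(b, n=n, axis=0)
    y = cp.fft.irfft(xf * bf, n=n, axis=axis)
    return y[b.shape[0] // 2 : b.shape[0] // 2 + x.shape[axis]]

def convolve_lowmem(x, b):
    # The fallback described above: try the full FFT convolution; on GPU
    # out-of-memory, redo it one column (channel) at a time.
    try:
        return _convolve(x, b)
    except cp.cuda.memory.OutOfMemoryError:
        cp.get_default_memory_pool().free_all_blocks()  # drop cached GPU blocks
        ys = [_convolve(x[:, j:j + 1], b) for j in range(x.shape[1])]
        return cp.concatenate(ys, axis=1)
```

As the traceback shows, even the per-column path can still fail: the final cp.concatenate must allocate one full-size output array on the GPU.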

rossant avatar Feb 25 '20 08:02 rossant

OK, got it, thanks for your input Cyrille.

So I guess two questions remain:

  1. Are there any remaining hacks you can think of to make the FFT less memory-hungry?

  2. Is there a way to know ahead of time how much GPU RAM a given dataset needs, to know whether it's worth throwing £3k at a fancier GPU?
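(On question 2, a rough back-of-the-envelope is possible from the log alone; the failed allocation matches a single float64 temporary the size of the per-template projection array, and the FFT path holds several such temporaries at once, so peak usage scales with the spike count of the largest template times the number of PC features. Using only values from this thread:)

```python
# Sanity check against the log above (all numbers come from this thread):
n_spikes, n_feats = 621990, 96   # shape of x in "convolve on x (621990, 96)"
print(n_spikes * n_feats * 8)    # one float64 temporary of that shape
# -> 477688320, exactly the "Out of memory allocating 477,688,320 bytes" figure
```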

m-beau avatar Feb 25 '20 11:02 m-beau

  1. Yes, we'll be working on it in the next few days.

rossant avatar Feb 26 '20 09:02 rossant

Awesome!

m-beau avatar Feb 26 '20 14:02 m-beau