CUDA out of memory during saving to phy
Describe the issue:
Hi, thanks for the great work. I've been using KS to sort long (multi-day) NP1 recordings. I managed to edit KS3 so that some of the memory-heavy computations are done in CPU memory, at the cost of slower run times. Moving to KS4, running on the GPU was possible again, but it failed oddly during saving to phy. I'd be happy if you could help me resolve this issue. Thanks, Anan
The following is the output of KS4
Interpreting binary file as default dtype='int16'. If data was saved in a different format, specify data_dtype.
Using GPU for PyTorch computations. Specify device to change this.
sorting G:\NDR21\NDR21_hab3ToExt_g0\NDR21_hab3ToExt_g0_imec0\NDR21_hab3ToExt_g0_t0.imec0.ap.bin
using probe neuropixPhase3B1_kilosortChanMap.mat
Preprocessing filters computed in 227.77s; total 227.85s
computing drift
Re-computing universal templates from data.
H:\envs\kilosort4_1\lib\site-packages\threadpoolctl.py:1223: RuntimeWarning: Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at the same time. Both libraries are known to be incompatible and this can cause random crashes or deadlocks on Linux when loaded in the same Python program. Using threadpoolctl may cause crashes or deadlocks. For more information and possible workarounds, please see https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md
  warnings.warn(msg, RuntimeWarning)
100%|██████████████████████████████████████████████████████████████████████████| 21600/21600 [5:29:56<00:00, 1.09it/s]
drift computed in 24592.70s; total 24820.55s
Extracting spikes using templates
Re-computing universal templates from data.
100%|██████████████████████████████████████████████████████████████████████████| 21600/21600 [5:34:03<00:00, 1.08it/s]
101617684 spikes extracted in 20305.22s; total 45127.25s
First clustering
100%|██████████████████████████████████████████████████████████████████████████████| 96/96 [15:20:35<00:00, 575.37s/it]
742 clusters found, in 55302.18s; total 100429.43s
Extracting spikes using cluster waveforms
100%|██████████████████████████████████████████████████████████████████████████| 21600/21600 [3:41:37<00:00, 1.62it/s]
119437390 spikes extracted in 13482.27s; total 113911.70s
Final clustering
100%|██████████████████████████████████████████████████████████████████████████████| 96/96 [22:33:55<00:00, 846.20s/it]
492 clusters found, in 81236.76s; total 195148.93s
Merging clusters
471 units found, in 362.79s; total 195511.72s
Saving to phy and computing refractory periods
Traceback (most recent call last):
File "H:\envs\kilosort4_1\lib\runpy.py", line 187, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "H:\envs\kilosort4_1\lib\runpy.py", line 110, in _get_module_details
__import__(pkg_name)
File "F:\PycharmProjects\kilosort4\ks4_main.py", line 23, in
Thanks for catching this, we're deciding on how to fix it. If you want to be able to sort in the meantime, since you mentioned you're comfortable modifying code, you could install from source and comment out a few lines to skip this step for now. It's only used for plotting the spike positions on the probe (which is nice, but not necessary).
The relevant lines are
214 and 215 in kilosort.gui.run_box.py (last two lines here):
elif plot_type == 'probe':
plot_window = self.plots['probe']
ops = self.current_worker.ops
st = self.current_worker.st
clu = self.current_worker.clu
tF = self.current_worker.tF
is_refractory = self.current_worker.is_refractory
device = self.parent.device
# plot_spike_positions(plot_window, ops, st, clu, tF, is_refractory,
# device)
172, 173, 181, and 185 in kilosort.io.py (all the ones that mention "spike_positions"):
# spike properties
spike_times = st[:,0].astype('int64') + imin # shift by minimum sample index
spike_templates = st[:,1].astype('int32')
spike_clusters = clu
# xs, ys = compute_spike_positions(st, tF, ops) <----
# spike_positions = np.vstack([xs, ys]).T <----
amplitudes = ((tF**2).sum(axis=(-2,-1))**0.5).cpu().numpy()
# remove duplicate (artifact) spikes
spike_times, spike_clusters, kept_spikes = remove_duplicates(
spike_times, spike_clusters, dt=ops['settings']['duplicate_spike_bins']
)
amp = amplitudes[kept_spikes]
spike_templates = spike_templates[kept_spikes]
# spike_positions = spike_positions[kept_spikes] <----
np.save((results_dir / 'spike_times.npy'), spike_times)
np.save((results_dir / 'spike_templates.npy'), spike_clusters)
np.save((results_dir / 'spike_clusters.npy'), spike_clusters)
# np.save((results_dir / 'spike_positions.npy'), spike_positions) <-----
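If you still want spike_positions.npy later, it should be possible to recompute it from the objects run_kilosort returns, using the same call that is commented out above. A minimal sketch, assuming compute_spike_positions can be imported from kilosort.postprocessing (check the imports at the top of your installed io.py) and that st, tF, ops, and results_dir come from your own sort:

import numpy as np
from kilosort.postprocessing import compute_spike_positions  # import path is an assumption

# st, tF, ops: values returned by run_kilosort; results_dir: your phy output folder
xs, ys = compute_spike_positions(st, tF, ops)
spike_positions = np.vstack([xs, ys]).T
# If duplicate spikes were removed when saving, apply the same kept_spikes mask
# here so the array stays aligned with spike_times.npy.
np.save(results_dir / 'spike_positions.npy', spike_positions)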
Thanks. I am still struggling with the KS handling of CUDA memory. I have found that unless the cache is explicitly released with torch.cuda.empty_cache(), it is not emptied and "CUDA out of memory" causes the program to exit. By calling torch.cuda.empty_cache() before and after every phase of the sorting, I successfully sorted a 5-hour NP1 recording; a minimal sketch of that pattern follows the traceback below. I think you should add empty_cache() to your code, or provide a flag to enable it when desired. Unfortunately, KS4 still crashed with "CUDA out of memory" when I tried sorting a longer recording of 49 hours. There was a warning about a scalar overflow, and then it tried to allocate 2.6 TB of GPU memory. See the trace log below:
Interpreting binary file as default dtype='int16'. If data was saved in a different format, specify data_dtype.
Using GPU for PyTorch computations. Specify device to change this.
sorting G:\NDR21\NDR21_hab3ToExt_g0\NDR21_hab3ToExt_g0_imec0\NDR21_hab3ToExt_g0_t0.imec0.ap.bin
using probe neuropixPhase3B1_kilosortChanMap.mat
Preprocessing filters computed in 1003.20s; total 1003.29s
computing drift
Re-computing universal templates from data.
h:\envs\kilosort4_1\lib\site-packages\threadpoolctl.py:1223: RuntimeWarning: Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at the same time. Both libraries are known to be incompatible and this can cause random crashes or deadlocks on Linux when loaded in the same Python program. Using threadpoolctl may cause crashes or deadlocks. For more information and possible workarounds, please see https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md
  warnings.warn(msg, RuntimeWarning)
 41%|████████████████████████████▊ | 35791/88200 [8:54:26<13:46:29, 1.06it/s]
h:\envs\kilosort4_1\lib\site-packages\kilosort\io.py:511: RuntimeWarning: overflow encountered in scalar add
bend = min(self.imax, bstart + self.NT + 2*self.nt)
41%|████████████████████████████▊ | 35791/88200 [8:54:27<13:02:37, 1.12it/s]
Traceback (most recent call last):
File "h:\envs\kilosort4_1\lib\runpy.py", line 187, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "h:\envs\kilosort4_1\lib\runpy.py", line 110, in _get_module_details
__import__(pkg_name)
File "F:\PycharmProjects\kilosort4\ks4_main.py", line 24, in
Okay thanks, looking into it. Just to clarify, are you using the default settings to sort this? I.e. no changes to batch size, detection thresholds, etc.
I did not change any parameter. Thanks for looking into it. Anan
Hi. Any news regarding this issue? Thanks Anan
Not yet.
Re: the last error you described, I have a fix working. I'll push it after I test a few more things (probably today). The problem was an integer overflow caused by the very large number of samples, which made the program try to load many batches at once.
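For context, the numbers line up with an int32 wraparound. A sketch, assuming the default batch size NT=60000 and padding nt=61, and approximating the batch-start index as ibatch * NT (the exact expression in io.py may differ):

import numpy as np

NT, nt = 60000, 61                     # assumed Kilosort4 defaults
ibatch = 35791                         # batch where the warning appeared (of 88200)
bstart = np.int32(ibatch * NT)         # 2,147,460,000 -- still fits in int32
bend = bstart + NT + 2 * nt            # exceeds 2**31 - 1: "overflow encountered in scalar add"
print(bend)                            # negative value after int32 wraparound

# Promoting to 64-bit avoids the wraparound:
print(np.int64(bstart) + NT + 2 * nt)  # 2147520122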
As for the other memory issues you brought up, those will take longer to work on, but they're on the to-do list. It sounds like using empty_cache() is working for you for whatever reason, but I don't want to add that to the code since it doesn't actually free up any memory that PyTorch doesn't already have reserved. There are optimizations in the underlying sorting steps that we need to try instead, to reduce the amount of memory allocated in the first place; they just haven't been a priority yet since most users' recordings are much shorter than this.
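To illustrate that point (a minimal sketch, not Kilosort code): empty_cache() only hands reserved-but-unused blocks back to the driver; memory held by live tensors is untouched.

import torch

if torch.cuda.is_available():
    x = torch.zeros(1024, 1024, 256, device='cuda')   # ~1 GiB held by a live tensor
    print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
    torch.cuda.empty_cache()                           # no change: x is still allocated
    print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
    del x                                              # tensor gone, but blocks stay cached
    torch.cuda.empty_cache()                           # now the cached blocks are released
    print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())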
Thanks for the overflow fix. I will wait for your push and test it on my data.
WRT the GPU memory usage, I understand that it is not critical for most users, but I hope the KS team will find time to optimize it, as recording durations will surely grow quickly in the near future.
Thanks again for putting in the effort to solve these problems.
Much obliged, Anan
I'm keeping memory optimization on the TODO list, but just FYI:
There's a new clear_cache option for run_kilosort in the latest version, which adds a torch.cuda.empty_cache() call in the clustering step. Similar calls may be added to other steps of the pipeline if people encounter more problems. My best guess currently is that a memory fragmentation issue is causing these errors, but figuring out where it comes from and how to fix it will take more work.
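Usage would look roughly like this (a sketch; the settings values are placeholders, and parameter names other than clear_cache should be checked against the installed version):

from kilosort import run_kilosort

settings = {'data_dir': 'G:/NDR21/NDR21_hab3ToExt_g0/NDR21_hab3ToExt_g0_imec0',  # placeholder path
            'n_chan_bin': 385}

results = run_kilosort(settings=settings,
                       probe_name='neuropixPhase3B1_kilosortChanMap.mat',
                       clear_cache=True)   # empties the CUDA cache during clustering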
Thanks Jacob, I will test and report. Anan
Closed #670 as completed.