CUDA out of memory during saving to phy
Describe the issue:
Hi, thanks for the great work. I've been using KS to sort long (multi-day) NP1 recordings. I managed to edit KS3 so that some of the memory-heavy computations are done in CPU memory, at the cost of slower run times. Moving to KS4, running on the GPU was possible again, but it failed oddly during saving to phy. I'd be happy if you could help me resolve this issue. Thanks, Anan
The following is the output of KS4
Interpreting binary file as default dtype='int16'. If data was saved in a different format, specify data_dtype.
Using GPU for PyTorch computations. Specify device to change this.
sorting G:\NDR21\NDR21_hab3ToExt_g0\NDR21_hab3ToExt_g0_imec0\NDR21_hab3ToExt_g0_t0.imec0.ap.bin
using probe neuropixPhase3B1_kilosortChanMap.mat
Preprocessing filters computed in 227.77s; total 227.85s
computing drift
Re-computing universal templates from data.
H:\envs\kilosort4_1\lib\site-packages\threadpoolctl.py:1223: RuntimeWarning: Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at the same time. Both libraries are known to be incompatible and this can cause random crashes or deadlocks on Linux when loaded in the same Python program. Using threadpoolctl may cause crashes or deadlocks. For more information and possible workarounds, please see https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md
  warnings.warn(msg, RuntimeWarning)
100%|██████████████████████████████████████████████████████████████████████████| 21600/21600 [5:29:56<00:00, 1.09it/s]
drift computed in 24592.70s; total 24820.55s
Extracting spikes using templates
Re-computing universal templates from data.
100%|██████████████████████████████████████████████████████████████████████████| 21600/21600 [5:34:03<00:00, 1.08it/s]
101617684 spikes extracted in 20305.22s; total 45127.25s
First clustering
100%|██████████████████████████████████████████████████████████████████████████████| 96/96 [15:20:35<00:00, 575.37s/it]
742 clusters found, in 55302.18s; total 100429.43s
Extracting spikes using cluster waveforms
100%|██████████████████████████████████████████████████████████████████████████| 21600/21600 [3:41:37<00:00, 1.62it/s]
119437390 spikes extracted in 13482.27s; total 113911.70s
Final clustering
100%|██████████████████████████████████████████████████████████████████████████████| 96/96 [22:33:55<00:00, 846.20s/it]
492 clusters found, in 81236.76s; total 195148.93s
Merging clusters
471 units found, in 362.79s; total 195511.72s
Saving to phy and computing refractory periods
Traceback (most recent call last):
File "H:\envs\kilosort4_1\lib\runpy.py", line 187, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "H:\envs\kilosort4_1\lib\runpy.py", line 110, in _get_module_details
__import__(pkg_name)
File "F:\PycharmProjects\kilosort4\ks4_main.py", line 23, in
Thanks for catching this, we're deciding on how to fix it. If you want to be able to sort in the meantime, since you mentioned you're comfortable modifying code, you could install from source and comment out a few lines to skip this step for now. It's only used for plotting the spike positions on the probe (which is nice, but not necessary).
The relevant lines are
214 and 215 in kilosort.gui.run_box.py (last two lines here):
elif plot_type == 'probe':
plot_window = self.plots['probe']
ops = self.current_worker.ops
st = self.current_worker.st
clu = self.current_worker.clu
tF = self.current_worker.tF
is_refractory = self.current_worker.is_refractory
device = self.parent.device
# plot_spike_positions(plot_window, ops, st, clu, tF, is_refractory,
# device)
172, 173, 181, and 185 in kilosort.io.py (all the ones that mention "spike_positions"):
# spike properties
spike_times = st[:,0].astype('int64') + imin # shift by minimum sample index
spike_templates = st[:,1].astype('int32')
spike_clusters = clu
# xs, ys = compute_spike_positions(st, tF, ops) <----
# spike_positions = np.vstack([xs, ys]).T <----
amplitudes = ((tF**2).sum(axis=(-2,-1))**0.5).cpu().numpy()
# remove duplicate (artifact) spikes
spike_times, spike_clusters, kept_spikes = remove_duplicates(
spike_times, spike_clusters, dt=ops['settings']['duplicate_spike_bins']
)
amp = amplitudes[kept_spikes]
spike_templates = spike_templates[kept_spikes]
# spike_positions = spike_positions[kept_spikes] <----
np.save((results_dir / 'spike_times.npy'), spike_times)
np.save((results_dir / 'spike_templates.npy'), spike_clusters)
np.save((results_dir / 'spike_clusters.npy'), spike_clusters)
# np.save((results_dir / 'spike_positions.npy'), spike_positions) <-----
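If you still want spike_positions.npy later, it should be possible to recompute it from the objects run_kilosort returns, using the same call that is commented out above. A minimal sketch, assuming compute_spike_positions can be imported from kilosort.postprocessing (check the imports at the top of your installed io.py) and that st, tF, ops, and results_dir come from your own sort:

import numpy as np
from kilosort.postprocessing import compute_spike_positions  # import path is an assumption

# st, tF, ops: values returned by run_kilosort; results_dir: your phy output folder
xs, ys = compute_spike_positions(st, tF, ops)
spike_positions = np.vstack([xs, ys]).T
# If duplicate spikes were removed when saving, apply the same kept_spikes mask
# here so the array stays aligned with spike_times.npy.
np.save(results_dir / 'spike_positions.npy', spike_positions)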
Thanks. I am still struggling with the KS handling of CUDA memory. I have found that unless the cache is explicitly released with torch.cuda.empty_cache(), it is not emptied and "CUDA out of memory" causes the program to exit. By calling torch.cuda.empty_cache() before and after every phase of the sorting, I successfully sorted a 5-hour NP1 recording; a minimal sketch of that pattern follows the traceback below. I think you should add empty_cache() to your code, or provide a flag to enable it when desired. Unfortunately, KS4 still crashed with "CUDA out of memory" when I tried sorting a longer recording of 49 hours. There was a warning about a scalar overflow, and then it tried to allocate 2.6 TB of GPU memory. See the trace log below:
Interpreting binary file as default dtype='int16'. If data was saved in a different format, specify data_dtype.
Using GPU for PyTorch computations. Specify device to change this.
sorting G:\NDR21\NDR21_hab3ToExt_g0\NDR21_hab3ToExt_g0_imec0\NDR21_hab3ToExt_g0_t0.imec0.ap.bin
using probe neuropixPhase3B1_kilosortChanMap.mat
Preprocessing filters computed in 1003.20s; total 1003.29s
computing drift
Re-computing universal templates from data.
h:\envs\kilosort4_1\lib\site-packages\threadpoolctl.py:1223: RuntimeWarning: Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at the same time. Both libraries are known to be incompatible and this can cause random crashes or deadlocks on Linux when loaded in the same Python program. Using threadpoolctl may cause crashes or deadlocks. For more information and possible workarounds, please see https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md
  warnings.warn(msg, RuntimeWarning)
 41%|████████████████████████████▊ | 35791/88200 [8:54:26<13:46:29, 1.06it/s]
h:\envs\kilosort4_1\lib\site-packages\kilosort\io.py:511: RuntimeWarning: overflow encountered in scalar add
bend = min(self.imax, bstart + self.NT + 2*self.nt)
41%|████████████████████████████▊ | 35791/88200 [8:54:27<13:02:37, 1.12it/s]
Traceback (most recent call last):
File "h:\envs\kilosort4_1\lib\runpy.py", line 187, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "h:\envs\kilosort4_1\lib\runpy.py", line 110, in _get_module_details
__import__(pkg_name)
File "F:\PycharmProjects\kilosort4\ks4_main.py", line 24, in
Okay thanks, looking into it. Just to clarify, are you using the default settings to sort this? I.e. no changes to batch size, detection thresholds, etc.
I did not change any parameter. Thanks for looking into it. Anan
Hi. Any news regarding this issue? Thanks Anan
Not yet.
Re: the last error you described, I have a fix working. I'll push it after I test a few more things (probably today). The problem was an integer overflow caused by the very large number of samples, which made the program try to load many batches at once.
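For context, the numbers line up with an int32 wraparound. A sketch, assuming the default batch size NT=60000 and padding nt=61, and approximating the batch-start index as ibatch * NT (the exact expression in io.py may differ):

import numpy as np

NT, nt = 60000, 61                     # assumed Kilosort4 defaults
ibatch = 35791                         # batch where the warning appeared (of 88200)
bstart = np.int32(ibatch * NT)         # 2,147,460,000 -- still fits in int32
bend = bstart + NT + 2 * nt            # exceeds 2**31 - 1: "overflow encountered in scalar add"
print(bend)                            # negative value after int32 wraparound

# Promoting to 64-bit avoids the wraparound:
print(np.int64(bstart) + NT + 2 * nt)  # 2147520122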
As for the other memory issues you brought up, those will take longer to work on, but they're on the to-do list. It sounds like using empty_cache() is working for you for whatever reason, but I don't want to add that to the code since it doesn't actually free up any memory that PyTorch doesn't already have reserved. There are optimizations in the underlying sorting steps that we need to try instead, to reduce the amount of memory allocated in the first place; they just haven't been a priority yet since most users' recordings are much shorter than this.
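To illustrate that point (a minimal sketch, not Kilosort code): empty_cache() only hands reserved-but-unused blocks back to the driver; memory held by live tensors is untouched.

import torch

if torch.cuda.is_available():
    x = torch.zeros(1024, 1024, 256, device='cuda')   # ~1 GiB held by a live tensor
    print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
    torch.cuda.empty_cache()                           # no change: x is still allocated
    print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
    del x                                              # tensor gone, but blocks stay cached
    torch.cuda.empty_cache()                           # now the cached blocks are released
    print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())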
Thanks for the overflow fix. I will wait for your push and test it on my data.
WRT the GPU memory usage, I understand that it is not critical for most users, but I hope the KS team will find time to optimize it, as recording durations will surely grow quickly in the near future.
Thanks again for putting in the effort to solve these problems.
Much obliged, Anan
I'm keeping memory optimization on the TODO list, but just FYI:
There's a new clear_cache option for run_kilosort in the latest version, which adds a torch.cuda.empty_cache() call in the clustering step. Similar calls may be added to other steps of the pipeline if people encounter more problems. My best guess currently is that a memory fragmentation issue is causing these errors, but figuring out where it comes from and how to fix it will take more work.
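Usage would look roughly like this (a sketch; the settings values are placeholders, and parameter names other than clear_cache should be checked against the installed version):

from kilosort import run_kilosort

settings = {'data_dir': 'G:/NDR21/NDR21_hab3ToExt_g0/NDR21_hab3ToExt_g0_imec0',  # placeholder path
            'n_chan_bin': 385}

results = run_kilosort(settings=settings,
                       probe_name='neuropixPhase3B1_kilosortChanMap.mat',
                       clear_cache=True)   # empties the CUDA cache during clustering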
Thanks Jacob, I will test and report. Anan
Closed #670 as completed.