
Slow running speed

Open Tingchen-G opened this issue 1 year ago • 13 comments

Hi!

We are using Kilosort for 32-channel recordings that are 10-15 hours long, and it is taking a very long time to process, so I am hoping to get some advice on this issue.

  1. We have 16 shanks, each with 32 channels. Currently I am using a loop to run Kilosort on each shank separately. Some shanks took 3-4 hours, but a few took 9-10 hours, and I noticed that Kilosort takes longer and longer to run as the loop progresses. Any idea why this might be the case?

  2. We are planning to upgrade our GPU. I read on the Kilosort Hardware Recommendation page that for longer recordings, "this situation typically requires more RAM, like 32 or 64 GB". May I check if this is referring to GPU or system memory? Also, since our current memory is sufficient to handle our data, do you think increasing memory, either in the system or GPU, would reduce runtime?

Thank you!

Tingchen-G avatar Aug 26 '24 00:08 Tingchen-G

Interesting, this might be related to SpikeInterface/spikeinterface#3332. I have also noticed that running Kilosort in a loop sometimes causes odd behavior.

RobertoDF avatar Aug 26 '24 09:08 RobertoDF

As for the loop question, are you noticing that it keeps taking longer on the third and fourth loops, or just longer on the second loop like the issue linked in @RobertoDF's comment? If you're assigning the sorting results like:

for i in some_list:
    results = run_kilosort(...)

Then the variables in results will be kept in memory until the next loop iteration completes (or longer, if you're storing the results in a list, for example), which will slow sorting down somewhat since that memory won't be available in the meantime. Most of those variables aren't too big, but the memory for tF can add up fast for recordings with a lot of spikes.
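One thing that can help is deleting the outputs and explicitly asking Python and PyTorch to release the memory before the next shank starts. A rough sketch of the idea (per_shank_configs is just a placeholder for however you're iterating over shanks):

import gc
import torch
from kilosort import run_kilosort

per_shank_configs = []  # placeholder: fill with (settings, probe) pairs, one per shank

for settings, probe in per_shank_configs:
    # Each settings dict needs at least 'n_chan_bin' plus the path to that shank's data.
    ops, st, clu, tF, Wall, similar_templates, is_ref, est_contam_rate, kept_spikes = \
        run_kilosort(settings=settings, probe=probe)

    # ... save whatever you need from the outputs to disk here ...

    # Drop references to the large arrays (tF especially), then force a garbage
    # collection and release any cached GPU memory before the next iteration.
    del ops, st, clu, tF, Wall, similar_templates, is_ref, est_contam_rate, kept_spikes
    gc.collect()
    torch.cuda.empty_cache()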

For the "taking a long time" part, I can't really say much without some information about what hardware you're using. For reference, a Neuropixels recording 2-3 hours long on SSD is expected to take 2-3 hours to sort with a 8-12GB GeForce 3000 or 4000 series card, an i7 or better processor from the last few generations, and at least 32GB of system memory. A 32-channel recording should take less time; however, differences in hardware or spike counts could account for some of the gap.

Is there a reason you're sorting the shanks separately instead of all at once?

jacobpennington avatar Aug 26 '24 14:08 jacobpennington

Thank you for your response! Yes, the sorting takes longer on the second loop, just like in @RobertoDF's comment. But at the end of every loop I have del ops, st, clu, tF, Wall, similar_templates, is_ref, est_contam_rate, kept_spikes, which I thought would clear the memory?

I am sorting the shanks separately because our recordings are very long, so I am worried that sorting all the shanks together would lead to a "CUDA out of memory" error.

And finally, just to clarify: on the Kilosort Hardware Recommendation page, does "this situation typically requires more RAM, like 32 or 64 GB" refer to system memory?

Thank you!

Tingchen-G avatar Aug 26 '24 19:08 Tingchen-G

Yes, that refers to system memory. I'll look into the looping issue. I would also recommend trying to sort everything together, and only sorting the shanks separately if you run into errors, since sorting all at once should speed things up quite a bit.
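To sketch what "all together" could look like: if all 512 channels live in one binary file, you would just need a single probe dictionary covering every shank. The within-shank geometry and the shank spacing below are placeholders, and kcoords is used here to label which shank each channel belongs to:

import numpy as np

n_shanks, chans_per_shank = 16, 32
shank_spacing_um = 250.0   # hypothetical spacing between shanks along x

xc, yc, kcoords = [], [], []
for s in range(n_shanks):
    for c in range(chans_per_shank):
        # Placeholder within-shank layout: two columns, 30 um vertical pitch.
        xc.append(s * shank_spacing_um + (c % 2) * 30.0)
        yc.append((c // 2) * 30.0)
        kcoords.append(s)

probe = {
    'chanMap': np.arange(n_shanks * chans_per_shank),
    'xc': np.array(xc),
    'yc': np.array(yc),
    'kcoords': np.array(kcoords),
    'n_chan': n_shanks * chans_per_shank,
}

Keeping the shanks well separated in xc should also ensure that channels from different shanks never end up grouped into the same templates.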

As for the sorting taking too long, can you please give some information about the hardware you're using? Specifically: graphics card, processor, amount of GPU and system memory, and whether you are sorting from an SSD or an HDD.

jacobpennington avatar Aug 27 '24 00:08 jacobpennington

I see, I'll try sorting them all together. Regarding hardware, we are using a GeForce GTX 1080 Ti GPU with 11 GB of memory, an Intel i7-9700 processor, and 48 GB of system memory, and we are sorting from an SSD.

Tingchen-G avatar Aug 30 '24 14:08 Tingchen-G

Also, I noticed that the final clustering step takes the longest. For a shank that took 11.5 hours to run, 13,844,472 spikes were extracted for the first clustering, but 43,478,695 spikes were extracted for the final clustering. Could the slowdown be because too many spikes are extracted for the final clustering? I'm using the defaults of 9 and 8 for Th_universal and Th_learned.

Tingchen-G avatar Aug 30 '24 14:08 Tingchen-G

One other thing to check: can you make note of how many spikes were detected for each shank? I just want to make sure it's not a case where you happened to sort the shanks with more spikes later in the loop, which would of course take longer.

Another thing you can try is increasing the cluster_downsampling parameter, which would speed up the clustering steps. With that many spikes, you don't need to use as many for some of the clustering operations.
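For example, something like this in the settings dict you pass to run_kilosort (the specific value is just a placeholder to illustrate the idea, not a tuned recommendation):

settings = {
    'n_chan_bin': 32,            # channels in one shank's binary file
    'Th_universal': 9,           # the thresholds you're already using
    'Th_learned': 8,
    'cluster_downsampling': 50,  # hypothetical value, larger than the default
}

Then pass it through run_kilosort(settings=settings, ...) as in your existing loop.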

jacobpennington avatar Sep 04 '24 15:09 jacobpennington

Sorry for the late reply! Here are the spike counts for each shank:

Shank 1: 23,946,723
Shank 2: 26,824,833
Shank 3: 40,672,509
Shank 4: 43,187,385
Shank 5: 32,859,009
Shank 6: 30,946,386
Shank 7: 26,166,955
Shank 8: 17,119,952
Shank 9: 5,001,869
Shank 10: 8,773,221
Shank 11: 22,833,448
Shank 12: 20,865,463
Shank 13: 22,793,711
Shank 14: 30,212,405
Shank 15: 27,891,315
Shank 16: 19,776,232

The spike counts vary significantly between shanks. I still suspect the loop is causing the slow runtime, because when a shank takes too long, stopping the loop, restarting the Anaconda Prompt and Kilosort, and running a new loop from that same shank onward makes it run much faster.

I'll definitely try increasing the cluster_downsampling parameter! Thanks!

Tingchen-G avatar Sep 10 '24 03:09 Tingchen-G

Thanks, still looking into this. Would it be possible for you to share the binary file and probe information for one of the shanks so that I can benchmark the memory usage in a loop? Any of the shanks with 20 million or more spikes should work. We don't have datasets of that duration available, so this would help me debug this issue and some related ones.

jacobpennington avatar Sep 19 '24 18:09 jacobpennington

Hi!

Sorry for the delay. Sure, we can share the files. May I ask how best to share the binary file? The compressed file is still too big to attach on GitHub. Here is the probe information:

''' PROBE '''
import numpy as np

chanMap = np.arange(32)
kcoords = np.zeros(32)
n_chan = 32

# X-coordinates: channels alternate between two columns 30 um apart.
xc_1_3 = np.ones(16) * 6.2
xc_2_4 = np.ones(16) * 6.2 + 30
xc = np.array([val for pair in zip(xc_1_3, xc_2_4) for val in pair])

# Y-coordinates: 30 um vertical pitch, with the two columns offset by 15 um.
yc_2_4 = np.array([15 + 6.2 + 30 * i for i in range(16)])
yc_1_3 = np.array([6.2 + 30 * k for k in range(16)])
yc = np.array([val for pair in zip(yc_1_3, yc_2_4) for val in pair])

# Set up probe dictionary in the format Kilosort expects.
probe = {
    'chanMap': chanMap,
    'xc': xc,
    'yc': yc,
    'kcoords': kcoords,
    'n_chan': n_chan
}
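And in case it is useful for benchmarking, the probe dict above is passed to Kilosort roughly like this (the file path and sampling rate here are placeholders):

from kilosort import run_kilosort

settings = {'n_chan_bin': 32, 'fs': 30000}   # placeholder sampling rate

ops, st, clu, tF, Wall, similar_templates, is_ref, est_contam_rate, kept_spikes = \
    run_kilosort(settings=settings, probe=probe, filename='shank3.bin')   # placeholder path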

Thank you!

Tingchen-G avatar Oct 11 '24 14:10 Tingchen-G

The easiest way is to upload the data to Google Drive or Dropbox, then paste a link to it here or email me the link at @.***

jacobpennington avatar Oct 11 '24 19:10 jacobpennington

I see, sure! Here is the dropbox link: https://www.dropbox.com/scl/fi/4j65b003lqp3c5umfbhf0/shank3.zip?rlkey=lqz3fuepdbv3hkl99vswuz1ha&st=h1uaym6n&dl=0

Tingchen-G avatar Oct 11 '24 21:10 Tingchen-G

Hi,

I am now running Kilosort on a new set of data of similar size, and this issue seems to be resolved! Each shank now takes around 2 hours, which is quite reasonable given our data size. I am now using Kilosort 4.0.18 and have added these lines to the end of the loop:

    # Truncate kilosort.log between iterations.
    with open('kilosort.log', 'w') as f:
        pass

    # Delete the sorting outputs so their memory can be reclaimed before the next shank.
    del ops, st, clu, tF, Wall, similar_templates, is_ref, est_contam_rate, kept_spikes
    del camps, contam_pct, templates, chan_best, amplitudes, firing_rates, dshift

Thank you for your help!

Tingchen-G avatar Oct 13 '24 17:10 Tingchen-G