spikeinterface
spikeinterface copied to clipboard
Error while running dockerized SpykingCircus
@alejoe91 @yger Running
sorting = ss.run_spykingcircus(recording=recording,
output_folder=f"/hdd/kampff/raw/Recordings/{dataset_id}/spykingcircus",
docker_image=True,
**sorter_params)
returns
Traceback (most recent call last):
File "/home/kyu/miniconda3/envs/di_val/lib/python3.9/site-packages/spikeinterface/sorters/runsorter.py", line 564, in run_sorter_container
sorting = SorterClass.get_result_from_folder(output_folder)
File "/home/kyu/miniconda3/envs/di_val/lib/python3.9/site-packages/spikeinterface/sorters/basesorter.py", line 281, in get_result_from_folder
sorting = cls._get_result_from_folder(output_folder)
File "/home/kyu/miniconda3/envs/di_val/lib/python3.9/site-packages/spikeinterface/sorters/spyking_circus/spyking_circus.py", line 169, in _get_result_from_folder
sorting = SpykingCircusSortingExtractor(folder_path=Path(output_folder) / 'recording')
File "/home/kyu/miniconda3/envs/di_val/lib/python3.9/site-packages/spikeinterface/extractors/spykingcircusextractors.py", line 67, in __init__
raise Exception(spykingcircus_folder, " is not a spyking circus folder")
Exception: (PosixPath('/hdd/kampff/raw/Recordings/c27/spykingcircus/recording'), ' is not a spyking circus folder')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/kyu/repos/deepinterpolation_validation/processing/sort_di_sc.py", line 52, in <module>
main()
File "/home/kyu/repos/deepinterpolation_validation/processing/sort_di_sc.py", line 30, in main
sorting = ss.run_spykingcircus(recording=recording,
File "/home/kyu/miniconda3/envs/di_val/lib/python3.9/site-packages/spikeinterface/sorters/runsorter.py", line 679, in run_spykingcircus
return run_sorter('spykingcircus', *args, **kwargs)
File "/home/kyu/miniconda3/envs/di_val/lib/python3.9/site-packages/spikeinterface/sorters/runsorter.py", line 135, in run_sorter
return run_sorter_container(
File "/home/kyu/miniconda3/envs/di_val/lib/python3.9/site-packages/spikeinterface/sorters/runsorter.py", line 571, in run_sorter_container
sorting = NpzSortingExtractor.load_from_folder(npz_sorting_path)
File "/home/kyu/miniconda3/envs/di_val/lib/python3.9/site-packages/spikeinterface/core/base.py", line 565, in load_from_folder
return BaseExtractor.load(folder)
File "/home/kyu/miniconda3/envs/di_val/lib/python3.9/site-packages/spikeinterface/core/base.py", line 561, in load
raise ValueError('spikeinterface.Base.load() file_path must be an existing folder or file')
ValueError: spikeinterface.Base.load() file_path must be an existing folder or file
@khl02007 can you heck the log
file in the output folder?
here are the contents of spikeinterface_log.json
{
"sorter_name": "spykingcircus",
"sorter_version": "1.1.0",
"datetime": "2022-08-21T00:35:46.490310",
"runtime_trace": [
"",
"##################################################################",
"##### Welcome to the SpyKING CIRCUS #####",
"##### (1.1.0) #####",
"##### Written by P.Yger and O.Marre #####",
"##################################################################",
"",
"",
"File : /hdd/kampff/raw/Recordings/c27/spykingcircus/recording.npy",
"Steps : filtering, whitening, clustering, fitting, merging",
"Number of CPU : 18/36",
"Parallel HDF5 : False",
"Shared memory : True",
"Hostfile : /root/spyking-circus/circus.hosts",
"",
"##################################################################",
"",
"",
"------------------------- Informations -------------------------",
"| Number of recorded channels : 384",
"| Number of analyzed channels : 384",
"| File format : NUMPY",
"| Data type : float32",
"| Sampling rate : 30000 Hz",
"| Duration of the recording : 13 min 30 s 34 ms",
"| Width of the templates : 3 ms",
"| Spatial radius considered : 100 um",
"| Threshold crossing : negative",
"------------------------------------------------------------------",
"------------------------- Informations -------------------------",
"| Using only 18 out of 36 local CPUs available (-c to change)",
"------------------------------------------------------------------",
"\u001b[37mAnalyzing data to get whitening matrices and thresholds...",
"\u001b[37mFound 8.0906s to compute the whitening matrix...",
"\u001b[37mBecause of whitening, need to recompute the thresholds...",
"\u001b[37mSearching spikes to construct the PCA basis...",
"",
"0%| |[00:00<?, ?it/s]",
"17%|\u2588\u2588\u2588\u2588\u2588\u2588 |[00:53<04:29, 53.90s/it]",
"33%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588 |[00:58<01:40, 25.12s/it]",
"100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588|[00:58<00:00, 9.81s/it]",
"\u001b[37mFound 307188 waveforms over 307188 requested",
"\u001b[33m------------------------- Informations -------------------------",
"\u001b[33m| A basis with 5 dimensions has been built",
"\u001b[33m------------------------------------------------------------------",
"\u001b[37m",
"\u001b[37mSearching isolated random spikes to sample amplitudes...",
"",
"0%| |[00:00<?, ?it/s]",
"17%|\u2588\u2588\u2588\u2588\u2588\u2588 |[00:11<00:57, 11.41s/it]",
"33%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588 |[00:21<00:41, 10.49s/it]",
"50%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588 |[01:26<01:46, 35.55s/it]",
"67%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588 |[01:36<00:51, 25.50s/it]",
"83%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588 |[01:46<00:19, 19.97s/it]",
"100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588|[02:33<00:00, 29.12s/it]",
"100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588|[02:33<00:00, 25.63s/it]",
"\u001b[37mFound 1101429 spikes over 3071988 requested",
"\u001b[37mEstimating amplitudes distributions...",
"\u001b[37mSmart Search of good isolated spikes for the clustering (1/3)...",
"",
"0%| |[00:00<?, ?it/s]",
"17%|\u2588\u2588\u2588\u2588\u2588\u2588 |[00:13<01:07, 13.41s/it]",
"33%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588 |[00:42<01:31, 22.91s/it]",
"50%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588 |[00:56<00:55, 18.45s/it]",
"67%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588 |[01:08<00:32, 16.22s/it]",
"83%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588 |[01:21<00:14, 14.88s/it]",
"100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588|[01:34<00:00, 14.26s/it]",
"100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588|[01:34<00:00, 15.75s/it]",
"\u001b[37mFound 1100580 isolated spikes over 3071988 requested (1639 rejected)",
"",
"===================================================================================",
"= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES",
"= PID 294 RUNNING AT d12231b93380",
"= EXIT CODE: 135",
"= CLEANING UP REMAINING PROCESSES",
"= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES",
"===================================================================================",
"YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)",
"This typically refers to a problem with your application.",
"Please see the FAQ page for debugging suggestions",
"---------------------------- Error -----------------------------",
"| Step \"clustering\" failed for reason Command '['mpiexec', '-np', '18', 'spyking-circus-subtask', 'clustering', '/hdd/kampff/raw/Recordings/c27/spykingcircus/recording.npy', '18', '0', 'False', 'False']' returned non-zero exit status 143.!",
"------------------------------------------------------------------"
],
"error": false,
"run_time": 455.466142358011
}
Hi It looks indeed like a crash of SC, without proper warnings/reasons. What is the amount of RAM that you have on the machine? Can you monitor a bit the usage during the run?
@yger I have 64 GB of RAM. The recording is ~ 19 GB (~ 14 min, Neuropixels 1.0). I got this error when running with num_workers=-1
(default). When I changed this to num_workers=1
, sorting finishes without the error - so maybe it is a problem with RAM running out. But it is very slow (~4 hrs to sort). I will monitor the memory usage more and report back, but meanwhile do you see a way to speed things up?
This can be similar to #1034, but as said, I think that such a small recording should work fine even with 32Gb of RAM, so I am surprised and puzzled. When you are launching it with several cores, is the code faster at least? Because could be that MPI is not well installed. What do you have when you do, in a Python terminal
import mpi4py from mpi4py import MPI MPI.get_vendor()
@yger I agree that the problem is probably related to MPI but I'm not sure how to fix it. It throws an error when more than 1 worker is used. This happens regardless of whether I use dockerized SC or not.
Here is the output:
In [1]: import mpi4py
In [2]: from mpi4py import MPI
In [3]: MPI.get_vendor()
Out[3]: ('MPICH', (4, 0, 1))
Here is the error:
##################################################################
##### Welcome to the SpyKING CIRCUS #####
##### (1.1.0) #####
##### Written by P.Yger and O.Marre #####
##################################################################
File : /hdd/kampff/raw/Recordings/c31/spykingcircus_di/recording.npy
Steps : filtering, whitening, clustering, fitting, merging
Number of CPU : 18/36
Parallel HDF5 : False
Shared memory : True
Hostfile : /root/spyking-circus/circus.hosts
##################################################################
------------------------- Informations -------------------------
| Number of recorded channels : 384
| Number of analyzed channels : 384
| File format : NUMPY
| Data type : float32
| Sampling rate : 30000 Hz
| Duration of the recording : 13 min 30 s 34 ms
| Width of the templates : 3 ms
| Spatial radius considered : 100 um
| Threshold crossing : negative
------------------------------------------------------------------
------------------------- Informations -------------------------
| Using only 18 out of 36 local CPUs available (-c to change)
------------------------------------------------------------------
[37mAnalyzing data to get whitening matrices and thresholds...
[37mFound 8.0894s to compute the whitening matrix...
[37mBecause of whitening, need to recompute the thresholds...
[37mSearching spikes to construct the PCA basis...
0%| |[00:00<?, ?it/s]
17%|██████ |[00:29<02:25, 29.18s/it]
100%|████████████████████████████████████|[00:29<00:00, 4.87s/it]
[37mFound 307188 waveforms over 307188 requested
[33m------------------------- Informations -------------------------
[33m| A basis with 5 dimensions has been built
[33m------------------------------------------------------------------
[37m
[37mSearching isolated random spikes to sample amplitudes...
0%| |[00:00<?, ?it/s]
17%|██████ |[00:31<02:35, 31.11s/it]
33%|████████████ |[00:57<01:52, 28.14s/it]
50%|██████████████████ |[02:04<02:18, 46.23s/it]
67%|████████████████████████ |[02:29<01:15, 37.86s/it]
83%|██████████████████████████████ |[02:56<00:33, 33.60s/it]
100%|████████████████████████████████████|[03:59<00:00, 43.87s/it]
100%|████████████████████████████████████|[03:59<00:00, 39.97s/it]
[37mFound 1800234 spikes over 3071988 requested
[37mEstimating amplitudes distributions...
[37mSmart Search of good isolated spikes for the clustering (1/3)...
0%| |[00:00<?, ?it/s]
17%|██████ |[00:35<02:55, 35.17s/it]
33%|████████████ |[01:08<02:17, 34.30s/it]
50%|██████████████████ |[01:43<01:42, 34.33s/it]
67%|████████████████████████ |[02:14<01:06, 33.17s/it]
83%|██████████████████████████████ |[02:52<00:34, 34.82s/it]
100%|████████████████████████████████████|[03:28<00:00, 35.39s/it]
100%|████████████████████████████████████|[03:28<00:00, 34.81s/it]
[37mFound 1777148 isolated spikes over 3071988 requested (264190 rejected)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 307 RUNNING AT db5af9fb9f43
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
---------------------------- Error -----------------------------
| Step "clustering" failed for reason Command '['mpiexec', '-np', '18', 'spyking-circus-subtask', 'clustering', '/hdd/kampff/raw/Recordings/c31/spykingcircus_di/recording.npy', '18', '0', 'False', 'False']' returned non-zero exit status 143.!
------------------------------------------------------------------
@yger Continued from #1034 :
I tested the MPI version on my LAB computer as you instructed. This compupter has windows operating system.
Microsoft MPI , (10, 1, 0)
Sorting in this computer worked without problem.
In the Linux server in which I had the problem :
'MPICH', (4, 0, 1)
Consequently I feel the problem is not necessarily from the amount of RAM, but is from MPICH. I intially thought the amount of RAM (64GB) in my LAB computer is more than in the server, but I'm not sure about it. When I type the 'free' command in linux shell I see a lot more memory available. The reason I pointed the amount of available RAM is that I don't know the result of this command would give all the hard-disk space or only the RAM.
But if possible, I would like to test-run spyking-circus by Open-MPI on the server : If it runs without error, it means that not the RAM, but MPICH is the source of the problem. How can I do that ?
@yger @alejoe91
I have some new insights that I want to share. On my side, this problem is resolved !
As it turns out, it was not the amount of RAM or the MPI vendor that caused the problem. The cause of the problem was non-optimal choice of the number of parallel threads of execution (CPU core threads).
I performed some benchmarking tests (below) that brought some strange, but very exciting results. As I had mentioned before, the sorting of the same file ran without problem in the LAB workstation. I had initially pointed to the amount of RAM in that computer. But the importance was it's number of threads (or cores) : it has in total 20 CPU threads.
The server in my institute (at which I encountered this issue) has 144 parallel threads of execution (2 threads per core , 72 cores : 4 sockets * 18 cores per socket).
Below, I compare the spykingcircus run time while selecting different number of threads at each sorting trial in the server. The number of threads can be chosen by spikeinterface parameter (num_workers) :
num_workers : 10 => spykingcircus run time : 3147 s
num_workers : 20 => spykingcircus run time : 1674 s
num_workers : 40 => spykingcircus run time : 2069 s
num_workers : 60 => spykingcircus run time : 15053 s
num_workers : 72 => spykingcircus crashes !
num_workers : 144 => spykingcircus crashes !
My previous mindset was that the more the number of CPUs selected, the higher is the speed. Hence I selected the maximum number of CPUs in the server. This caused the program to crash.
As you see above, there is an optimal number of threads to choose (here 20), higher or lower than that which would decrease the speed, & even when very high would crash the program. At num_workers : 60 , the amount of time needed for running the program exponentially grows.
I Googled about this phenomenon & found some technical reasons about it. In fact, one must define an optimal number of threads while doing parallel processing.
Here I mention some of the sources that had delighted me :
https://superuser.com/questions/664550/optimum-number-of-threads-while-multitasking https://www.quora.com/What-is-the-optimal-number-of-threads-on-a-multicore-CPU-to-get-the-best-performance-when-executing-parallel-code https://pavelkazenin.wordpress.com/2014/08/02/optimal-number-of-threads-in-parallel-computing/
My suggestion is that in documentation of spikeinterface and spykingcircus we would add a suggestion to the user that guides them to benchmark the speed of execution of the program to find the condition with the highest speed. And that if they encounter crashes, one reason may be too many threads running in parallel.