
Request: timeout in `ChunkRecordingExecutor` (ProcessPoolExecutor)

Open · miketrumpis opened this issue 2 years ago • 7 comments

I've been hitting stalls in the ProcessPoolExecutor when creating some WaveformExtractor objects. Unfortunately, I haven't been able to identify any factors that would help debug this. However, I made a quick fork from v0.94.0 that adds a timeout to the map call in the ProcessPoolExecutor, which at least raises an exception instead of hanging forever.

Very trivial changes. Happy to rebase this and open a PR.

https://github.com/miketrumpis/spikeinterface/blob/multiproc/spikeinterface/core/job_tools.py

miketrumpis avatar Jul 12 '22 14:07 miketrumpis
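
For context, the standard-library Executor.map already accepts a timeout argument; iterating its results then raises concurrent.futures.TimeoutError once the timeout (measured from the map call) expires, rather than blocking forever. A minimal standalone sketch of that mechanism (not the actual fork diff linked above):

```python
from concurrent.futures import ProcessPoolExecutor, TimeoutError
import time


def process_chunk(chunk_index):
    # Stand-in for the per-chunk work that ChunkRecordingExecutor dispatches.
    time.sleep(0.1)
    return chunk_index


if __name__ == "__main__":  # guard required when the start method is spawn
    with ProcessPoolExecutor(max_workers=4) as executor:
        try:
            # timeout is measured from the call to map(); iterating the
            # results raises TimeoutError instead of hanging indefinitely.
            results = list(executor.map(process_chunk, range(100), timeout=60))
        except TimeoutError:
            print("chunk processing timed out after 60 s")
```
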

Hi @miketrumpis

Thanks for the report. Can you point to the changes that you made?

Cheers Alessio

alejoe91 avatar Jul 13 '22 11:07 alejoe91

I see now you added a timeout!! Do you have any idea why the parallel process is failing? What OS are you using? And what is the input recording (e.g., any fancy preprocessing)?

alejoe91 avatar Jul 13 '22 12:07 alejoe91

Pretty sure I've only seen it on Ubuntu 20.04.4. Correct me if I'm wrong, but I believe version 0.94 only uses spawn for multiprocessing.

The stalling scenario is either using a basic WaveformExtractor or a custom extension, in the extract_waveforms_to_buffers stage. The preprocessing stages are channel selection and CAR, nothing too intensive.

I've wondered whether it's my extensions that are abusing the process executor, but the logic is largely inherited, and I'm only setting the sparsity matrix to narrow the output.

The stalling can definitely happen when multiple independent processes are each spawning the process executors. I believe it can happen in a single process too, but I'm less certain of that.

Not sure it's relevant, but curious if anyone else sees resource warnings, as reported in this Python bug? https://github.com/python/cpython/issues/90549 (Note that I see this on both macOS and Linux, since the previous SpikeInterface release prefers spawn.)

diffs https://github.com/SpikeInterface/spikeinterface/compare/master...miketrumpis:spikeinterface:multiproc

miketrumpis avatar Jul 13 '22 13:07 miketrumpis

> Pretty sure I've only seen it on Ubuntu 20.04.4. Correct me if I'm wrong, but I believe version 0.94 only uses spawn for multiprocessing.

Actually it uses loky (the default on Ubuntu). On Windows the default is spawn. We have a new job argument called mp_context. Can you try to run the parallel processing with the additional mp_context="spawn" argument (see the sketch after this comment)?

> The stalling scenario is either using a basic WaveformExtractor or a custom extension, in the extract_waveforms_to_buffers stage. The preprocessing stages are channel selection and CAR, nothing too intensive.

> I've wondered whether it's my extensions that are abusing the process executor, but the logic is largely inherited, and I'm only setting the sparsity matrix to narrow the output.

Can I ask which extensions? If you have something cool in mind, I suggest opening an issue or a draft PR and we can definitely provide support :)

> The stalling can definitely happen when multiple independent processes are each spawning the process executors. I believe it can happen in a single process too, but I'm less certain of that.

> Not sure it's relevant, but curious if anyone else sees resource warnings, as reported in this Python bug? python/cpython#90549 (Note that I see this on both macOS and Linux, since the previous SpikeInterface release prefers spawn.)

> diffs master...miketrumpis:spikeinterface:multiproc

Thanks for the diffs!

alejoe91 avatar Jul 13 '22 13:07 alejoe91
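
For reference, a minimal sketch of passing mp_context="spawn" along with the other job kwargs to extract_waveforms, assuming a spikeinterface version where the mp_context job argument is available; the toy_example data and the exact keyword names like chunk_duration are only illustrative:

```python
import spikeinterface as si
import spikeinterface.extractors as se

# Toy data just to make the sketch self-contained; in practice use your
# own preprocessed recording and sorting.
recording, sorting = se.toy_example(duration=10, num_channels=4, seed=0)

we = si.extract_waveforms(
    recording,
    sorting,
    folder="waveforms",
    n_jobs=4,
    chunk_duration="1s",
    progress_bar=True,
    mp_context="spawn",  # force the spawn start method instead of the platform default (fork on Linux)
)
```
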

I will try to run against git main soon and play with the multiprocessing context. I have a list of jobs that have failed, but I'm not 100% sure the failure mode is deterministic. Unfortunately the extensions are in a private repo for my organization 😬

miketrumpis avatar Jul 13 '22 19:07 miketrumpis

@alejoe91: we do not use loky. It was too buggy. The ProcessPoolExecutor is from the Python standard library.

@miketrumpis: I am not sure that this timeout trick will be sustainable; it is very hard to predict how long the computation should take. In my experience, when it hangs forever it is because a chunk internally puts a worker in a bad state for some strange reason, but the error is not propagated to the main process. The best approach in that case is to use n_jobs=1 and track down why a chunk triggers the bug.

samuelgarcia avatar Jul 13 '22 20:07 samuelgarcia
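
To make the n_jobs=1 suggestion concrete: with a single job the chunks run serially in the main process, so an error inside a chunk propagates with a full traceback instead of leaving a worker hung. A hedged sketch, with recording and sorting standing in for the objects from the failing pipeline:

```python
import spikeinterface as si

# n_jobs=1 processes the chunks serially in the main process, so any
# exception raised inside a chunk surfaces immediately with a traceback.
debug_job_kwargs = dict(n_jobs=1, chunk_duration="1s", progress_bar=True)

we = si.extract_waveforms(
    recording,   # the same preprocessed recording that hangs with n_jobs > 1
    sorting,
    folder="waveforms_debug",
    **debug_job_kwargs,
)
```
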

> @alejoe91: we do not use loky. It was too buggy. The ProcessPoolExecutor is from the Python standard library.

I was taking another look at this too, so I presume that uses fork on Linux and spawn on macOS. Another reason to think spawn might change the behavior.

> @miketrumpis: I am not sure that this timeout trick will be sustainable; it is very hard to predict how long the computation should take. In my experience, when it hangs forever it is because a chunk internally puts a worker in a bad state for some strange reason, but the error is not propagated to the main process. The best approach in that case is to use n_jobs=1 and track down why a chunk triggers the bug.

That's a fair point; the way I wrote it does not allow for a None default (the current behavior). Still, it would be nice to have the option of a timeout when requested explicitly, e.g. in the parameters to WaveformExtractor (sketched below).

I will try your n_jobs suggestion next time I see the problem.

miketrumpis avatar Jul 13 '22 21:07 miketrumpis
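
A rough sketch of the opt-in variant discussed above, using a hypothetical job_timeout parameter (not an existing spikeinterface argument) that defaults to None to preserve the current wait-forever behavior:

```python
from concurrent.futures import ProcessPoolExecutor, TimeoutError


def run_chunks(func, chunk_args, n_jobs=4, job_timeout=None):
    """Hypothetical helper: job_timeout=None keeps today's wait-forever
    behavior, while a number of seconds turns a hang into a TimeoutError."""
    with ProcessPoolExecutor(max_workers=n_jobs) as executor:
        # Executor.map accepts timeout=None (wait indefinitely) or a number
        # of seconds measured from the map() call.
        return list(executor.map(func, chunk_args, timeout=job_timeout))
```
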