
kill engines by pid & client.shutdown blocks

dereneaton opened this issue on Feb 11, 2017 · 2 comments

Hey @minrk and folks, in issue #141 you recommend using os.kill to interrupt engines by pid. I've been trying this approach; however, it seems that once engines are killed by pid, you can no longer stop the ipcluster instance with client.shutdown without it blocking indefinitely. Is there a better way? I'd like to be able to interrupt all running jobs if, for example, a KeyboardInterrupt is raised, and then still be able to shut down the ipcluster instance remotely.

Here is an example where I run this first block of code followed by either of the following two code blocks:

import ipyparallel as ipp
import signal
import time
import os

## open client view
client = ipp.Client()

## get the pid of each engine
engine_pids = client[:].apply(os.getpid).get_dict()

If I interrupt the engines then shutdown blocks indefinitely:

## submit a job to run for a while
## (note: 'async' is a reserved word in Python 3.7+, so use another name)
for i in range(4):
    ar = client[i].apply(time.sleep, 30)

## interrupt each engine job before it finishes
for eng in engine_pids:
    os.kill(engine_pids[eng], signal.SIGINT)

## kill the ipcluster instance
client.shutdown(hub=True, block=False)

Or if I just call shutdown without interrupting the engines, it doesn't block indefinitely, but it still blocks for the full 30 seconds, regardless of whether I pass block=False.

## submit a job to run for a while
## (note: 'async' is a reserved word in Python 3.7+, so use another name)
for i in range(4):
    ar = client[i].apply(time.sleep, 30)

## soft shutdown
client.shutdown(hub=True, block=False)

dereneaton avatar Feb 11 '17 19:02 dereneaton

It must be trying to send a shutdown request to the down engine. There's likely some bookkeeping that's not correctly handling that the engine is gone.

minrk avatar Feb 20 '17 20:02 minrk

Having worked for a bit on #514 now, I think I understand the various issues at play here.

For the second case, this is behaving as intended: a shutdown request is queued because it is a polite request to shut down. It does not interrupt engines that are working on a task; they receive and process the shutdown request once their pending computations finish, and then shut down cleanly.

The first case I believe is a race: os.kill is shutting down engines, but the client.shutdown is sending a polite request to the same engines. This is because the client doesn't know that those engines are gone yet - it takes a finite amount of time for the client to be notified of engines shutting down. If you wait for client.ids to be empty before calling client.shutdown(hub=True), it ought to work.
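That wait can be written as a small polling loop. A minimal sketch; the `wait_until` helper, its timeout values, and the commented usage are illustrative, not part of the ipyparallel API:

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.1):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns the final value of `predicate()` so callers can tell
    whether the condition was actually met or the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()

# With a live client, the idea would be (hypothetical usage):
#   wait_until(lambda: len(client.ids) == 0)
#   client.shutdown(hub=True)
```

This avoids the race: the client only sends the hub shutdown request once it has been notified that all engines are gone.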

There are at least two bugs in this case:

  1. requests addressed to an engine should fail with 'engine died' when notification of the engine's death arrives after the request has been sent out; currently this case is not handled
  2. it should be possible to shut down just the hub, but client.shutdown currently unconditionally sets targets='all' when hub is True

If you do want a forceful shutdown, the new Cluster API provides this, terminating engines immediately with a signal.
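The forceful path amounts to delivering a signal directly to each engine process rather than queueing a request. A self-contained sketch of that mechanism, using a plain subprocess as a hypothetical stand-in for an engine (real engines are started and signalled by the Cluster API itself):

```python
import signal
import subprocess
import sys
import time

# Stand-in for a busy engine: a child process that would run for a long time.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])

time.sleep(0.5)                    # give the child a moment to start
proc.send_signal(signal.SIGTERM)   # forceful termination via a signal
returncode = proc.wait(timeout=5)

# On POSIX, a process killed by a signal reports the negated signal
# number as its return code (typically -15 for SIGTERM).
print(returncode)
```

The key difference from client.shutdown is that the signal interrupts the process immediately, without waiting for pending tasks to finish.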

minrk avatar Jul 05 '21 09:07 minrk