
Any additional parameter when dispatching large jobs?

Open bertocast opened this issue 8 years ago • 6 comments

Hello folks,

I'm dispatching jobs which can take several days to finish. I was wondering if I have to start the nodes with any additional parameter (a timeout, maybe?) to ensure that, when the jobs finish, the scheduler will still be there to collect them.

Best,

Alberto.

bertocast avatar Feb 23 '17 08:02 bertocast

If there are no network interruptions, it should all work. Even if network issues arise, dispy can redistribute computations when the reentrant=True option is given (provided the computations can be abandoned and re-executed with the same arguments elsewhere). And if the client / scheduler crashes, the nodes will still finish their scheduled jobs; the results of those jobs can be retrieved later with the dispy.recover_jobs function.
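A minimal client-side sketch of the options mentioned above, assuming a standard dispy installation with dispynode running on the machines (the recover-file name `'_dispy_recover'` and the 60-job loop are illustrative choices, not from the thread); the function is only defined here, since running it needs a live cluster:

```python
def run_cluster():
    """Sketch: submit long-running jobs with crash recovery enabled.

    Requires dispy installed and dispynode running on the nodes;
    '_dispy_recover' is an arbitrary file name for this example.
    """
    import dispy

    def compute(n):
        import time
        time.sleep(n)  # stand-in for a days-long computation
        return n

    # reentrant=True lets dispy resubmit a job on another node if its node
    # is lost; recover_file names the state file that dispy.recover_jobs()
    # reads after a client / scheduler crash.
    cluster = dispy.JobCluster(compute, reentrant=True,
                               recover_file='_dispy_recover')
    jobs = [cluster.submit(i) for i in range(60)]
    results = [job() for job in jobs]  # job() waits and returns the result
    cluster.close()
    return results

# After a client crash, results of jobs the nodes finished can be read with:
#   jobs = dispy.recover_jobs('_dispy_recover')
```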

pgiri avatar Feb 23 '17 23:02 pgiri

Any updates on this?

pgiri avatar Mar 05 '17 20:03 pgiri

Sorry for the late answer.

I have finally obtained results. However, I have some concerns.

I dispatched 60 jobs over a 64-core cluster (4 computers with 16 cores each).

On my local computer, one of those jobs takes five and a half days to finish. I expected that to be roughly the total time over the cluster, since all 60 jobs were running at once. However, it took about three weeks to complete.

I know that running 16 tasks in parallel on one machine uses more resources than a single job, but it should not slow the process down that much.

Am I doing something wrong?

bertocast avatar Mar 06 '17 08:03 bertocast

Can you run the 'sample.py' example with 60 jobs and see how long it takes and how the jobs execute? Once jobs are completed, you can see the start time and end time of execution, as well as the IP address of the node where each job executed, for diagnostic purposes. You can also monitor cluster / computation progress in a browser (see 'httpd_example.py').
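The per-job timing fields referred to above are attributes of the `DispyJob` objects that `cluster.submit()` returns; a sketch of such a diagnostic loop (the function is only defined here, since it needs jobs from a real cluster) might look like:

```python
def report_jobs(jobs):
    """Print where and for how long each finished DispyJob ran.

    `jobs` is the list of completed DispyJob objects collected from
    cluster.submit(); start_time / end_time are time.time() timestamps.
    """
    for job in jobs:
        elapsed = job.end_time - job.start_time  # wall-clock seconds
        print('job %s on %s: %.1f s, result=%r'
              % (job.id, job.ip_addr, elapsed, job.result))
```

Comparing `elapsed` across nodes should show whether some machines run the jobs much slower than others.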

pgiri avatar Mar 06 '17 13:03 pgiri

Can you update if you have been able to get it to work as expected?

pgiri avatar Apr 01 '17 19:04 pgiri

I have been having some problems using dispy over the last few days.

Everything was working as expected, but now I am not able to compute anything (not even the canonical computation example):

```
error: [Errno 32] Insufficient data: 0 / 4
errorTraceback (most recent call last)
<ipython-input-36-b7bab51ec3a2> in <module>()
     10     import dispy, random
     11     # distribute 'compute' to nodes; 'compute' does not have any dependencies (needed from client)
---> 12     cluster = dispy.SharedJobCluster(compute, scheduler_node='XXXX')
     13     # run 'compute' with 20 random numbers on available CPUs
     14     jobs = []
/usr/local/lib/python2.7/site-packages/dispy/__init__.pyc in __init__(self, computation, nodes, depends, callback, cluster_status, ip_addr, port, scheduler_node, scheduler_port, ext_ip_addr, loglevel, setup, cleanup, dest_path, poll_interval, reentrant, exclusive, secret, keyfile, certfile, recover_file)
   2656                'scheduler_ip_addr': self.scheduler_ip_addr}
   2657         sock.send_msg('CLIENT:' + serialize(req))
-> 2658         reply = sock.recv_msg()
   2659         sock.close()
   2660         reply = deserialize(reply)
/usr/local/lib/python2.7/site-packages/asyncoro/__init__.pyc in _sync_recv_msg(self)
    894                 raise
    895         if len(data) != n:
--> 896             raise socket.error(errno.EPIPE, 'Insufficient data: %s / %s' % (len(data), n))
    897         n = struct.unpack('>L', data)[0]
    898         # assert n >= 0
error: [Errno 32] Insufficient data: 0 / 4
```

Note that the scheduler IP is hidden for security reasons.
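The "Insufficient data: 0 / 4" error means the client read zero bytes where it expected a reply, i.e. the initial exchange with the scheduler never completed. Before digging into dispy itself, a plain-stdlib check that the scheduler's host and port accept TCP connections can rule out basic reachability problems; this sketch demonstrates the check against a throwaway local listener standing in for the scheduler (the real host and port to test are whatever dispyscheduler is configured with):

```python
import socket
import threading

def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Throwaway local listener standing in for the scheduler.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))  # let the OS pick a free port
server.listen(1)
host, port = server.getsockname()
threading.Thread(target=server.accept, daemon=True).start()

print(port_reachable(host, port))  # True: the listener is up
server.close()
print(port_reachable(host, port))  # False: nothing listening any more
```

If the real scheduler port is reachable but the error persists, the next things to compare are dispy versions and `secret` / certificate settings on the client and scheduler sides.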

As soon as I can use it again I will run it and update the status of this issue.

Sorry for any inconvenience I may have caused.

bertocast avatar Apr 05 '17 09:04 bertocast