dispy
Any additional parameter when dispatching large jobs?
Hello folks,
I'm dispatching jobs that can take several days to finish. I was wondering whether I have to start the nodes with any additional parameter (a timeout, maybe?) to ensure that, when the jobs finish, the scheduler will still be there to collect them.
Best,
Alberto.
If there are no network interruptions, it should all work. Even if network issues arise, dispy can redistribute computations if the reentrant=True option is given (provided the computations can be abandoned and executed with the same arguments elsewhere). And in case the client / scheduler crashes, the nodes will still finish the scheduled jobs; the results of those jobs can be retrieved later with the dispy.recover_jobs function.
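A minimal sketch of how those two options fit together (the compute function, sleep durations, and the recover file name are illustrative stand-ins, not part of the original question):

def compute(n):
    # stand-in for a job that runs for days
    import time
    time.sleep(n)
    return n

if __name__ == '__main__':
    import dispy
    # reentrant=True lets dispy resubmit a job elsewhere if its node is
    # lost mid-run; recover_file names the file that records enough state
    # to retrieve results after a client / scheduler crash
    cluster = dispy.JobCluster(compute, reentrant=True,
                               recover_file='dispy_recover.dat')
    jobs = [cluster.submit(i) for i in range(60)]
    for job in jobs:
        print(job())  # waits for each job and returns its result
    cluster.close()

If the client crashes instead, a later session can still collect whatever the nodes finished with jobs = dispy.recover_jobs('dispy_recover.dat').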
Any updates on this?
Sorry for the late answer.
I have finally obtained results. However, I have some concerns.
I dispatched 60 jobs over a 64-core cluster (4 computers with 16 cores each), so all 60 jobs could run at once.
On my local computer, one of those jobs takes five and a half days to finish. I expected that to be roughly the total time over the cluster, since the 60 jobs were running simultaneously. However, it took about 3 weeks to complete.
I know that running 16 tasks in parallel uses more resources than running a single job, but that shouldn't slow the process down that much.
Am I doing something wrong?
Can you run the 'sample.py' job with 60 jobs and see how long it takes / how the jobs execute? Once jobs are completed, you can see each job's start time and end time of execution, as well as the IP address of the node where it executed, etc., for diagnostic purposes. You can also monitor cluster / computation progress in a browser (see 'httpd_example.py').
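For reference, a sketch of collecting those per-job diagnostics, assuming a placeholder compute function (the DispyJob attributes ip_addr, start_time and end_time are filled in once a job finishes; the web monitor serves on dispy's default port, 8181):

def compute(n):
    import time
    time.sleep(n)
    return n

if __name__ == '__main__':
    import dispy, dispy.httpd
    cluster = dispy.JobCluster(compute)
    # browser-based monitor, at http://localhost:8181 by default
    http_server = dispy.httpd.DispyHTTPServer(cluster)
    jobs = []
    for i in range(60):
        job = cluster.submit(i)
        job.id = i  # attach our own identifier for reporting
        jobs.append(job)
    for job in jobs:
        job()  # wait for the job to finish
        print('job %s ran on %s for %.1f sec' %
              (job.id, job.ip_addr, job.end_time - job.start_time))
    cluster.print_status()
    http_server.shutdown()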
Can you let us know whether you have been able to get it to work as expected?
I have been having some problems using dispy over the last few days.
Everything was working as expected, but now I am not able to compute anything (not even the canonical computation example).
error                                     Traceback (most recent call last)
<ipython-input-36-b7bab51ec3a2> in <module>()
10 import dispy, random
11 # distribute 'compute' to nodes; 'compute' does not have any dependencies (needed from client)
---> 12 cluster = dispy.SharedJobCluster(compute, scheduler_node='XXXX')
13 # run 'compute' with 20 random numbers on available CPUs
14 jobs = []
/usr/local/lib/python2.7/site-packages/dispy/__init__.pyc in __init__(self, computation, nodes, depends, callback, cluster_status, ip_addr, port, scheduler_node, scheduler_port, ext_ip_addr, loglevel, setup, cleanup, dest_path, poll_interval, reentrant, exclusive, secret, keyfile, certfile, recover_file)
2656 'scheduler_ip_addr': self.scheduler_ip_addr}
2657 sock.send_msg('CLIENT:' + serialize(req))
-> 2658 reply = sock.recv_msg()
2659 sock.close()
2660 reply = deserialize(reply)
/usr/local/lib/python2.7/site-packages/asyncoro/__init__.pyc in _sync_recv_msg(self)
894 raise
895 if len(data) != n:
--> 896 raise socket.error(errno.EPIPE, 'Insufficient data: %s / %s' % (len(data), n))
897 n = struct.unpack('>L', data)[0]
898 # assert n >= 0
error: [Errno 32] Insufficient data: 0 / 4
Note that the scheduler IP is hidden for security reasons.
As soon as I can use it again, I will run it and update the status of this issue.
Sorry for any inconvenience I may have caused.