Hang after map_sync is called in loop for a number of iterations
The following script hangs after roughly 180 iterations on my machine. The number of iterations it gets through seems to depend on the elapsed time (hence the sleep).
import numpy as np
import ipyparallel as ipp

c = ipp.Client()
dv = c[:]
dv.execute("import time")   # make time/numpy available on the engines
dv.execute("import numpy as np")
v = c.load_balanced_view()

def fun(state):
    time.sleep(0.5)
    return np.random.normal()

# Generation loop
for it in range(1000):
    inputs = range(12)
    # v = c.load_balanced_view()
    outputs = np.array(v.map_sync(fun, inputs))
    print("Iteration", it)
Windows 10, Anaconda installation
ipyparallel 6.2.4 py27_0
The cluster is started via ipcluster start -n 4 in PowerShell.
The same problem is observed with both Python 2.7 and Python 3.7.
Tried both load-balanced and direct views.
There are no error messages, and the PowerShell window running ipcluster becomes unresponsive.
I have seen exactly the same behavior. Does anyone have an idea?
I can also confirm the same failure of ipyparallel, using the exact code posted above. I am uncertain how to troubleshoot this. Help?
Confirmed on:
Windows 10, ipyparallel 6.2.4, Python 3.7.6
The first run freezes on iteration 30, the second run on iteration 174.
I am experiencing similar hangs without error messages when I call map_sync repeatedly. I'm using Windows 10, Python 3.7.7, ipyparallel 6.2.4.
Hangs are really hard to debug. It is suspicious to me that you have all reported the issue on Windows, which makes me think there is some kind of resource exhaustion or hang that only occurs on Windows and that I have never been able to reproduce.
The best way to figure this out is to enable debug logging on all resources:
ipcluster start --debug
and set client.debug = True
before starting. This will produce an enormous amount of output for over 100 iterations, but I don't know how else to debug without being able to reproduce it.
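For example, a minimal sketch of the client side (only client.debug comes from the suggestion above; the rest is just the reproducer setup):

import ipyparallel as ipp

c = ipp.Client()
c.debug = True  # echo client <-> controller message traffic for debugging

v = c.load_balanced_view()
# ... run the reproducer loop from above and capture the output around the hang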
If you do encounter this hang and can interrupt it in an interactive session (e.g. when running in IPython or under a debugger), can you share the following (one way to capture these automatically is sketched after the list):
client.queue_status()
client.outstanding
client.history[-24:]
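If interrupting by hand is awkward, a variant of the reproducer that switches to map_async and dumps these diagnostics when a batch takes suspiciously long might look like this (the 60-second threshold is an arbitrary assumption; each batch normally finishes in a couple of seconds):

import numpy as np
import ipyparallel as ipp

c = ipp.Client()
dv = c[:]
dv.execute("import time")
dv.execute("import numpy as np")
v = c.load_balanced_view()

def fun(state):
    time.sleep(0.5)
    return np.random.normal()

for it in range(1000):
    ar = v.map_async(fun, range(12))   # non-blocking submit so we can time out
    ar.wait(timeout=60)                # arbitrary threshold for "this looks hung"
    if not ar.ready():
        print("Hang detected at iteration", it)
        print(c.queue_status())        # per-engine queue / completed / tasks counts
        print(c.outstanding)           # msg_ids the client is still waiting on
        print(c.history[-24:])         # the most recently submitted msg_ids
        break
    outputs = np.array(ar.get())
    print("Iteration", it)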
Has anyone seen this not on Windows?
There is a chance this is #294, in which case it should be fixed by #464. I'm not sure, though.
I have exactly the same problem on Linux: CentOS 7.4.1708, Python 3.7.3, ipyparallel 6.3.0.
c.queue_status():
{'unassigned': 0,
0: {'queue': 0, 'completed': 398, 'tasks': 0},
1: {'queue': 0, 'completed': 380, 'tasks': 0},
2: {'queue': 0, 'completed': 380, 'tasks': 0},
3: {'queue': 0, 'completed': 379, 'tasks': 1},
4: {'queue': 0, 'completed': 380, 'tasks': 0},
5: {'queue': 0, 'completed': 380, 'tasks': 0},
6: {'queue': 0, 'completed': 380, 'tasks': 0},
7: {'queue': 0, 'completed': 380, 'tasks': 0},
8: {'queue': 0, 'completed': 380, 'tasks': 0},
9: {'queue': 0, 'completed': 380, 'tasks': 0},
10: {'queue': 0, 'completed': 380, 'tasks': 0},
11: {'queue': 0, 'completed': 380, 'tasks': 0}}
c.outstanding
{'36a29353-199402148ca89fb53ab4ee6c_627'}
c.history[-24:]
['36a29353-199402148ca89fb53ab4ee6c_613',
'36a29353-199402148ca89fb53ab4ee6c_614',
'36a29353-199402148ca89fb53ab4ee6c_615',
'36a29353-199402148ca89fb53ab4ee6c_616',
'36a29353-199402148ca89fb53ab4ee6c_617',
'36a29353-199402148ca89fb53ab4ee6c_618',
'36a29353-199402148ca89fb53ab4ee6c_619',
'36a29353-199402148ca89fb53ab4ee6c_620',
'36a29353-199402148ca89fb53ab4ee6c_621',
'36a29353-199402148ca89fb53ab4ee6c_622',
'36a29353-199402148ca89fb53ab4ee6c_623',
'36a29353-199402148ca89fb53ab4ee6c_624',
'36a29353-199402148ca89fb53ab4ee6c_625',
'36a29353-199402148ca89fb53ab4ee6c_626',
'36a29353-199402148ca89fb53ab4ee6c_627',
'36a29353-199402148ca89fb53ab4ee6c_628',
'36a29353-199402148ca89fb53ab4ee6c_629',
'36a29353-199402148ca89fb53ab4ee6c_630',
'36a29353-199402148ca89fb53ab4ee6c_631',
'36a29353-199402148ca89fb53ab4ee6c_632',
'36a29353-199402148ca89fb53ab4ee6c_633',
'36a29353-199402148ca89fb53ab4ee6c_634',
'36a29353-199402148ca89fb53ab4ee6c_635',
'36a29353-199402148ca89fb53ab4ee6c_636']
Thanks for that sample! That suggests that it is not fixed by #464, because that was purely a client-side race.
If this is reliably reproducible for you, can you share the controller's log output as well? Can you also test with the latest 7.0.0a5 in case it happens to be fixed already, even if not by #464?
A complete reproducible example is always hugely helpful, but I realize that's often not feasible for bugs like this one.
I've run this sample locally a few times (macOS 11.5.2, Python 3.9.6, ipyparallel 7.0.0b3), and it completes 1000 iterations without any errors. So I'm going to hope that some of the big refactors in 7.0 have fixed this, possibly also changes in ipykernel 6.
Based on @seekjim20's debug output, the issue is a failure to return one task reply. Since both the client and the Hub agree that the task is not done, it suggests that the message was not delivered to (or not handled properly in) the task scheduler. Checking for the missing msg id (36a29353-199402148ca89fb53ab4ee6c_627) in the task scheduler's debug logs may point to the next step. Or it could have been an error on the engine itself, failing to send the message.
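One way to do that check, assuming the controller and scheduler logs are written to files under the profile's log directory (whether and where that happens depends on how ipcluster is configured; the path below is an assumption based on the default profile layout):

from pathlib import Path

# Assumed log location for the default profile; adjust for your setup
log_dir = Path.home() / ".ipython" / "profile_default" / "log"
msg_id = "36a29353-199402148ca89fb53ab4ee6c_627"

for log_file in sorted(log_dir.glob("*.log")):
    matches = [line for line in log_file.read_text(errors="replace").splitlines()
               if msg_id in line]
    if matches:
        print("---", log_file.name, "---")
        print("\n".join(matches))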