dispy
dispynode failure crash
using dispy 4.6.12
While dispynode is running, it suddenly crashes and doesn't send results back to the cluster anymore. This error doesn't always happen; it looks like a race condition, but I have no idea what might be wrong. It always happens at the same line (import nose):
2016-03-23 21:57:46,694 - dispynode - New job id 329691808 from 192.168.13.58/192.168.13.58
2016-03-23 21:57:46,694 - dispynode - New job id 329691928 from 192.168.13.58/192.168.13.58
Exception in thread Thread-5:
Traceback (most recent call last):
File "C:\Anaconda2\lib\threading.py", line 801, in __bootstrap_inner
self.run()
File "C:\Anaconda2\lib\threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "C:\Anaconda2\lib\site-packages\gulbis\ext\dispy\dispynode.py", line 1224, in __reply_Q
job_reply = self.reply_Q.get()
File "C:\Anaconda2\lib\multiprocessing\queues.py", line 117, in get
res = self._recv()
File "C:\Anaconda2\lib\site-packages\gulbis\__init__.py", line 1, in <module>
from core.version import *
File "C:\Anaconda2\lib\site-packages\gulbis\core\__init__.py", line 3, in <module>
import sk_test
File "C:\Anaconda2\lib\site-packages\gulbis\core\sk_test.py", line 21, in <module>
from gulbis.core.common import get_run_dir, make_rundata_filepath_from_movie_basename, get_movie_list, get_sk_movie_dir, SUPPORTED_METRIC_OPERATION_TYPES
File "C:\Anaconda2\lib\site-packages\gulbis\core\common.py", line 10, in <module>
import nose
File "C:\Anaconda2\lib\site-packages\nose\__init__.py", line 1, in <module>
from nose.core import collector, main, run, run_exit, runmodule
File "C:\Anaconda2\lib\site-packages\nose\core.py", line 11, in <module>
from nose.config import Config, all_config_files
File "C:\Anaconda2\lib\site-packages\nose\config.py", line 9, in <module>
from nose.plugins.manager import NoPlugins
File "C:\Anaconda2\lib\site-packages\nose\plugins\__init__.py", line 185, in <module>
from nose.plugins.manager import *
File "C:\Anaconda2\lib\site-packages\nose\plugins\manager.py", line 418, in <module>
import pkg_resources
File "c:\anaconda2\lib\site-packages\setuptools-20.3-py2.7.egg\pkg_resources\__init__.py", line 48, in <module>
File "C:\Anaconda2\lib\site-packages\setuptools-20.3-py2.7.egg\pkg_resources\extern\__init__.py", line 43, in load_module
AttributeError: 'NoneType' object has no attribute 'modules'
2016-03-23 21:57:47,601 - dispynode - New job id 329692048 from 192.168.13.58/192.168.13.58
2016-03-23 21:57:47,898 - dispynode - New job id 329692168 from 192.168.13.58/192.168.13.58
Unfortunately, the trace doesn't point to where or what the problem is in dispynode. The last point in dispynode is
File "C:\Anaconda2\lib\site-packages\gulbis\ext\dispy\dispynode.py", line 1224, in __reply_Q
job_reply = self.reply_Q.get()
which shouldn't cause a crash (multiprocessing.Queue doesn't require locking). Is it possible the problem is elsewhere, in the imported modules? You could also try the latest Python 2.7 (.11 is current). Apparently there is a 'next generation' of nose at https://github.com/nose-devs/nose2, which may be another thing to try.
Also, the log seems to indicate dispynode itself didn't crash, as it accepted new jobs (although the reply processing thread may have crashed due to the above). dispy / dispynode / dispyscheduler are designed not to crash (as asyncoro is used) - wherever coroutines are used, everything should continue to work except for the coroutine that crashed. But in this case, replies are processed with a thread, as the queue is accessed in child processes.
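Roughly, reply processing in dispynode follows this pattern (a simplified sketch for illustration, not the actual dispynode code):

import threading
from multiprocessing import Queue

reply_Q = Queue()   # filled by the child processes that run jobs

def reply_task():
    # a dedicated thread blocks on the queue; an unhandled exception here
    # terminates only this thread, so the node keeps accepting new jobs
    # while replies are no longer forwarded
    while True:
        job_reply = reply_Q.get()
        if job_reply is None:
            break
        # ... send job_reply back to the client ...

thread = threading.Thread(target=reply_task)
thread.daemon = True
thread.start()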
Note that I didn't have this problem when I was using 4.6.0. I recently upgraded to 4.6.10 and then to 4.6.12; only then did I start to have this issue.
I'm going to try reinstalling my complete Python environment and using 4.6.5 (because I need the fix for sending large result files > 1 MB) to see if this fixes the problem.
Yes, narrowing it down to the first version between 4.6.5 and 4.6.12 that breaks it should help in fixing it.
I tried with 4.6.5 on a clean python env and I didn't see the problem. Same for 4.6.6, 4.6.7 and 4.6.8.
So up to 4.6.8 we are good.
I am willing to try with the rest of the versions if this doesn't already help.
Thanks; that is useful. I will take a look at changes from 4.6.8 to 4.6.9 over the weekend.
To avoid any misunderstanding: I didn't try 4.6.9, so it is possible that it works fine. What I can confirm right now is that up to and including 4.6.8 I don't see the problem. When I have time I will try 4.6.9, 4.6.10 and 4.6.11. On the other hand, 4.6.12 certainly has the problem.
Ah, ok. Let me see if the changes from 4.6.8 to 4.6.12 are small enough to isolate. In the meantime, if you can narrow it down further, that would be great. I went through the changes quickly, and it looks like the changes in 4.6.11 might affect this behavior. That is, my guess is that 4.6.11 would also not work, but 4.6.10 should. If so, you can try dispynode.py from 4.6.10 with the rest of dispy from 4.6.11 to confirm that the issue is the dispynode changes in 4.6.11.
Actually, I made a quick patch against current github to fix dispynode.py; can you try this instead with latest github (or 4.6.12 release)?
Yet another attempt at a fix: the attached patch is simpler and more efficient, if indeed your problem is due to changes in 4.6.11. Can you give this one a try? Is it possible for you to construct a simple example that fails, for me to test? I have tried a few examples but I don't see the issue with 4.6.11 or current github, so I am assuming your program is doing something different. Likely you are setting up global variables with a 'setup' function that is messing up the child process? Anyway, if you can attach an example it would be great.
To comment further on the problem, is it possible you have a 'setup' function that is setting the 'sys' variable to 'None'? I am guessing at this, as the error in your description
AttributeError: 'NoneType' object has no attribute 'modules'
indicates 'None' being accessed where the sys module should probably be. If so, I can understand why the change in 4.6.11 might have broken it.
I just tried version 4.6.9 and it actually does break. So the changes from 4.6.8 to 4.6.9 are the root cause. I think you were not expecting this, as you thought 4.6.11 was the problematic version. Given this, let me know if you still want me to try the 2 patches, or whether you need to check again.
To comment on setting the 'sys' variable to 'None': that is not actually true, because what I do is clear the modules loaded in sys.modules:
for m in to_delete:
    del sys.modules[m]
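Roughly, the pattern is (simplified; the exact way to_delete is built in the library may differ):

import sys

before = set(sys.modules)        # snapshot taken before the work starts

# ... the library imports whatever it needs (nose, etc.) ...

# afterwards, remove what was newly added to sys.modules
to_delete = [m for m in sys.modules if m not in before]
for m in to_delete:
    del sys.modules[m]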
Can you attach a simple program that I can try (one that exhibits this behavior)? I thought it must be simple, but it looks like it is not. It is much easier if I can reproduce it (and add it to my test suite).
It is a big piece of code that uses a lot of internal libraries. I tried to create a simple program that exhibits the behavior but didn't manage to. It looks like a combination of several factors, and you probably won't be able to reproduce it in your local environment.
I will keep trying until I manage to get that program simplified.
In the meantime, I can confirm that 4.6.9 is the version that breaks it; hopefully this will lead you somewhere. I am willing to test your fixes/changes/patches in my local environment.
In that case, you can try a couple of approaches. First, dispynode.py from 4.6.8 should work with the rest of dispy 4.6.9 (to isolate/confirm that the issue is in dispynode and not the rest). If that works, the change in dispynode that looks to be causing the problem is the block of 4 lines around line 1377 that updates sys.modules. Can you comment them out in 4.6.9 and try again, to confirm whether that works?
Can you also describe an outline of your program? Are you using a 'setup' function to load modules? Where are you executing the block you mentioned above:
for m in to_delete:
    del sys.modules[m]
In 'setup', 'cleanup', or in a job function? What if you don't remove modules from sys.modules? Note that dispynode in 4.6.9 does cleanup properly, so the user program doesn't have to do that.
I don't have setup or cleanup functions. I only use a job function where everything happens.
My job function is very simple; it is just a call to another library (which somewhere runs those lines of code that clean up sys.modules).
I confirm the problem is coming from the dispynode changes in 4.6.9, because dispynode 4.6.8 with the rest of dispy 4.6.9 works.
Also, commenting out the following lines of code from dispynode 4.6.9 makes it work:
for module in sys.modules.keys():
    if module not in compute.ante_modules:
        sys.modules.pop(module, None)
sys.modules.update(self.__init_modules)
It is puzzling that those 4 lines affect your program, as you don't use 'setup' and 'cleanup' (which are the reason for those 4 lines). Can you clarify that this problem occurs after you close a computation (either with a 'cluster.close' statement, or when your client program terminates) and you then submit jobs with a new cluster or program?
Can you try commenting out the first 3 lines of that block (the ones that remove modules from 'sys.modules'), so only the 4th line that updates 'sys.modules' runs? And then try the other way, commenting out only the 4th line and running the first 3 lines?
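That is, with the 4.6.9 block quoted above, the two variants to test would be:

# variant 1: comment out the 3 lines that remove modules; keep only the update
# for module in sys.modules.keys():
#     if module not in compute.ante_modules:
#         sys.modules.pop(module, None)
sys.modules.update(self.__init_modules)

# variant 2: keep the 3 lines that remove modules; comment out only the update
for module in sys.modules.keys():
    if module not in compute.ante_modules:
        sys.modules.pop(module, None)
# sys.modules.update(self.__init_modules)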
Sure, I will try that.
What is also strange is that it doesn't always happen. I have a script that runs my test 10 times in a row; the problem shows up in one of the 10 runs.
To answer your first question, here are the steps I do:
1- broadcast an empty function (def empty(): pass) to detect which nodes are alive and ready to respond.
2- get the list of active nodes using my_cluster._cluster._nodes.keys(), which I call active_nodes.
3- send a setup job to run on each available PC only once. Code:
for node in active_nodes:
    cluster = dispy.JobCluster(setup_env, callback=on_setup_finished, secret=self.shared_secret, nodes=[str(node)])
    job = cluster.submit()
4- send the core job to run on all available CPUs. Code:
cluster = dispy.JobCluster(core_job, callback=collect_job_result, secret=self.shared_secret, nodes=active_nodes)
for job_id in core_jobs:
    job = cluster.submit()
    job.id = job_id
Now the problem occurs at step 4: while distributing jobs, and before getting back any results, the thread that sends results back in dispynode crashes. The content of core_job, which is quite a big piece of code, does clean up some of the modules in sys.modules at the end.
While dispynode should work with your implementation, I have a few suggestions:
- Cluster attributes with an underscore are not meant for users; they are for the implementation only and can change. There are other ways to get the list of nodes. You can use the 'cluster_status' callback and process the 'DispyNode.Initialized' status (see 'job_scheduler.py' in the examples on how to use this feature, and the sketch after this list), or you can call the 'status' method, which gives you the list of nodes (see the 'ClusterStatus' structure). In any case, if you create a dummy cluster first to get available nodes, it is likely that by the time you create the second cluster some nodes may have gone away or new ones may have been initialized. You can also use 'NodeAllocate' (see 'FilterNodeAllocate' at http://dispy.sourceforge.net/dispy.html#nodeallocate) to collect the nodes discovered.
- Instead of creating a cluster for each node to run the setup function, use the 'setup' feature; setting up/initializing nodes is what it is designed for. Remember to return 0 from the setup function (see 'node_setup.py' in the examples on how to use this feature). If you want to keep your current approach, make sure to wait for the setup job to finish before submitting jobs to a node.
- Instead of creating clusters with nodes set to a specific node, you can use the 'submit_node' method.
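For example, a rough sketch of the callback approach (core_job, collect_job_result and self.shared_secret are placeholders taken from your code; the other names are arbitrary):

import threading
import dispy

discovered = []                  # DispyNode objects reported by dispy
node_avail = threading.Event()

def status_cb(status, node, job):
    # called by dispy whenever a node or job changes status
    if status == dispy.DispyNode.Initialized:
        discovered.append(node)
        node_avail.set()

cluster = dispy.JobCluster(core_job, callback=collect_job_result,
                           cluster_status=status_cb, secret=self.shared_secret)
node_avail.wait()                # wait until at least one node is initialized
for node in discovered:
    job = cluster.submit_node(node)   # submit a job to that specific node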
Thanks for the suggestions. I will check how to integrate them into my code.
In the meantime, I tried commenting out the last line of that block as you suggested, and it still crashes. However, commenting out the first 3 lines doesn't cause any issue.
So the problem is caused by the sys.modules.pop calls and not by sys.modules.update.
I am guessing that your computation jobs (core_job) depend on modules loaded by the setup_env cluster. This would have worked prior to 4.6.9, because modules used in a computation were never removed after it was done. Since 4.6.9, closing a computation restores the state, including modules, to the initial state.
As mentioned before, you can use the 'setup' function to do the initialization needed for a computation (i.e., you can run the 'setup_env' function via the 'setup' parameter) and then run the 'core_job' jobs. With this you don't need to collect available nodes either.
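For instance, a rough sketch (your core_job, collect_job_result, core_jobs and self.shared_secret stand in here as placeholders; the setup function must return 0):

import dispy

def setup_env():      # runs once on each node, before any jobs
    import socket, tempfile, shutil, os
    # ... copy shared files, prepare the node, etc. ...
    return 0          # return 0 to indicate successful setup

def cleanup_env():    # optional: runs when the computation is closed
    return 0

cluster = dispy.JobCluster(core_job, callback=collect_job_result,
                           setup=setup_env, cleanup=cleanup_env,
                           secret=self.shared_secret)
for job_id in core_jobs:
    job = cluster.submit()
    job.id = job_id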
It is also possible to use different clusters as you have done (although this is not ideal). If you create different clusters and don't close them, your current approach would work; i.e.,
setup_clusters = {}
for node in active_nodes:
    setup_clusters[str(node)] = dispy.JobCluster(setup_env, callback=on_setup_finished, secret=self.shared_secret, nodes=[str(node)])
    job = setup_clusters[str(node)].submit()

job_cluster = dispy.JobCluster(core_job, callback=collect_job_result, secret=self.shared_secret, nodes=active_nodes)
for job_id in core_jobs:
    job = job_cluster.submit()
    job.id = job_id

# after job_cluster jobs are done, close setup clusters:
for node in active_nodes:
    setup_clusters[str(node)].close()
setup_env copies a shared file on the network to a local temp file. It only uses standard modules, nothing special. Here are the modules loaded in setup_env: import socket, tempfile, shutil, os.
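Roughly, setup_env does something like this (the share path and file name below are made up, just to illustrate):

def setup_env():
    # imports are inside the function, since dispy transfers only the function itself
    import socket, tempfile, shutil, os
    src = r'\\server\share\input.dat'   # hypothetical network path
    dst = os.path.join(tempfile.gettempdir(), socket.gethostname() + '_input.dat')
    shutil.copy(src, dst)               # copy the shared file to a local temp file
    return dst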
Later on, in core_job, I also load those modules again (import socket, tempfile, shutil, os) to use them for other things.
So if I understood you well, those modules are removed from sys.modules at the end of setup_env, and loading them again later on in core_job is not allowed starting from 4.6.9. That must be strange behavior, no?
Well, when you create a cluster for 'setup_env' in the 'cluster' variable again and again, and later use the same variable to create the 'core_job' cluster, behind the scenes each 'cluster' is being closed when you replace it with another cluster. This may be interfering with your 'core_job' behavior? Since the trace log doesn't help to point out the issue in dispynode, I am guessing at what might be going on and offering a possible explanation.
Of course, the 'core_job' computation can load them again. But it can't assume that those modules are already loaded (since they were loaded in the 'setup_env' cluster, which has been closed). Prior to 4.6.9, even after 'setup_env' was closed, those modules would have remained loaded in the dispynode main program.
Can you update if this problem is resolved, or still having issues?
Still having the same issue. At the moment I am still using 4.6.8.
I was thinking of rewriting my code to make use of the setup and teardown features of dispy to see if that fixes the problem.
Can you update if this is still an issue / close if solved?
I still have to test it with the latest dispy version first. I will let you know.