aggregate_covariance infinite loop

Open cdoersch opened this issue 9 years ago • 0 comments

A few people have emailed me about this now...

When the code starts running, you may see something like this:

aggregate_covariance: 0+0/1500
working workers: 
0 idle.
aggregate_covariance: 0+0/1500
working workers: 
0 idle.
aggregate_covariance: 0+0/1500
working workers: 
0 idle.
aggregate_covariance: 0+0/1500
working workers: 
0 idle.
aggregate_covariance: 0+0/1500
working workers: 
0 idle.
...

Note that "working workers:" may instead have numbers beside it, e.g.:

working workers: 1 2 3 4

This means that the code is running fine! aggregate_covariance can take a while. If you want to monitor it, you can look at the worker logs, which are in dswork's output directory under ds/sys/distproc/outputXX.log.

However, if "working workers:" is blank, as above, it most likely means that the workers have not started properly, so the main thread will keep waiting for them forever. There are many reasons this can happen. The first place to check is ds/sys/distproc/outputXX.log. If there's an error there, try to fix it. If these files don't exist (which is more likely), it means the workers crashed during startup.

There should be a file called ds/sys/distproc/qsubfileXX.sh for each worker. Try running one of those from the command line. If it gives you an error, try to fix it. One common problem is that your shell cannot find matlab. The easiest fix is to edit dswork/dsmapredopen.m so that matlabbin points to the matlab executable on your system. If this still doesn't work, comment below.
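If you have a lot of workers, the checks above can be scripted. Here is a rough Python sketch (not part of the codebase) that scans the distproc directory and reports which workers have a qsubfileXX.sh but no outputXX.log, and which logs mention an error. It assumes that XX is a numeric worker index and that a failing log contains the word "error"; the function name diagnose_workers is made up for illustration.

```python
import re
from pathlib import Path


def _index_files(directory, pattern):
    """Map worker index -> path for files whose names fully match the regex."""
    found = {}
    for path in directory.iterdir():
        match = re.fullmatch(pattern, path.name)
        if match:
            found[match.group(1)] = path
    return found


def diagnose_workers(distproc_dir):
    """Report workers that never wrote a log and logs that mention an error.

    Assumptions (not confirmed by the codebase): XX in qsubfileXX.sh and
    outputXX.log is a numeric worker index, and a failing log contains
    the word "error" in some capitalization.
    """
    directory = Path(distproc_dir)
    scripts = _index_files(directory, r"qsubfile(\d+)\.sh")
    logs = _index_files(directory, r"output(\d+)\.log")

    # A submit script without a matching log suggests the worker crashed
    # during startup, before it could write anything.
    never_started = sorted(set(scripts) - set(logs), key=int)

    # A log that mentions an error is the first thing to read.
    logs_with_errors = sorted(
        (idx for idx, path in logs.items()
         if "error" in path.read_text(errors="replace").lower()),
        key=int,
    )
    return {"never_started": never_started,
            "logs_with_errors": logs_with_errors}
```

Call it as diagnose_workers("<your output directory>/ds/sys/distproc"). For any index listed under "never_started", run the matching qsubfileXX.sh by hand from the command line to see the startup error.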

Jun 21 '16 13:06 cdoersch