dispyscheduler - Could not send node status to

Open Seraphli opened this issue 7 years ago • 26 comments

I am trying to use dispyscheduler. And it seems a bit different from JobCluster.

import time

import dispy


def compute():
    # Imports live inside the function because it is executed on the remote nodes.
    import random
    import time
    t = random.randint(1, 3)
    time.sleep(t)
    return t


cluster = dispy.SharedJobCluster(compute, ip_addr="192.168.5.150",
                                 ext_ip_addr="192.168.5.150",
                                 scheduler_node="192.168.5.190")
job = cluster.submit()
print(job())  # job() waits for the job to finish and returns its result
cluster.print_status()

time.sleep(1)
cluster.close()

Here is my code. When I run it, the program exits with code 0. Strangely, the files created for storing fault recovery information are not removed. So I enabled debug messages, and the message 2018-04-20 18:57:49 dispyscheduler - Could not send node status to 192.168.5.150:51347 appeared when I closed the cluster. I don't know whether this is a bug or not, because with JobCluster the files are removed automatically when the job is done, but SharedJobCluster doesn't behave the same way. I'm using version 4.8.6.

Seraphli avatar Apr 20 '18 11:04 Seraphli

The message about Could not send node status at the end is harmless: The scheduler sends status messages to client. While cluster is alive, these messages are received by the client. However, when cluster is closed, the scheduler still sends these messages but client can't receive them, giving that warning.

I will take a look at why fault recover files are not removed later. Can you confirm that you are using latest release (4.8.6)?

pgiri avatar Apr 20 '18 11:04 pgiri

Yes, I updated dispy today to see if this was related to the version. Actually, I have been facing another issue these days. When I run my code on the nodes using JobCluster for about three hours or more, the nodes sometimes break down. The console keeps printing invalid-reply messages from the nodes while the code is still running. I don't know why this happens. I used a script to check whether there is a problem with the router, but everything seems fine.

Seraphli avatar Apr 20 '18 12:04 Seraphli

I assume "nodes will break down" means the client and nodes lose connection? This can happen if "pulse" messages (exchanged at pulse_interval period) are lost. If 5 consecutive pulse messages are lost, nodes assume the client has crashed and close the computation. If you want to test, you can pass a smaller pulse_interval option (say, 5 seconds) to JobCluster.

Note that if client loses connection to node, it is possible to recover the results of the jobs later with dispy.recover_jobs function.
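As a rough illustration of the rule described above (a peer is presumed dead after 5 consecutive lost pulses), here is a minimal sketch. The class and function names are hypothetical and this is not dispy's actual implementation; it only models the counting logic.

```python
class PulseMonitor:
    """Hypothetical helper that tracks consecutive missed pulses (not dispy code)."""

    def __init__(self, max_missed=5):
        # 5 is the threshold mentioned in this thread.
        self.max_missed = max_missed
        self.missed = 0

    def pulse_received(self):
        # Any pulse that arrives resets the count of consecutive misses.
        self.missed = 0

    def pulse_missed(self):
        # Record one lost pulse and report whether the peer is now presumed dead.
        self.missed += 1
        return self.peer_presumed_dead()

    def peer_presumed_dead(self):
        return self.missed >= self.max_missed


def seconds_until_presumed_dead(pulse_interval, max_missed=5):
    # Approximate time without any pulse before the peer is given up on.
    return pulse_interval * max_missed
```

Under this simplified model, a pulse_interval of 5 seconds would mean roughly 25 seconds of silence before the computation is closed; the real timing in dispy may differ.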

pgiri avatar Apr 20 '18 12:04 pgiri

Here is my command to start the scheduler:

dispyscheduler.py -i 192.168.5.190 --ext_ip_addr 192.168.5.190 --pulse_interval 2

And my command to start one node:

dispynode.py --clean -i 192.168.5.203 --ext_ip_addr 192.168.5.203

The problem still exists. Here is the output of my code. You can see that the error causes dispy to output messages continually, which overwhelms the console, while the code is still running according to the tqdm progress bar. Any advice on how to avoid this situation?

2018-04-20 22:13:41 pycos - version 4.6.5 with epoll I/O notifier
2018-04-20 22:13:41 dispy - dispy client version: 4.8.6
2018-04-20 22:13:41 dispy - Storing fault recovery information in "_dispy_20180420221341"
100%|█████████▉| 7999/8000 [07:46<00:00, 17.24it/s]100%|██████████| 8000/8000 [07:46<00:00, 17.14it/s]
(0, 71.665171388328645)
100%|██████████| 8000/8000 [09:07<00:00,  8.30it/s]
(1, 70.784914254711978)
100%|██████████| 8000/8000 [09:41<00:00,  7.14it/s]
(2, 68.904881537609029)
  9%|▉         | 756/8000 [01:04<09:12, 13.12it/s]2018-04-20 22:41:30 dispy - Creating job for "((<stds.env_click.EnvClick object at 0x7fc2e27162b0>, [(43, 30), (42, 37), (45, 39), (45, 31), (35, 28)], 200),)", "{}" failed with "Traceback (most recent call last):
  File "/home/seraphli/Env/nkrf354/lib/python3.5/site-packages/dispy/__init__.py", line 2914, in submit_node
    sock.send_msg(b'JOB:' + serialize(req))
  File "/home/seraphli/Env/nkrf354/lib/python3.5/site-packages/pycos/__init__.py", line 858, in _sync_send_msg
    return self._sync_sendall(struct.pack('>L', len(data)) + data)
  File "/home/seraphli/Env/nkrf354/lib/python3.5/site-packages/pycos/__init__.py", line 668, in _sync_sendall
    sent = self._rsock.send(buf, *args)
BrokenPipeError: [Errno 32] Broken pipe
"
 10%|▉         | 793/8000 [01:05<35:23,  3.39it/s]2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:30 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
 10%|█         | 805/8000 [01:06<28:10,  4.26it/s]2018-04-20 22:41:31 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:32 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:32 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:32 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:32 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:32 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
 10%|█         | 811/8000 [01:06<23:05,  5.19it/s]2018-04-20 22:41:32 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:32 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:32 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:32 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:32 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:33 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
 10%|█         | 816/8000 [01:07<20:11,  5.93it/s]2018-04-20 22:41:33 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:33 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:33 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:33 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
 10%|█         | 820/8000 [01:07<15:28,  7.74it/s]2018-04-20 22:41:33 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:33 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:33 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:33 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
 10%|█         | 824/8000 [01:07<11:45, 10.18it/s]2018-04-20 22:41:33 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:33 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:33 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:33 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
 10%|█         | 828/8000 [01:07<09:32, 12.53it/s]2018-04-20 22:41:33 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190
2018-04-20 22:41:33 dispy - Ignoring invalid reply for job 3059434668 from 192.168.5.190

Seraphli avatar Apr 20 '18 15:04 Seraphli

The issue of SharedJobCluster not removing fault recovery file is fixed (in github master).

The above log shows that sending a job failed, although I don't know why errno 32 occurs in that case. Are job arguments large (in terms of size of data, e.g., large array) by any chance?
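For context, the traceback above shows the message being sent as a 4-byte big-endian length followed by the payload (struct.pack('>L', len(data)) + data). Errno 32 (EPIPE) generally means the other end had already closed the connection when the write was attempted. The sketch below reproduces only that framing; the unframe function is an assumption about how a receiver might parse such a stream, not code taken from pycos.

```python
import struct

# 4-byte unsigned big-endian length header, matching '>L' in the traceback.
HEADER = struct.Struct('>L')


def frame(payload: bytes) -> bytes:
    """Prefix the payload with its length, as in the send path shown above."""
    return HEADER.pack(len(payload)) + payload


def unframe(buffer: bytes):
    """Hypothetical receiver side: return (payload, rest), or (None, buffer)
    when the buffer does not yet hold one complete message."""
    if len(buffer) < HEADER.size:
        return None, buffer
    (length,) = HEADER.unpack_from(buffer)
    end = HEADER.size + length
    if len(buffer) < end:
        return None, buffer
    return buffer[HEADER.size:end], buffer[end:]
```

With this kind of framing, a connection that drops mid-message leaves the receiver with an incomplete frame, which is one reason the first error in each log is usually the most informative.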

pgiri avatar Apr 21 '18 04:04 pgiri

Actually, there are more error log messages than I posted. The whole log is too long, so I only included the very beginning of it. As for the arguments, I don't know for sure yet. I assume I am not sending a large amount of data, but I will log the size of the arguments to confirm it.

Seraphli avatar Apr 21 '18 04:04 Seraphli

Is the github code up to date? I don't see the latest release 4.8.6 there, so I thought you mainly updated your code on sourceforge.

Seraphli avatar Apr 21 '18 04:04 Seraphli

You can get latest code from github as either git clone or zip.

If the first error is not due to sending job, then ignore my concern about large arguments. It is likely scheduler already closed computation so sending any job will fail. Basically the very first error from client (and scheduler) may point to issue.

pgiri avatar Apr 21 '18 04:04 pgiri

Once you get the source from git, you can either copy the files to where your current installation is, or generate an sdist file with

python setup.py sdist

and then install it with

python -m pip install dist/dispy-4.8.6.tar.gz --upgrade

pgiri avatar Apr 21 '18 04:04 pgiri

Okay, I will try the new package and test it.

Seraphli avatar Apr 21 '18 05:04 Seraphli

Here is some new information about the problem above. I logged the size of every argument using len(pickle.dumps(args, -1)). It shows that before and while the error messages appear, the size of the arguments ranges from 49950 to 49993 bytes. And this time there is no broken pipe error.
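The measurement described above can be wrapped in a small helper. This is a minimal sketch; the function name and the warning threshold are hypothetical, and the size is that of the pickled arguments only, not of whatever dispy actually sends over the wire.

```python
import pickle


def pickled_size(args):
    """Size in bytes of args when pickled with the highest protocol (-1)."""
    return len(pickle.dumps(args, -1))


def is_large(args, limit=1024 * 1024):
    # The 1 MiB limit is an arbitrary illustrative threshold, not a dispy setting.
    return pickled_size(args) > limit
```

Calling pickled_size on the argument tuple just before each cluster.submit call gives the same numbers as the expression quoted above.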

2018-04-21 13:14:18 pycos - version 4.6.5 with epoll I/O notifier
2018-04-21 13:14:18 dispy - dispy client version: 4.8.6
2018-04-21 13:14:18 dispy - Storing fault recovery information in "_dispy_20180421131418"
100%|██████████| 8000/8000 [07:29<00:00, 17.81it/s]
(0, 73.566876306809064)
100%|██████████| 8000/8000 [08:18<00:00, 16.06it/s]
(1, 69.459179827996323)
100%|██████████| 8000/8000 [09:07<00:00, 14.60it/s]
(2, 67.905599361444729)
100%|██████████| 8000/8000 [09:58<00:00, 13.36it/s]
(3, 68.03278962936875)
100%|██████████| 8000/8000 [11:09<00:00, 13.12it/s]
(4, 64.303740290717101)
  3%|▎         | 256/8000 [00:37<1:40:33,  1.28it/s]  3%|▎         | 269/8000 [00:37<1:10:35,  1.83it/s]2018-04-21 14:01:11 dispy - Ignoring invalid reply for job 3047148908 from 192.168.5.190
2018-04-21 14:01:11 dispy - Ignoring invalid reply for job 3047148140 from 192.168.5.190
2018-04-21 14:01:11 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▎         | 286/8000 [00:38<51:56,  2.48it/s]  2018-04-21 14:01:12 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:12 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:13 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:13 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:13 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:13 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:13 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▎         | 292/8000 [00:39<44:35,  2.88it/s]2018-04-21 14:01:13 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:13 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:13 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:13 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▎         | 296/8000 [00:39<32:55,  3.90it/s]2018-04-21 14:01:13 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:13 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:13 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:13 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 300/8000 [00:39<24:14,  5.29it/s]2018-04-21 14:01:13 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:13 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:13 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:14 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 304/8000 [00:40<19:24,  6.61it/s]2018-04-21 14:01:14 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:14 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:14 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 307/8000 [00:40<15:09,  8.46it/s]2018-04-21 14:01:14 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:14 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:14 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 310/8000 [00:40<11:56, 10.73it/s]2018-04-21 14:01:14 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:14 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:14 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 313/8000 [00:40<10:04, 12.72it/s]2018-04-21 14:01:14 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:14 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:14 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 316/8000 [00:40<09:49, 13.03it/s]2018-04-21 14:01:14 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:15 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:15 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:15 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 319/8000 [00:41<16:32,  7.74it/s]2018-04-21 14:01:15 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:15 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 321/8000 [00:42<21:01,  6.09it/s]2018-04-21 14:01:15 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:16 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 323/8000 [00:42<17:30,  7.31it/s]2018-04-21 14:01:16 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:16 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 325/8000 [00:42<15:20,  8.33it/s]2018-04-21 14:01:16 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:16 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:16 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 328/8000 [00:42<12:01, 10.64it/s]2018-04-21 14:01:16 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:16 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 330/8000 [00:42<13:44,  9.30it/s]2018-04-21 14:01:16 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:16 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:16 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 333/8000 [00:42<11:06, 11.49it/s]2018-04-21 14:01:16 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:16 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 335/8000 [00:43<15:05,  8.47it/s]2018-04-21 14:01:17 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:17 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:17 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 337/8000 [00:43<15:52,  8.04it/s]2018-04-21 14:01:17 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:17 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:17 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 340/8000 [00:43<13:09,  9.70it/s]2018-04-21 14:01:17 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:17 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 342/8000 [00:43<12:32, 10.18it/s]2018-04-21 14:01:17 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:17 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 344/8000 [00:43<10:50, 11.77it/s]2018-04-21 14:01:17 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:17 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:17 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:17 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 348/8000 [00:44<08:56, 14.27it/s]2018-04-21 14:01:17 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:18 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 350/8000 [00:44<09:23, 13.58it/s]2018-04-21 14:01:18 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:18 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 352/8000 [00:44<13:41,  9.30it/s]2018-04-21 14:01:18 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:18 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 354/8000 [00:45<19:04,  6.68it/s]2018-04-21 14:01:18 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:19 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 356/8000 [00:45<15:51,  8.03it/s]2018-04-21 14:01:19 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:19 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:19 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  4%|▍         | 358/8000 [00:45<14:52,  8.56it/s]2018-04-21 14:01:19 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:19 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:19 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  5%|▍         | 361/8000 [00:45<12:50,  9.92it/s]2018-04-21 14:01:19 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:19 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:19 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  5%|▍         | 364/8000 [00:45<11:31, 11.05it/s]2018-04-21 14:01:19 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:19 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
2018-04-21 14:01:19 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190
  5%|▍         | 367/8000 [00:45<09:48, 12.98it/s]2018-04-21 14:01:19 dispy - Ignoring invalid reply for job 3047149420 from 192.168.5.190

Here is the log output by the scheduler.

2018-04-21 14:01:08 dispyscheduler - Could not send reply for job 3047147116 to 192.168.5.150:51347; saving it in "/tmp/dispy/scheduler/192.168.5.150/random_rollout_with_click_r6wcl6fa/_dispy_job_reply_3047147116"
2018-04-21 14:01:08 dispyscheduler - Could not send reply for job 3047145836 to 192.168.5.150:51347; saving it in "/tmp/dispy/scheduler/192.168.5.150/random_rollout_with_click_r6wcl6fa/_dispy_job_reply_3047145836"
2018-04-21 14:01:09 dispyscheduler - Could not send reply for job 3047146988 to 192.168.5.150:51347; saving it in "/tmp/dispy/scheduler/192.168.5.150/random_rollout_with_click_r6wcl6fa/_dispy_job_reply_3047146988"
2018-04-21 14:01:09 dispyscheduler - Could not send reply for job 3047147948 to 192.168.5.150:51347; saving it in "/tmp/dispy/scheduler/192.168.5.150/random_rollout_with_click_r6wcl6fa/_dispy_job_reply_3047147948"
2018-04-21 14:01:09 dispyscheduler - Could not send reply for job 3047146348 to 192.168.5.150:51347; saving it in "/tmp/dispy/scheduler/192.168.5.150/random_rollout_with_click_r6wcl6fa/_dispy_job_reply_3047146348"
2018-04-21 14:01:09 dispyscheduler - Could not send reply for job 3047147628 to 192.168.5.150:51347; saving it in "/tmp/dispy/scheduler/192.168.5.150/random_rollout_with_click_r6wcl6fa/_dispy_job_reply_3047147628"
2018-04-21 14:01:09 dispyscheduler - Could not send reply for job 3060027884 to 192.168.5.150:51347; saving it in "/tmp/dispy/scheduler/192.168.5.150/random_rollout_with_click_r6wcl6fa/_dispy_job_reply_3060027884"
2018-04-21 14:01:09 dispyscheduler - Could not send reply for job 3047146796 to 192.168.5.150:51347; saving it in "/tmp/dispy/scheduler/192.168.5.150/random_rollout_with_click_r6wcl6fa/_dispy_job_reply_3047146796"
2018-04-21 14:01:09 dispyscheduler - Could not send reply for job 3047148076 to 192.168.5.150:51347; saving it in "/tmp/dispy/scheduler/192.168.5.150/random_rollout_with_click_r6wcl6fa/_dispy_job_reply_3047148076"
2018-04-21 14:01:09 dispyscheduler - Could not send reply for job 3047149228 to 192.168.5.150:51347; saving it in "/tmp/dispy/scheduler/192.168.5.150/random_rollout_with_click_r6wcl6fa/_dispy_job_reply_3047149228"
2018-04-21 14:01:09 dispyscheduler - Could not send reply for job 3047147564 to 192.168.5.150:51347; saving it in "/tmp/dispy/scheduler/192.168.5.150/random_rollout_with_click_r6wcl6fa/_dispy_job_reply_3047147564"
2018-04-21 14:01:09 dispyscheduler - Could not send reply for job 3047149036 to 192.168.5.150:51347; saving it in "/tmp/dispy/scheduler/192.168.5.150/random_rollout_with_click_r6wcl6fa/_dispy_job_reply_3047149036"
2018-04-21 14:01:09 dispyscheduler - Could not send reply for job 3047149420 to 192.168.5.150:51347; saving it in "/tmp/dispy/scheduler/192.168.5.150/random_rollout_with_click_r6wcl6fa/_dispy_job_reply_3047149420"
2018-04-21 14:01:09 dispyscheduler - Could not send reply for job 3047147052 to 192.168.5.150:51347; saving it in "/tmp/dispy/scheduler/192.168.5.150/random_rollout_with_click_r6wcl6fa/_dispy_job_reply_3047147052"
2018-04-21 14:01:09 dispyscheduler - Could not send reply for job 3047149356 to 192.168.5.150:51347; saving it in "/tmp/dispy/scheduler/192.168.5.150/random_rollout_with_click_r6wcl6fa/_dispy_job_reply_3047149356"
2018-04-21 14:01:09 dispyscheduler - Could not send reply for job 3047145644 to 192.168.5.150:51347; saving it in "/tmp/dispy/scheduler/192.168.5.150/random_rollout_with_click_r6wcl6fa/_dispy_job_reply_3047145644"

And two nodes output these:

2018-04-20 22:12:46 pycos - ignoring resume for !timer_proc/140447388834168: 4
2018-04-20 22:12:46 pycos - ignoring resume for !timer_proc/140511972806160: 1

Seraphli avatar Apr 21 '18 06:04 Seraphli

Can you compress each log file and send each as separate email to me directly? The problem seems to have happened before first 'Ignoring invalid reply'. The size of about 50KB (args) shouldn't be a problem.

pgiri avatar Apr 21 '18 14:04 pgiri

Okay. Which logs do you need exactly? I didn't start dispy with -d, so the output is copied from the terminal. Where should I find the log files?

Seraphli avatar Apr 21 '18 14:04 Seraphli

You can save log files by redirecting output (you may want to use 'tee' as well, so you can see when the problem happens and stop); e.g., 'dispynode -d | tee /tmp/dispynode.log'. Please send logs for client, scheduler and node.
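If the shell pipeline is inconvenient, the same idea can be sketched in Python: run the command, echo each output line to the terminal, and append it to a log file. The helper name is hypothetical. Setting PYTHONUNBUFFERED assumes the child is a Python program, and merging stderr into stdout is a choice made here in case the tool logs to stderr.

```python
import os
import subprocess
import sys


def run_and_tee(cmd, log_path):
    """Run cmd, copying its output to both the terminal and log_path.

    Returns the exit code of the command.
    """
    # Ask a Python child process not to buffer its output when piped.
    env = dict(os.environ, PYTHONUNBUFFERED="1")
    with open(log_path, "w") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # capture log messages written to stderr too
            text=True,
            bufsize=1,  # line-buffered reads on our side
            env=env,
        )
        for line in proc.stdout:
            sys.stdout.write(line)
            sys.stdout.flush()
            log.write(line)
            log.flush()  # keep the file current if the run is interrupted
        return proc.wait()


# Example call (not executed here):
# run_and_tee(["dispynode.py", "-d"], "/tmp/dispynode.log")
```

Standard input is inherited from the parent, so commands can still be typed at the terminal.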

Does it happen if you use JobCluster (instead of SharedJobCluster)? If so, it may make it easier to figure out the problem.

pgiri avatar Apr 21 '18 14:04 pgiri

Yeah, the same problem happened when I used JobCluster. Actually, JobCluster had this problem first, so I tried SharedJobCluster hoping it would solve the problem. I will need to run the node and the code again to get the debug output. It will take some time.

Seraphli avatar Apr 21 '18 14:04 Seraphli

Strange, when I put tee after dispynode, I can't use any node. After I remove tee, just start dispynode, I can use the node again.

Seraphli avatar Apr 21 '18 15:04 Seraphli

I haven't used it that way, but I think when you use it with tee, it runs as a daemon, so no input is possible. However, it should still serve jobs fine. At the end you may have to kill it separately instead of using the "quit" command (as input is not possible).

pgiri avatar Apr 21 '18 15:04 pgiri

I can use commands to control the node, but I can't get job reply from the node. I can see the node get the job and run it, but I can't get the reply while using tee.

Seraphli avatar Apr 21 '18 15:04 Seraphli

Let me test and get back to you.

pgiri avatar Apr 21 '18 15:04 pgiri

Just did a test and it works fine. Note that with 'tee', log is buffered so you may not see output until more output is produced. Also, you can give commands like "quit"; you just don't see prompt until more output is produced due to buffering.

pgiri avatar Apr 21 '18 15:04 pgiri

Still can't get any reply while using tee. Perhaps it is due to tmux? I'm using the following commands to start the scheduler and the nodes:

dispyscheduler.py -i 192.168.5.190 --ext_ip_addr 192.168.5.190 --pulse_interval 2 -d | tee ~/dispy.log
dispynode.py --clean -i 192.168.5.201 --ext_ip_addr 192.168.5.201 -d | tee ~/dispy.log
dispynode.py --clean -i 192.168.5.202 --ext_ip_addr 192.168.5.202 -d | tee ~/dispy.log

Seraphli avatar Apr 21 '18 15:04 Seraphli

It's not about tmux. I'm using Ubuntu 16.04.

Seraphli avatar Apr 21 '18 15:04 Seraphli

As discussed above, how about not using SharedJobCluster (and dispyscheduler)?

pgiri avatar Apr 21 '18 15:04 pgiri

The problem still exists. I have tried it. I'm using another way to create the log files.

Seraphli avatar Apr 21 '18 15:04 Seraphli

I sent an email to your yahoo account. Please check your mailbox.

Seraphli avatar Apr 21 '18 17:04 Seraphli

I attempted to solve the socket.timeout problem myself. I forked your pycos project and made some modifications: https://github.com/Seraphli/pycos/tree/fix/timeout Although this implementation is ugly and sometimes still throws exceptions, it keeps the code running and avoids the Ignoring invalid reply for job output.

Seraphli avatar Apr 24 '18 02:04 Seraphli