doRedis
removeJob() causes resubmission of tasks in another job that is peacefully running
removeJob() triggers "Lost result: resubmitting task" in another job which is already running with no problem, so its tasks will actually run twice!
Steps to reproduce (on Linux; doRedis version 2.0.0)
Step 1 - in a separate R session, run server workers:
queue <- 'jobs3'
server <- '10.0.67.2'
require(doRedis)
registerDoRedis(queue, server)
startLocalWorkers(8, queue, server)
Step 2 - in two separate R sessions, run two different "masters" (clients):
queue <- 'jobs3'
server <- '10.0.67.2'
require(foreach)
require(doRedis)
registerDoRedis(queue, server)
PID <- Sys.getpid()
cat("Master PID = ", PID, " running.\n")
foreach (i = 1:2) %dopar% {
while (TRUE) { A <- rnorm(100000); Sys.sleep(0.05) } # eat some CPU, so that we can easily see it, but not too much
}
Step 3 - Now, suppose you realize that you made a mistake running one of those clients and you want to stop it - just that one. You start by killing that client (kill [PID], or, when on a console, Ctrl-Z, then jobs -l and kill).
Step 4 - removeJob(). Say that the killed master's PID was 10863, and the one still running is 10837. The workers are still running.
> tasks()
queue id user master time iter host pid
1 jobs3 10837 telensky telensky-vypocty 2020-04-10-23:22:15 1...1 telensky-vypocty-3 8191
2 jobs3 10863 telensky telensky-vypocty 2020-04-10-23:22:22 1...1 telensky-vypocty-3 8185
3 jobs3 10837 telensky telensky-vypocty 2020-04-10-23:22:15 2...2 telensky-vypocty-3 8211
4 jobs3 10863 telensky telensky-vypocty 2020-04-10-23:22:22 2...2 telensky-vypocty-3 8198
> jobs()
queue id user host time
1 jobs3 10863 telensky telensky-vypocty 2020-04-10-23:22:22.976
2 jobs3 10837 telensky telensky-vypocty 2020-04-10-23:22:14.986
> x <- jobs()[1,]
> x
queue id user host time
1 jobs3 10863 telensky telensky-vypocty 2020-04-10-23:22:22.976
> removeJob(x)
Now, wait a few seconds.... and then:
> jobs()
queue id user host time
1 jobs3 10837 telensky telensky-vypocty 2020-04-10-23:22:14.986
> tasks()
queue id user master time iter host pid
1 jobs3 10837 telensky telensky-vypocty 2020-04-10-23:22:15 1...1 telensky-vypocty-3 8191
2 jobs3 10837 telensky telensky-vypocty 2020-04-10-23:30:55 2...2 telensky-vypocty-3 8251
3 jobs3 10863 telensky telensky-vypocty 2020-04-10-23:22:22 1...1 telensky-vypocty-3 8185
4 jobs3 10837 telensky telensky-vypocty 2020-04-10-23:22:15 2...2 telensky-vypocty-3 8211
5 jobs3 10863 telensky telensky-vypocty 2020-04-10-23:22:22 2...2 telensky-vypocty-3 8198
6 jobs3 10837 telensky telensky-vypocty 2020-04-10-23:30:55 1...1 telensky-vypocty-3 8231
You see that the tasks of the healthy job 10837 got resubmitted while they are still computing, so they are now computing twice! The master 10837 reports:
Warning in doTryCatch(return(expr), name, parentenv, handler) :
Lost result: resubmitting task 1
Warning in doTryCatch(return(expr), name, parentenv, handler) :
Lost result: resubmitting task 2
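To see what removeJob() actually deletes in Redis, it can help to inspect the keys directly before and after the call. A small diagnostic sketch, assuming the redux package (which doRedis 2.x uses for its Redis connection) and the same server as above; the key pattern "jobs3*" is illustrative, not a documented doRedis naming guarantee:

```r
# Diagnostic sketch: list which Redis keys for the 'jobs3' queue survive
# removeJob(). Assumes the 'redux' package and the server used above.
library(redux)
r <- hiredis(host = "10.0.67.2")
r$KEYS("jobs3*")   # run before and after removeJob() and compare the output
```

Comparing the two key listings should show whether removeJob() deletes keys that the still-running job (or the master's fault-tolerance check) depends on.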
One could be tempted to argue that killing the master is "not OK", but:
- I don't see any other way to stop a job, and some way to stop one job (while keeping the others running) is necessary.
- The system should be robust and fault tolerant: it should keep working even if one of the master processes dies, for whatever reason.
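Until this is fixed, one crude workaround is to kill only the task processes belonging to the job you want to stop, rather than calling removeJob(). A sketch, assuming the workers run on the local host so that the pid column of tasks() refers to local processes; stop_job_tasks is a hypothetical helper, not part of doRedis:

```r
# Hypothetical helper (not part of doRedis): kill the local task processes
# that belong to a single job id, leaving other jobs' tasks alone.
library(doRedis)
stop_job_tasks <- function(job_id) {
  t <- tasks()                           # snapshot of currently running tasks
  pids <- as.integer(t$pid[t$id == job_id])
  for (p in pids) tools::pskill(p)       # send SIGTERM to each task process
}
# stop_job_tasks(10863)
```

Note that if the job's master is still alive, it will likely resubmit the lost tasks, so this only helps after that master has already been killed.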
And note that if you omit Step 3 (killing the master) and go straight for removeJob, the removed job is added back to the queue and all its tasks are submitted again while the old workers are still running - so both jobs run twice!
See - after Step 2:
> tasks()
queue id user master time iter host pid
1 jobs4 10899 telensky telensky-vypocty 2020-04-10-23:46:51 1...1 telensky-vypocty-3 8339
2 jobs4 10893 telensky telensky-vypocty 2020-04-10-23:46:54 2...2 telensky-vypocty-3 8333
3 jobs4 10899 telensky telensky-vypocty 2020-04-10-23:46:51 2...2 telensky-vypocty-3 8347
4 jobs4 10893 telensky telensky-vypocty 2020-04-10-23:46:54 1...1 telensky-vypocty-3 8389
> jobs()
queue id user host time
1 jobs4 10899 telensky telensky-vypocty 2020-04-10-23:46:51.221
2 jobs4 10893 telensky telensky-vypocty 2020-04-10-23:46:54.237
Now, going to Step 4, say we want to stop job 10893:
> x <- jobs()[2,]
> x
queue id user host time
2 jobs4 10893 telensky telensky-vypocty 2020-04-10-23:46:54.237
> removeJob(x)
>
> jobs()
queue id user host time
1 jobs4 10899 telensky telensky-vypocty 2020-04-10-23:46:51.221
... after a few seconds:
> jobs()
queue id user host time
1 jobs4 10899 telensky telensky-vypocty 2020-04-10-23:46:51.221
2 jobs4 10893 telensky telensky-vypocty 2020-04-10-23:46:54.237
> tasks()
queue id user master time iter host pid
1 jobs4 10899 telensky telensky-vypocty 2020-04-10-23:46:51 1...1 telensky-vypocty-3 8339
2 jobs4 10893 telensky telensky-vypocty 2020-04-10-23:47:55 2...2 telensky-vypocty-3 8379
3 jobs4 10893 telensky telensky-vypocty 2020-04-10-23:47:55 1...1 telensky-vypocty-3 8356
4 jobs4 10899 telensky telensky-vypocty 2020-04-10-23:47:53 2...2 telensky-vypocty-3 8399
5 jobs4 10893 telensky telensky-vypocty 2020-04-10-23:46:54 2...2 telensky-vypocty-3 8333
6 jobs4 10899 telensky telensky-vypocty 2020-04-10-23:46:51 2...2 telensky-vypocty-3 8347
7 jobs4 10899 telensky telensky-vypocty 2020-04-10-23:47:53 1...1 telensky-vypocty-3 8369
8 jobs4 10893 telensky telensky-vypocty 2020-04-10-23:46:54 1...1 telensky-vypocty-3 8389
Everything runs twice now!
The design is really only worker fault tolerant. That is discussed pretty clearly in the vignette, so the behavior around killing the master is more or less expected.
But it looks like you've found a different problem with removeJob, and that should be investigated and fixed. Any ideas?