
removeJob causes resubmission of another task which is peacefully running

telenskyt opened this issue 4 years ago • 2 comments

removeJob() causes "Lost result: resubmitting task" warnings in another job whose tasks are already running without any problem - so those tasks actually run twice!

Steps to reproduce (on Linux; doRedis version 2.0.0)

Step 1 - in a separate R session, run server workers:

queue <- 'jobs3'
server <- '10.0.67.2'

require(doRedis)
registerDoRedis(queue, server) 
startLocalWorkers(8, queue, server)

Step 2 - in two separate R sessions, run two different "masters" (clients):

queue <- 'jobs3'
server <- '10.0.67.2'

require(foreach)
require(doRedis)
registerDoRedis(queue, server) 

PID <- Sys.getpid()
cat("Master PID = ", PID, " running.\n")

foreach (i = 1:2) %dopar% {
	while (TRUE) { A <- rnorm(100000); Sys.sleep(0.05) } # eat some CPU, so that we can easily see the load, but not too much
}

Step 3 - Now suppose you realize that you made a mistake running one of those clients and you want to stop it - just that one. You start by killing that client (kill [PID], or, if it is running in the foreground of a terminal, Ctrl-Z, then jobs -l and kill).
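For clarity, Step 3 as a minimal shell sketch. The sleep process here is only a stand-in for the master R session; in the real reproduction the PID is the one printed via Sys.getpid() at startup (e.g. 10863):

```shell
# Stand-in for the doomed master R session; in the real scenario this is
# the client R process whose PID was printed via Sys.getpid() (e.g. 10863).
sleep 60 &
MASTER_PID=$!

# Step 3: kill just that master; the workers and the other master keep running.
kill "$MASTER_PID"
wait "$MASTER_PID" 2>/dev/null || true
echo "master $MASTER_PID killed"
```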

Step 4 - removeJob(). Say that the killed master's PID was 10863, and the one still running is 10837. The workers are still running.

> tasks()
  queue    id     user           master                time  iter               host  pid
1 jobs3 10837 telensky telensky-vypocty 2020-04-10-23:22:15 1...1 telensky-vypocty-3 8191
2 jobs3 10863 telensky telensky-vypocty 2020-04-10-23:22:22 1...1 telensky-vypocty-3 8185
3 jobs3 10837 telensky telensky-vypocty 2020-04-10-23:22:15 2...2 telensky-vypocty-3 8211
4 jobs3 10863 telensky telensky-vypocty 2020-04-10-23:22:22 2...2 telensky-vypocty-3 8198
> jobs()
  queue    id     user             host                    time
1 jobs3 10863 telensky telensky-vypocty 2020-04-10-23:22:22.976
2 jobs3 10837 telensky telensky-vypocty 2020-04-10-23:22:14.986
> x <- jobs()[1,]
> x
  queue    id     user             host                    time
1 jobs3 10863 telensky telensky-vypocty 2020-04-10-23:22:22.976
> removeJob(x)

Now wait a few seconds... and then:

> jobs()
  queue    id     user             host                    time
1 jobs3 10837 telensky telensky-vypocty 2020-04-10-23:22:14.986
> tasks()
  queue    id     user           master                time  iter               host  pid
1 jobs3 10837 telensky telensky-vypocty 2020-04-10-23:22:15 1...1 telensky-vypocty-3 8191
2 jobs3 10837 telensky telensky-vypocty 2020-04-10-23:30:55 2...2 telensky-vypocty-3 8251
3 jobs3 10863 telensky telensky-vypocty 2020-04-10-23:22:22 1...1 telensky-vypocty-3 8185
4 jobs3 10837 telensky telensky-vypocty 2020-04-10-23:22:15 2...2 telensky-vypocty-3 8211
5 jobs3 10863 telensky telensky-vypocty 2020-04-10-23:22:22 2...2 telensky-vypocty-3 8198
6 jobs3 10837 telensky telensky-vypocty 2020-04-10-23:30:55 1...1 telensky-vypocty-3 8231

You can see that the tasks of the healthy job 10837 got resubmitted while they are still computing, so that job is now computed twice! The master 10837 reports:

Warning in doTryCatch(return(expr), name, parentenv, handler) :
  Lost result: resubmitting task 1
Warning in doTryCatch(return(expr), name, parentenv, handler) :
  Lost result: resubmitting task 2

One could be tempted to argue that killing the master is "not OK", but:

  1. I don't see any other way to stop a job - and some way to stop one job (while keeping the others running) is necessary.
  2. The system should be robust and fault tolerant - it should keep working even if one of the master processes dies, for whatever reason.
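Until this is fixed, one possible workaround (a sketch, not a doRedis feature) is to make the loop body cooperative: poll a sentinel file and return when it appears, so a job can be drained without killing its master. The stop_flag path is hypothetical and assumes the workers see the same filesystem as whoever creates the file (true for the startLocalWorkers() reproduction above, where everything runs on one host):

```r
require(foreach)
require(doRedis)
registerDoRedis('jobs3', '10.0.67.2')

# Hypothetical sentinel file; any path visible to all workers will do.
stop_flag <- '/tmp/jobs3.stop'

foreach (i = 1:2) %dopar% {
	while (!file.exists(stop_flag)) {
		A <- rnorm(100000)
		Sys.sleep(0.05)
	}
	sprintf("iteration %d stopped cooperatively", i)
}

# Later, from any session that sees the same path:
# file.create(stop_flag)   # every loop body returns on its next check
```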

telenskyt avatar Apr 10 '20 21:04 telenskyt

And note that if you omit Step 3 (killing the master) and go straight for removeJob(), the removed job is added back to the queue and all its tasks are submitted again while the old workers are still running them! So both jobs run twice!

See - after Step 2:

> tasks()
  queue    id     user           master                time  iter               host  pid
1 jobs4 10899 telensky telensky-vypocty 2020-04-10-23:46:51 1...1 telensky-vypocty-3 8339
2 jobs4 10893 telensky telensky-vypocty 2020-04-10-23:46:54 2...2 telensky-vypocty-3 8333
3 jobs4 10899 telensky telensky-vypocty 2020-04-10-23:46:51 2...2 telensky-vypocty-3 8347
4 jobs4 10893 telensky telensky-vypocty 2020-04-10-23:46:54 1...1 telensky-vypocty-3 8389
> jobs()
  queue    id     user             host                    time
1 jobs4 10899 telensky telensky-vypocty 2020-04-10-23:46:51.221
2 jobs4 10893 telensky telensky-vypocty 2020-04-10-23:46:54.237

Now, going to Step 4, say we want to stop job 10893:

> x <- jobs()[2,]
> x
  queue    id     user             host                    time
2 jobs4 10893 telensky telensky-vypocty 2020-04-10-23:46:54.237
> removeJob(x)
>
> jobs()
  queue    id     user             host                    time
1 jobs4 10899 telensky telensky-vypocty 2020-04-10-23:46:51.221

... after a few seconds:

> jobs()
  queue    id     user             host                    time
1 jobs4 10899 telensky telensky-vypocty 2020-04-10-23:46:51.221
2 jobs4 10893 telensky telensky-vypocty 2020-04-10-23:46:54.237
> tasks()
  queue    id     user           master                time  iter               host  pid
1 jobs4 10899 telensky telensky-vypocty 2020-04-10-23:46:51 1...1 telensky-vypocty-3 8339
2 jobs4 10893 telensky telensky-vypocty 2020-04-10-23:47:55 2...2 telensky-vypocty-3 8379
3 jobs4 10893 telensky telensky-vypocty 2020-04-10-23:47:55 1...1 telensky-vypocty-3 8356
4 jobs4 10899 telensky telensky-vypocty 2020-04-10-23:47:53 2...2 telensky-vypocty-3 8399
5 jobs4 10893 telensky telensky-vypocty 2020-04-10-23:46:54 2...2 telensky-vypocty-3 8333
6 jobs4 10899 telensky telensky-vypocty 2020-04-10-23:46:51 2...2 telensky-vypocty-3 8347
7 jobs4 10899 telensky telensky-vypocty 2020-04-10-23:47:53 1...1 telensky-vypocty-3 8369
8 jobs4 10893 telensky telensky-vypocty 2020-04-10-23:46:54 1...1 telensky-vypocty-3 8389

Everything runs twice now!

telenskyt avatar Apr 10 '20 21:04 telenskyt

The design is only worker fault tolerant, really. It's pretty clearly discussed in the vignette, so the discussion about step 2 is more or less expected behavior.

But it looks like a different problem you've found with removeJob, and that should be investigated and fixed. Any ideas?

bwlewis avatar Dec 21 '20 16:12 bwlewis