Task gets resubmitted while a worker is still working on it; last task was not processed at all

Open telenskyt opened this issue 11 years ago • 8 comments

It happened to me that the first task was re-submitted while a worker was still working on it.

Looking at the console of a single worker (and also of the master), the last task (of 4 total) was not processed at all, while the first task was processed twice. foreach returned a list in which the last task's result was missing (a list of length only 3). After the foreach job was done, the worker kept writing this to the console:

    Processing job 1 from queue jobs
    Processing job 1 from queue jobs
    Processing job 1 from queue jobs
    ...

On the master, the code is approximately as sketched below. I always clean the queue using a special trick (see http://stackoverflow.com/q/25947991/684229):

    removeQueue('jobs')
    registerDoRedis('jobs', redis_server)
    r <- foreach (i = c(226, 229, 230, 246), .errorhandling = 'pass', .verbose = TRUE) %dopar% {
       ... really long code (running JAGS using runjags package)
    }
    removeQueue('jobs') # to clean up the queue
    registerDoRedis('jobs', redis_server) # so that worker doesn't stop looping

This is the console output of the verbose foreach loop:

    numValues: 4, numResults: 0, stopped: TRUE
    automatically exporting the following objects from the local environment:
      bird_dataset_nick, counts, envir, formulas, pop_model, read_poi, species, var_nick
    Warning in e$fun(obj, substitute(ex), parent.frame(), e$data) :
      Worker fault, resubmitting task 1.
    got results for task 1
    numValues: 4, numResults: 1, stopped: TRUE
    returning status FALSE
    got results for task 1
    numValues: 4, numResults: 2, stopped: TRUE
    returning status FALSE
    got results for task 2
    numValues: 4, numResults: 3, stopped: TRUE
    returning status FALSE
    got results for task 3
    numValues: 4, numResults: 4, stopped: TRUE
    calling combine function
    evaluating call object to combine results:
      fun(accum, result.1, result.2, result.3)
    returning status TRUE

I use R 3.1.0, doRedis 1.1.1, rredis 1.6.9 and Redis server 2.6.12, all on a single Windows XP host.

I don't know how to reproduce the issue; it only happens sometimes.
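
In the meantime, a defensive check on the master can at least detect the silent loss. A minimal sketch, where sqrt() and 'localhost' merely stand in for the real model code and server address:

    library(foreach)
    library(doRedis)

    registerDoRedis('jobs', 'localhost')

    tasks <- c(226, 229, 230, 246)
    r <- foreach(i = tasks, .errorhandling = 'pass') %dopar% {
       sqrt(i)  # placeholder for the real JAGS run
    }

    # with this bug, r can silently come back shorter than the task list
    stopifnot(length(r) == length(tasks))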

EDIT: another case:

Output of foreach on the master:

    numValues: 4, numResults: 0, stopped: TRUE
    automatically exporting the following objects from the local environment:
      bird_dataset_nick, counts, data, env_formula, envir, euring, f_ind, formula,
      formulas, no_sites, poi, pop_model, read_poi, sciname, species, species.use, var_nick
    Warning in e$fun(obj, substitute(ex), parent.frame(), e$data) :
      Worker fault, resubmitting task 2.
    got results for task 2
    numValues: 4, numResults: 1, stopped: TRUE
    returning status FALSE
    got results for task 1
    numValues: 4, numResults: 2, stopped: TRUE
    returning status FALSE
    got results for task 3
    numValues: 4, numResults: 3, stopped: TRUE
    returning status FALSE
    got results for task 2
    numValues: 4, numResults: 4, stopped: TRUE
    calling combine function
    evaluating call object to combine results:
      fun(accum, result.1, result.2, result.3)
    returning status TRUE

Again, the results for the last task (4) were never collected. Examining the state of the Redis server shows that they remained in the results queue and could still be retrieved:

    > redisKeys("*")
    [1] "jobs:counter"   "jobs:1.results" "jobs:workers"
    > redisGet("jobs:counter")
    [1] "3"
    > redisGet("jobs:workers")
    [1] "2"
    > r <- redisLRange("jobs:1.results", 0, -1)
    > save(r, file = "jobs_1.results.Rdata")
    > str(r)
    List of 1
     $ :List of 1
      ..$ 4:List of 7
      .. ..$ ...
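
For completeness, the stranded value can be pulled back into R by hand. A sketch assuming the key layout shown by str(r) above, using the same rredis client as in this thread:

    library(rredis)
    redisConnect()  # on a password-protected server, follow with redisAuth("...")

    raw <- redisLRange("jobs:1.results", 0, -1)
    # per the str(r) output above, each element is a named list keyed by
    # task number; the lost task 4 sits under the name "4"
    res4 <- raw[[1]][["4"]]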

telenskyt avatar Sep 20 '14 11:09 telenskyt

I have the same problem. Is there a solution or some kind of workaround yet?

thomaskisler avatar Apr 25 '16 15:04 thomaskisler

Yes, this should be fixed in the development version on GitHub; have you tried that?

    devtools::install_github("bwlewis/doRedis")

(Be sure to install the new version of doRedis on all of the computers participating in the cluster.)
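
One quick way to confirm that every machine actually picked up the development build (packageVersion() is part of base R's utils package):

    packageVersion("doRedis")  # run on every machine; all should report the same development version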

Let me know if you still have problems with that version. It is hopefully very close to being ready for a new CRAN release...

bwlewis avatar Apr 25 '16 16:04 bwlewis

Yes, my test case now works with the development version! Many thanks for your quick response and generally for the doRedis package! It has helped me quite a bit!

thomaskisler avatar Apr 26 '16 14:04 thomaskisler

@thomaskisler do you have a test case? Please post it! I have not been able to reproduce the issue! Thanks, Tomas

telenskyt avatar Apr 26 '16 14:04 telenskyt

@tomastelensky Here is my code. I think the key to reproducing the error is setting a password for the redis-server. It took me quite a while to figure this out, because the error stopped appearing after I made the script shorter. So you have to add something like the following to your Redis configuration file (in case you are not doing this already):

    requirepass ThePasswordIsImportantForReproduction

My master R script, which submits 8 tasks to the queue, looks like the following:

library(foreach)
library(doRedis)
library(uuid)

options('redis:num'=TRUE) #fixing the "invalid format '%.0f'; use format %s for character object" bug -> http://stackoverflow.com/questions/31939951/doredis-on-windows-7-gives-error-as-soon-as-foreach-loop-is-run

REDIS_PASSWORD = "ThePasswordIsImportantForReproduction"
REDIS_PORT = 6379

registerDoRedis("jobs",password=REDIS_PASSWORD,port=REDIS_PORT)

tmpDirectory = '/tmp/' #tempdir()
setwd(tmpDirectory)

print(paste("Saving stuff to",tmpDirectory))

taskDF <- data.frame(id=seq(1,8,by=1))

foreach(j = 1:dim(taskDF)[1], 
        .export=c("tmpDirectory","taskDF"),
        .packages = c("uuid")
        ) %dopar% {
  #get the task from the task data frame
  currTask = taskDF[j,]

  #getting something unique
  uniqPart = UUIDgenerate()

  Sys.sleep(180) #instead of the loading/processing

  # write a file for each iteration with a unique part, so every worker saves its own file(s)
  currFileToWrite = paste(tmpDirectory,"/","task-",currTask,"-completed","-",uniqPart,".csv",sep ="")
  print(paste("Writing to file:",currFileToWrite,"-",date()))
  write.csv("Dummy content", currFileToWrite, row.names = F)

  return(T)
}

and the worker script looks like this:

library(doRedis)

REDIS_PASSWORD = "ThePasswordIsImportantForReproduction"
REDIS_PORT = 6379

startLocalWorkers(n=2, queue="jobs",password = REDIS_PASSWORD,port = REDIS_PORT)
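
To tell afterwards whether a run hit the bug, one can count the completion files (a small added check; the pattern just matches the file names written by the master script above):

    # one completion file per finished task is written by the master script above;
    # with the bug present, fewer than 8 appear
    files <- list.files("/tmp/", pattern = "^task-.*-completed-.*\\.csv$")
    length(files)  # expected: 8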

I hope this helps!

thomaskisler avatar Apr 28 '16 09:04 thomaskisler

Thanks @thomaskisler! This is very interesting, because I did not use any Redis password when the issue came up!! And it came up quite often, though never in a predictable way.

telenskyt avatar Apr 28 '16 15:04 telenskyt

Just for future reference: yes, it is possible that these are two unrelated problems. In my case, the error is resolved by the new development version.

thomaskisler avatar May 03 '16 12:05 thomaskisler

@bwlewis, in 2016 you wrote "should be fixed in the development version". What is the current status of this issue? Is this fix in version 2.0.0?

PS: I guess you're the one who's supposed to close it according to the workflow once it's fixed, or is it waiting for me to verify? (I am not very experienced with GitHub.)

telenskyt avatar Apr 08 '20 14:04 telenskyt