Workaround for large data objects

Open wcornwell opened this issue 10 years ago • 8 comments

Not really a remake issue, but when reading in a 4.418 GB file, I'm getting this error from digest, probably when it tries to make the hash:

Error in digest::digest(x) : 
  long vectors not supported yet: memory.c:3361

Any ideas for a workaround?

wcornwell avatar Dec 31 '15 04:12 wcornwell

I've changed the workflow so that none of the very large objects appear in the remake.yml file: they're loaded from files and processed within functions, and the only outputs listed in remake.yml are figures.

Now I don't get the digest error, but the whole thing runs many times slower under remake than it does outside of remake. It takes about 5 min outside of remake, but I still haven't got it to finish using remake. Any idea what's going on? Am I accidentally caching something really big?

The datafile that I'm loading is about 10^8 rows...
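
For reference, a minimal sketch of the pattern I've moved to (the function, file, and column names here are made up):

# hypothetical helper sourced by remake: the large file never becomes a
# remake target, so it is never hashed or copied into the object store
make_summary_figure <- function(outfile = "figures/summary.png") {
  dat <- read.csv("data/big_file.csv")          # ~10^8 rows, loaded inside the function
  summ <- aggregate(value ~ group, dat, mean)   # shrink to something small
  png(outfile)
  barplot(summ$value, names.arg = summ$group)
  dev.off()
  outfile
}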

wcornwell avatar Jan 03 '16 07:01 wcornwell

This is a limitation in the digest package, by way of a limitation in R's handling of vectors. I can think of a few workarounds but none of them are wonderful. More of an issue is that with the current remake approach you will end up with a copy of that 4GB file if you just read it in, but I don't think that can be avoided.

Options:

  • for very large files we could use tools::md5sum (can you confirm that this works on your problem case?). This would require tryCatch or a call out to file.info for every file check, though -- see the sketch after this list.
  • Patch digest so that it can support reading in chunks (see http://stackoverflow.com/a/3621316 for examples in another language that at least shows this is possible).
  • Fall back on file size alone or a system utility to compute file sizes?
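
A rough sketch of the first option, assuming the fallback is keyed off file size (the helper name and threshold are illustrative):

# hypothetical: use tools::md5sum() for files too large for digest to handle,
# and digest::digest() for everything else
hash_file <- function(path, threshold = 2^31 - 1) {
  size <- file.info(path)$size
  if (!is.na(size) && size > threshold) {
    unname(tools::md5sum(path))
  } else {
    digest::digest(path, file = TRUE)
  }
}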

Another workaround that you might want to try is check: exists in the file target (e.g., here) which will skip the check of a file. I use this on download targets often.
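
For reference, that looks roughly like this in remake.yml (the target name and command here are invented):

targets:
  data/big_download.csv:
    command: download_data(target_name)
    check: exists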

The speed issue is due to constantly going to and from disk, I think. I would have thought that I'd got rid of most of the issues there with storr, but perhaps not. If you can stick it up somewhere, I'm happy to take a look.

richfitz avatar Jan 04 '16 09:01 richfitz

Development versions of digest support long vectors, so this will go away once it is updated on CRAN. When that happens I'll either depend on the new version or add docs to point people at the right place.
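
In the meantime, the development version can be installed from GitHub (assuming the usual eddelbuettel/digest repository):

# sketch only: install the development version of digest from GitHub
devtools::install_github("eddelbuettel/digest")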

richfitz avatar Jan 04 '16 13:01 richfitz

@jscamac and I just encountered this issue again. It came up when loading lots of MCMC chain outputs with lots of random variables.

remake::make("compiled_rho_comparisons_models")
[  LOAD ] 
[  READ ]                                       |  # loading sources
<  MAKE > compiled_rho_comparisons_models
[ BUILD ] compiled_rho_comparisons_models       |  compiled_rho_comparisons_...
[  READ ]                                       |  # loading packages
Error in writeBin(value, con) : 
  long vectors not supported yet: ../../../../R-3.2.2/src/main/connections.c:4091

Seems digest is struggling to hash the large object (a list of 240 MCMC chains with 4000 samples and up to 8 variables each). Our workaround was to load and process the chains in a single step. This works because the processed object is much smaller, and so easy to hash. Ordinarily we would create an intermediate target, as there are several downstream targets depending on the chain list.
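
A sketch of that single-step workaround (the function name, directory, and the shape of the chain objects are assumptions):

# hypothetical: read every chain file and reduce it to a small summary inside
# a single remake target, so only the small result is hashed and stored
compile_rho_comparisons <- function(chain_dir = "output/chains") {
  files <- list.files(chain_dir, full.names = TRUE)
  chains <- lapply(files, readRDS)                           # large: ~240 chains
  summaries <- lapply(chains, function(x) colMeans(as.matrix(x)))
  do.call(rbind, summaries)                                  # small object returned to remake
}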

Anyway, in reaching this solution we tried two of the other proposed workarounds, which I'll document here for posterity:

  1. Updating the version of digest: @jscamac tried installing the latest version of digest, which supposedly handles long vectors, but without success.
  2. Setting check: exists in remake. This also does not give a suitable workaround because, as far as I can tell, while it prevents remake from checking the hash, it does not prevent the object from being hashed.

In the function remake_update there is a call

current <- remake_is_current(obj, target_name)

This decides whether the object needs to be built, taking into account the setting of the check variable.

If it does need to be built (e.g. it does not already exist), the following call builds the object:

ret <- target_build(target, obj$store, obj$verbose$quiet_target)

This is where the error appears:

Enter a frame number, or 0 to exit   

 1: remake::make("compiled_rho_comparisons_models")
 2: remake_make(obj, target_names)
 3: remake_make1(obj, t, ...)
 4: remake_update(obj, i, check = check, return_target = is_last)
 5: target_build(target, obj$store, obj$verbose$quiet_target)
 6: target_set(target, store, res)
 7: store$objects$set(target$name, value)
 8: self$set_value(value, use_cache)
 9: self$driver$set_object(hash, value_dr)
10: writeBin(value, con)

Importantly, the object seems to get hashed by digest irrespective of whether the target is set to check: exists or not.

dfalster avatar May 03 '16 04:05 dfalster

Thanks for the detailed report.

It actually looks like this is triggering during writing the object out with writeBin (which is itself called by saveRDS(value, self$name_hash(hash), compress=self$compress)). That function is called after the hash is computed so it looks like digest is doing the right thing here.

Will see if I can mock something up in storr...

richfitz avatar May 03 '16 07:05 richfitz

Hmm, I can't reproduce here:

dat <- raw(2^31 + 100)
path <- tempfile()
saveRDS(dat, path, compress=FALSE)

Can you let me know what version of R you're running, in case this is something that R has recently changed? (I am running 3.2.4.) Otherwise, is there a way you can pass me the whole project without having to run all the chains?

richfitz avatar May 03 '16 07:05 richfitz

We were running R version 3.2.2 (2015-08-14) -- "Fire Safety". I'll send you a copy of the folder with chain outputs.

jscamac avatar May 03 '16 13:05 jscamac

Thanks Rich, I totally missed that the issue was not with hashing after all.

I confirm that I can reproduce the error on my machine. Instructions for doing this have been posted in Slack.

dfalster avatar May 04 '16 02:05 dfalster