
Possible bug repacking large objects (difficult to reproduce)

kendonB opened this issue 7 years ago • 22 comments

A couple of times I've generated a large target and got the message "Repacking large object".

Afterwards, I get the error Error: hash '7589699895aede1878253021f5f97690' not found, and I have to delete the target and build it again; it works the second time. I'm guessing something goes wrong with writing one of the rds files.

kendonB avatar Oct 23 '17 08:10 kendonB

I recommend submitting this issue to storr. You get the "Repacking large object" message when you try to store something 2 GB or larger. (@richfitz, why did you decide to use saveRDS(unserialize(...))? Does it help write large files faster?) The hash ... not found message also comes from storr. Can you reproduce the error with something like the following?

value <- get_something_over_2_GB() # placeholder: any object that serialises to 2 GB or more
s <- storr::storr_rds("mystorr")
s$set(key = "large", value = value)
rm(value)
another <- s$get(key = "large")

Then, your data should be one of the RDS files in mystorr/data.

wlandau avatar Oct 23 '17 12:10 wlandau

I tried running your code above. It's been running for several minutes, and I assume that's because storr is compressing?

kendonB avatar Oct 25 '17 09:10 kendonB

Something like that, I think: compressing and/or serializing. @richfitz would know for sure.

wlandau avatar Oct 25 '17 11:10 wlandau

Sorry, I have been away this week...

This issue is the intersection of a couple of limitations in storr and R, and I am hopeful that it will eventually go away. It is definitely a storr issue (and most likely a bug, though it could possibly be remake relying on a storr implementation detail that has changed).

The reason for the peculiar saveRDS(unserialize(...)) construction is:

  • Most of the time, storr (well, the rds backend) can receive raw vectors containing the serialised object, which avoids serialising the object twice: once for hashing and once for saving. This works great for non-large objects. We then save the raw vector directly with writeBin.
  • writeBin does not support long vectors! So if you have an object that serialises to more than 2 GB, we can't save the raw vector that way.
  • I used to write out the long vector bit by bit, but that turned out to be painfully slow. It's faster to unserialize it and use saveRDS directly onto a connection (see https://github.com/richfitz/storr/commit/93ee5370b7c4cf303594df2c6f54b3427e492416); a sketch of this fallback follows below.
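
In rough terms, the fallback might look like the following (a minimal sketch with a hypothetical save_rds_payload() helper, not storr's actual source):

# Write the serialised raw vector directly when it fits within
# writeBin()'s 2^31 - 1 byte limit; otherwise round-trip through
# unserialize()/saveRDS(), which streams to a connection.
save_rds_payload <- function(raw_value, path) {
  if (length(raw_value) < 2^31) {
    con <- file(path, "wb")
    on.exit(close(con))
    writeBin(raw_value, con)
  } else {
    saveRDS(unserialize(raw_value), path)
  }
}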

I personally find R intolerably slow with these sorts of objects anyway!

But regardless of speed and weirdness, it should work. If you can create a reproducible example (in either remake or storr), I'll get it fixed.

richfitz avatar Oct 30 '17 10:10 richfitz

Hi -

Has this ever been addressed?

I'm running into the same issue. The "Repacking large object" step takes painfully long. Happy to provide more details, but it happens across multiple steps.

klar-C avatar Nov 04 '19 22:11 klar-C

Not for remake, as far as I know. But remake's successor drake now supports specialized data formats that avoid storr entirely. Data frames and data.table objects are particularly fast if you select an fst-based format. Details: https://ropenscilabs.github.io/drake-manual/plans.html#special-data-formats-for-targets
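
For instance, a plan could opt a data frame target into the fst format like this (make_big_df() is a hypothetical stand-in for your own function; format = "fst" requires the fst package):

library(drake)

plan <- drake_plan(
  # format = "fst" stores the data frame via fst instead of storr,
  # which is much faster for large tabular data.
  big_df = target(make_big_df(), format = "fst")
)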

wlandau avatar Nov 05 '19 01:11 wlandau

Okay thanks for the swift reply.

Is it possible to keep the cache within R memory (I mostly want to use drake to manage the complexity of all the steps, rather than being able to restore cache)? Or is drake.hasty the only way to do that at the moment?

klar-C avatar Nov 05 '19 01:11 klar-C

Yes. storr has an environment-based backend, and you can supply it to the cache argument of drake::make(). Just be aware that drake will also store things in the .drake/ folder if you select custom data formats or leave history = TRUE in make().

library(drake)
load_mtcars_example()
cache <- storr::storr_environment()

make(my_plan, cache = cache)
#> target large
#> target small
#> target regression2_large
#> target regression1_large
#> target regression2_small
#> target regression1_small
#> target coef_regression2_large
#> target summ_regression2_large
#> target coef_regression1_large
#> target summ_regression1_large
#> target coef_regression2_small
#> target summ_regression2_small
#> target coef_regression1_small
#> target summ_regression1_small
#> target report

readd(summ_regression2_large, cache = cache)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  -5.646  -2.649  -0.519   0.000   2.639   6.945

cache$get("summ_regression2_large")
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  -5.646  -2.649  -0.519   0.000   2.639   6.945

Created on 2019-11-04 by the reprex package (v0.3.0)

wlandau avatar Nov 05 '19 01:11 wlandau

Thanks that worked.

I'm now having other issues with storr: cache$archive_export (for intermediate workspace saves) gives me Error: hash 'db2ded51d81a4403d8a4bd25fa1e57ee' not found.

But I guess that's more something for storr rather than drake.

EDIT: Looks a bit similar to the issue you pointed out: https://github.com/richfitz/storr/issues/93

klar-C avatar Nov 05 '19 14:11 klar-C

If it's that issue, it's just that the object you're trying to save can't be reliably serialised.

richfitz avatar Nov 05 '19 15:11 richfitz

Okay thanks.

I was able to use the import/export method instead, and that worked, but now drake doesn't recognize the history anymore (unclear if I should post this in the drake repository at this point).

Basically, what I'm trying to do is run the session in memory and then save it down every 10 steps or so. But I also need to be able to restore it from those saved-down sessions. The code below runs rC the second time without recognizing that it is already part of the cache.

# First run.
library(drake)
library(purrr) # for %>% and map() below

cache <- storr::storr_environment()
cache2 <- storr::storr_rds('C:/Temp/storr_test')

plan <- drake_plan(
  rC = target(command = (function() {
    cat('Run.\n')
    data.frame(c(1:10))
  })())
)

make(plan, targets = c('rC'), cache = cache, history = FALSE)
cache2$import(cache)

# Get it back in.
.rs.restartR() # (+ cleaned session)
cache2 <- storr::storr_rds('C:/Temp/storr_test')
cache <- storr::storr_environment()
cache$import(cache2)
cache$list() %>% map(~cache$get(.x))
make(plan, targets = c('rC'), cache = cache, history = FALSE)

klar-C avatar Nov 05 '19 16:11 klar-C

To manage the logistics of special data formats, drake takes its storr and wraps it in a decorated storr. storr_rds() and storr_environment() give you ordinary storrs, while drake_cache() gives you a decorated storr that can handle special formats and history. I adapted $import() and $export() to work with the decorated storr, but $archive_import() and $archive_export() still defer to storr itself. And the history needs to be handled separately through txtq. Example: this workflow imports both the cache and the history here.

Out of curiosity, what advantage do you see in starting with an environment storr and periodically exporting it? Why not start with a file system storr from the beginning?

wlandau avatar Nov 05 '19 16:11 wlandau

My understanding is that drake does not keep any of its cache in memory when it is a filesystem storr. So by using a filesystem storr, I would run into the same issue I had at the very beginning ("Repacking large object")? I have various other types (R6 objects, lists) as the result of each step, so unfortunately I can't just use fst.

I'm adjusting steps a lot while developing, so keeping everything in memory saves me the load time (and again avoids the "repacking" issue). Each step takes quite a while to calculate and the resulting files are large; at the same time, my machine has enough memory, so this seems like the ideal setup.

Thanks for the link. It looks like drake can restore a cache from another filesystem folder, but it doesn't look like it can restore a cache into an environment such as storr::storr_environment()? Am I out of luck if I want to store sessions to the filesystem and at the same time load them back into storr_environment-like sessions?

  # td here is a directory path from the linked example.
  my_cache <- drake::drake_cache(path = ".drake")
  ci_cache <- drake::drake_cache(path = file.path(td, ".drake"))
  my_cache$import(ci_cache)
  my_cache$history$import(ci_cache$history)

klar-C avatar Nov 05 '19 17:11 klar-C

BTW, I'm seeing some strange behavior when running the in-memory cache and then running one other step (even just a repetition of the same steps).

It seems to reload everything (or I'm not sure what it does, but it takes longer than the actual calculation steps) and heavily uses memory. Again: if you want me to post this on the main drake thread, happy to do so.

klar-C avatar Nov 05 '19 18:11 klar-C

> My understanding is that drake does not keep any of its cache in memory when it is a filesystem storr. So by using a filesystem storr, I would run into the same issue I had at the very beginning ("Repacking large object")? I have various other types (R6 objects, lists) as the result of each step, so unfortunately I can't just use fst.

With target(format = "rds"), storr does not repack the object. Instead, drake saves it as an ordinary RDS file and then hashes the file.
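
For example (big_data() is a hypothetical stand-in for your own function):

library(drake)

plan <- drake_plan(
  # format = "rds" bypasses storr for this target's value: drake writes
  # one ordinary RDS file and hashes that file.
  big = target(big_data(), format = "rds")
)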

Do those R6 objects and lists need to be targets? It is worth taking time to think about what should be a target and what should not: https://ropenscilabs.github.io/drake-manual/plans.html#how-to-choose-good-targets. R6 objects in particular are brittle when it comes to tracking changes. Depending on the functions you write to support the plan, you may be able to design targets that are easier to work with.

> Thanks for the link. It looks like drake can restore a cache from another filesystem folder, but it doesn't look like it can restore a cache into an environment such as storr::storr_environment()? Am I out of luck if I want to store sessions to the filesystem and at the same time load them back into storr_environment-like sessions?

You can use drake:::decorate_storr() to turn an environment storr into a drake decorated storr. Then, importing should work, although I have not tried it myself.
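
A sketch of that suggestion (untested, as noted above; decorate_storr() is internal to drake, so its signature may change between versions):

library(drake)

# Wrap a plain environment storr so drake can attach history and
# special-format handling, then import from a file-based drake cache.
cache <- drake:::decorate_storr(storr::storr_environment())
old_cache <- drake_cache(path = ".drake")
cache$import(old_cache)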

> It seems to reload everything (or I'm not sure what it does, but it takes longer than the actual calculation steps) and heavily uses memory. Again: if you want me to post this on the main drake thread, happy to do so.

This is unsurprising for in-memory caches. And in the general case, drake chooses speed over memory unless you take action as described at https://ropenscilabs.github.io/drake-manual/memory.html.

wlandau avatar Nov 05 '19 19:11 wlandau

And of course, feel free to post an issue to drake's repo.

wlandau avatar Nov 05 '19 19:11 wlandau

Thanks!

Understood and agreed on the R6 point. It was more a convenience for me than a design choice I can't live without. For the lists, I'll go with rds.

Regarding the increasing memory: does that mean that if I run e.g. 10 steps and then run 1 more, drake will load all the results of the first 10 steps into memory before executing the additional step? Note that the 11th step only references 1-2 results from the previous 10 steps.

I tried to use the make(plan, targets=c('target'),memory_strategy='none') command.

It errors out when the dependencies are not set up with readd (the very first command uses that dependency), which means it's doing what it's supposed to do. However, it takes a long time before the error pops up, and I again see the same increasing memory profile. Is there any other loading I would need to turn off?

EDIT: After some tests it looks like restarting it all, and then consistently using make(plan, targets=c('...'),memory_strategy='none',garbage_collection=TRUE) does the trick.

EDIT2: Are there any special underlying steps when an operation fails? Even though I specified all the objects as either rds or fst, when an operation fails I get the two messages below, and it takes quite a while before the actual error message appears.

fail obj
Repacking large object

klar-C avatar Nov 05 '19 21:11 klar-C

> Regarding the increasing memory: does that mean that if I run e.g. 10 steps and then run 1 more, drake will load all the results of the first 10 steps into memory before executing the additional step? Note that the 11th step only references 1-2 results from the previous 10 steps.

It will only load the direct dependencies that are not already in memory: in your case, the first couple of results out of those 10. (Code for the memory management is here.)

> EDIT: After some tests it looks like restarting it all, and then consistently using make(plan, targets=c('...'),memory_strategy='none',garbage_collection=TRUE) does the trick.

Glad to hear it!

> EDIT2: Are there any special underlying steps when an operation fails? Even though I specified all the objects as either rds or fst, when an operation fails I get the two messages below, and it takes quite a while before the actual error message appears.

The "fail obj" message comes from here. After that, it stores metadata about the error, including a traceback, which should be no more than 4 or 5 KB. I am surprised you are still seeing "repacking large object", especially since you are setting formats. What version of drake are you using?

wlandau avatar Nov 06 '19 02:11 wlandau

I'm using the one from GitHub.

remotes::install_github("ropensci/drake")

It honestly seems like it's again loading things into memory (meaning I again see that pyramid-like memory profile). I'll try to build a reproducible example (I can't send the actual data I'm working on).

And to be clear: I'm only seeing the "large objects" message when it errors out.

klar-C avatar Nov 06 '19 12:11 klar-C

One follow-up point:

When using make(plan, targets=c('target'),memory_strategy='none',garbage_collection=TRUE), it sometimes wants to rerun very early targets, even though I clearly specify what I want to run.

Is there some way to overwrite this?

cached() clearly shows the objects, but for some reason drake thinks they need to be rerun.

This applies both to my current case (where it happens somewhat randomly) and to other cases, e.g. when I want to adjust steps (and reload the plan) but keep using the old results of earlier steps.

(I understand that drake wants to be careful, and rerun when in doubt, but it is very costly time-wise during development.)

klar-C avatar Nov 06 '19 22:11 klar-C

Would you post follow-up questions to drake's own issue tracker? We are diverging off this thread's original topic.

drake::make() deliberately goes through the upstream dependencies of the targets you select. To jump straight to the target you want and use old, outdated dependencies, consider drake_build(). If you think some targets should be up to date but find them invalidated, have a look at deps_profile().
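
Roughly (a sketch assuming the drake API of that era; rC is the target from your earlier plan):

library(drake)

config <- drake_config(plan)      # static analysis of the plan
drake_build(rC, config = config)  # build rC alone, skipping upstream targets
deps_profile(rC, config = config) # show which trigger invalidated rC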

drake::make() is sensitive to the functions you load into your session. This behavior helps drake live and operate entirely within R, which is one of its original design goals, but it also comes with some brittleness. r_make(), on the other hand, always spins up a predictable new session before calling make(). Details: https://ropenscilabs.github.io/drake-manual/projects.html#safer-interactivity

wlandau avatar Nov 07 '19 02:11 wlandau

Okay will do. Thanks again for the quick replies!

klar-C avatar Nov 07 '19 12:11 klar-C