
Make DHARMa useable for big data / memory management

florianhartig opened this issue 4 years ago · 2 comments

Connected to #187, I wonder if there is an elegant solution for handling the case where the simulations do not fit into memory. This is just a reminder for future development; there are no immediate plans to implement anything.

florianhartig · Jun 29 '20 14:06

Hi Florian,

I think using the ff package would be the best way to implement this feature, as it allows a large simulation matrix to be stored on disk and also comes with a set of apply functions to make the performance hit bearable. It would also increase the dependencies by only one package.

I decided to give it a try, based on some code I used to compute quantile residuals for very large models. I settled on the following workflow:

  • Introduce a boolean argument bigData that indicates whether simulations should be stored on disk. This argument is stored as $bigData in the DHARMa object.
  • The simulatedResponse element of the DHARMa object is then initialized as an ff_matrix stored in the temporary directory for the current session (as given by tempdir()). The filenames are random strings prefixed with "dharma".
  • The simulation matrix is then filled column by column to keep the memory footprint down. Similarly checkSimulations() proceeds by column.
  • For the actual computation of the residuals, the ffapply() family of functions is used inside getQuantiles(bigData = TRUE) to achieve better performance (a minimal sketch of these steps follows the list).
  • The summary of the simulations calculated inside testGeneric() is also produced with ffapply(), if $bigData of the DHARMa object is TRUE.
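
To illustrate the idea, here is a rough sketch of these steps (not the actual code from my branch); fittedModel, nObs, nSim, simMat, and observed are placeholder names, and I assume a model with a standard simulate() method:

```r
library(ff)

# placeholder inputs; in DHARMa these come from the fitted model and the user
nObs <- nobs(fittedModel)
nSim <- 1000

# file-backed simulation matrix in the session's temporary directory,
# with a random file name prefixed by "dharma"
fileName <- file.path(tempdir(), paste0("dharma",
  paste(sample(c(letters, 0:9), 10, replace = TRUE), collapse = "")))
simMat <- ff(vmode = "double", dim = c(nObs, nSim), filename = fileName)

# fill the matrix column by column to keep the memory footprint down
for (j in seq_len(nSim)) {
  simMat[, j] <- simulate(fittedModel, nsim = 1)[[1]]
}

# PIT-style quantile residuals computed in row chunks via ffrowapply();
# i1 and i2 are the chunk boundaries supplied by the ff apply functions
observed <- model.response(model.frame(fittedModel))
res <- ffrowapply(rowMeans(simMat[i1:i2, , drop = FALSE] < observed[i1:i2]),
                  X = simMat, RETURN = TRUE, FF_RETURN = FALSE)
```

(The randomization for ties that DHARMa's PIT residuals use for discrete responses is omitted here for brevity.)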

Some more comments and gotchas:

  • The R address limit remains 2^31, i.e. R cannot handle the simulation matrix if it has more than 2^31 elements, even if it is stored on disk. I included a corresponding warning that calculates the maximum number of simulations possible given the number of observations of the model. There are ways to circumvent this, but I don't know whether it makes sense to implement them in the package (see below).
  • The maximum memory footprint depends on the BATCHBYTES argument used for the ffapply functions. It currently has to be large enough that two columns of the simulation matrix fit into memory. I used 2^30 as a safe choice, but it may be desirable to include a heuristic that sets this value according to nObs (both points are illustrated in the first sketch after this list).
  • Since using refit = TRUE is generally discouraged and would take even more time with large data sets, I have for now set the two options as mutually exclusive (i.e. no refit is done if bigData = TRUE).
  • It's a little harder to optimize method = "traditional" for memory footprint because of the heuristics involved with checking for the integer response. I cheated a little and had the duplicate test run column-wise, which should be fine if there are enough observations. In addition, generation of the ECDF takes more time (as simulation rows are fetched from disk). In some tests I ran, this becomes very noticeable for larger models, so I included a warning that using the PIT may be a better choice.
  • Saving the DHARMa object (e.g. via saveRDS()) does not include the file-based simulation matrix. ffsave() can be used for the matrix (producing an additional file). It would probably be most convenient to write two helper functions for saving and restoring a DHARMa object (see the second sketch after this list).
  • bigData could be a bad name for the argument, as it's not really big data. Maybe something like onDisk or outOfMemory would make more sense.
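
To make the first two points concrete, the element-limit warning and an nObs-based heuristic for BATCHBYTES could look roughly like this (constants and message wording are placeholders, not the actual implementation; simMat and observed are from the sketch above):

```r
# a single simulation matrix cannot hold more than 2^31 - 1 elements, so the
# number of simulations that fits is bounded by the number of observations
maxSim <- floor((2^31 - 1) / nObs)
if (nSim > maxSim) {
  warning("With ", nObs, " observations, at most ", maxSim,
          " simulations can be stored in a single simulation matrix.")
}

# possible heuristic for BATCHBYTES: room for at least two columns of doubles
# (8 bytes per value), with an (arbitrary) floor of 2^24 bytes
batchBytes <- max(2 * nObs * 8, 2^24)
res <- ffrowapply(rowMeans(simMat[i1:i2, , drop = FALSE] < observed[i1:i2]),
                  X = simMat, RETURN = TRUE, FF_RETURN = FALSE,
                  BATCHBYTES = batchBytes)
```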
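
The saving/restoring helpers could essentially wrap calls like the following; the file names and the simulationOutput object are placeholders, and ffsave() writes an additional pair of files (the R objects plus a zip of the ff files) next to the .rds:

```r
# save: the DHARMa object via saveRDS() (this does not include the on-disk
# simulation data), plus the file-backed matrix via ffsave()
saveRDS(simulationOutput, "dharma_object.rds")
simMat <- simulationOutput$simulatedResponse
ffsave(simMat, file = "dharma_sims")

# restore: read the object, reload the matrix, and reattach it
simulationOutput <- readRDS("dharma_object.rds")
ffload("dharma_sims")                    # recreates simMat in the workspace
simulationOutput$simulatedResponse <- simMat
```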

Apart from general hardware constraints, I see two use cases (and their combination) for storing the simulations on disk:

  1. One wants to substantially increase the number of simulations to stabilize the residuals (e.g. n = 2000).
  2. One would like to work with very large data sets (millions of observations).

For (1), the solution I outlined works out of the box. But (2) will cause most of the nice functionality of DHARMa, such as plots and tests, to become extremely slow, as computation time mostly scales with the number of observations (or residuals). However, I'm not sure whether DHARMa should even accommodate use case (2). There may not be much overlap between people who work with models of that size and those who would like to use DHARMa to evaluate them.

A combination of (1) and (2) would almost certainly run into R's address limit. This is currently the case for some models I'm working with (10^7 observations with 1000 simulations), and it can be dealt with by splitting the simulation matrix column-wise into several objects that are gathered in a list and looped over (a rough sketch follows below). But then again, I don't intend to use DHARMa with these models, and I doubt that anybody would in this case -- so implementing this "splitting" functionality is probably not worth the effort.
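
Purely to illustrate what such a splitting could look like (reusing nObs, nSim, and observed from the sketches above; filling the chunks column by column is omitted):

```r
# split the simulations column-wise into several file-backed matrices,
# each below the 2^31 - 1 element limit, gathered in a list
maxSim    <- floor((2^31 - 1) / nObs)
starts    <- seq(1, nSim, by = maxSim)
simChunks <- lapply(starts, function(s) {
  cols <- min(maxSim, nSim - s + 1)
  ff(vmode = "double", dim = c(nObs, cols),
     filename = tempfile(pattern = "dharma", tmpdir = tempdir()))
})

# summaries over simulations then loop over the list and combine the pieces,
# e.g. counting simulated values below the observed response for PIT residuals
countBelow <- Reduce(`+`, lapply(simChunks, function(chunk) {
  ffrowapply(rowSums(chunk[i1:i2, , drop = FALSE] < observed[i1:i2]),
             X = chunk, RETURN = TRUE, FF_RETURN = FALSE)
}))
pitResiduals <- countBelow / nSim
```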

I'll set up a pull request so you can have a look at the code.

dschoenig · Feb 26 '21 17:02

Hi Daniel,

many thanks for this, it looks great at first glance. I have to apologise in advance that I probably won't be able to respond to this immediately; I am super busy this week, but will try to look at the PR as soon as possible.

Best, Florian

florianhartig · Mar 01 '21 13:03