DHARMa
Make DHARMa useable for big data / memory management
Connected to #187, I wonder if there is any elegant solution to handle the case where the simulations do not fit into memory. This is just a reminder for future development, no immediate plans to implement anything.
Hi Florian,
I think using the `ff` package would be the best way to implement this feature, as it allows a large simulation matrix to be stored on disk and also comes with a set of apply functions to make the performance hit bearable. It would also increase the dependencies by only one package.
I decided to give it a try, based on some code I used to compute quantile residuals for very large models. I settled on the following workflow:
- Introduce a boolean argument `bigData` that indicates whether simulations should be stored on disk. This argument is stored as `$bigData` in the `DHARMa` object.
- The `simulatedResponse` element of the `DHARMa` object is then initialized as an `ff_matrix` stored in the temporary directory for the current session (as given by `tempdir()`). The file names are random strings prefixed with "dharma".
- The simulation matrix is then filled column by column to keep the memory footprint down (see the sketch after this list). Similarly, `checkSimulations()` proceeds column by column.
- For the actual computation of the residuals, the `ffapply()` family of functions is used inside `getQuantiles(bigData = TRUE)` to achieve better performance.
- The summary of the simulations calculated inside `testGeneric()` is also produced with `ffapply()` if `$bigData` of the `DHARMa` object is `TRUE`.
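To make the first steps concrete, here is a minimal sketch of how the on-disk matrix could be set up and filled column by column. The names `nObs`, `nSim`, and the `rnorm()` call are placeholders I chose for illustration (the real code would use the model's simulate method); only the "dharma" prefix and the use of `tempdir()` follow the description above.

```r
library(ff)

nObs <- 1e5   # illustrative number of observations
nSim <- 250   # illustrative number of simulations

# ff_matrix backed by a file in the session's temporary directory
simulatedResponse <- ff(
  vmode    = "double",
  dim      = c(nObs, nSim),
  filename = tempfile(pattern = "dharma", tmpdir = tempdir())
)

# fill column by column so that only one column lives in memory at a time
for (j in seq_len(nSim)) {
  simulatedResponse[, j] <- rnorm(nObs)   # stand-in for one simulation
}
```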
Some more comments and gotchas:
- The R address limit remains 2^31, i.e. R cannot handle the simulation matrix if it has more than 2^31 elements, even if it is stored on disk. I included a corresponding warning that calculates the maximum number of simulations possible given the number of observations of the model. There are ways to circumvent this, but I don't know if it makes sense to implement them in the package (see below).
- The maximum memory footprint depends on the `BATCHBYTES` argument used for the `ffapply()` functions. It currently has to be large enough for two columns of the simulation matrix to fit into memory. I used 2^30 as a safe choice, but it may be desirable to include some heuristics to set this value according to `nObs`.
- Since using `refit = TRUE` is generally discouraged and will take even more time with large data sets, I have for now made the two options mutually exclusive (i.e. no refit is done if `bigData = TRUE`).
- It's a little harder to optimize `method = "traditional"` for memory footprint because of the heuristics involved in checking for an integer response. I cheated a little and ran the duplicate test column-wise, which should be fine if there are enough observations. In addition, generating the ECDF takes more time (as simulation rows are fetched from disk). In some tests I ran, this becomes very noticeable for larger models, so I included a warning that using the PIT may be a better choice.
- Saving the `DHARMa` object (e.g. via `saveRDS()`) does not include the file-based simulation matrix. `ffsave()` can be used for the matrix (producing an additional file). It would probably be most convenient to write two helper functions for saving and restoring a `DHARMa` object; a rough sketch follows after this list.
- `bigData` could be a bad name for the argument, as it's not really big data. Maybe something like `onDisk` or `outOfMemory` would make more sense.
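For the save/restore point, something along these lines could work. The helper names `saveDHARMa()`/`loadDHARMa()` and the assumption that the matrix lives in `$simulatedResponse` are mine; `ffsave()`/`ffload()` are the ff functions mentioned above, and details such as the restore path when loading in a fresh session would still need care.

```r
library(ff)

# Sketch: save a DHARMa object whose simulatedResponse is an ff_matrix.
# saveRDS() alone would drop the on-disk data, so the ff part is written
# separately with ffsave() (producing <file>.ffData and <file>.RData).
saveDHARMa <- function(simulationOutput, file) {
  sim <- simulationOutput$simulatedResponse
  ffsave(sim, file = file)
  simulationOutput$simulatedResponse <- NULL   # strip the ff part before saveRDS
  saveRDS(simulationOutput, paste0(file, ".rds"))
}

loadDHARMa <- function(file) {
  simulationOutput <- readRDS(paste0(file, ".rds"))
  env <- new.env()
  ffload(file, envir = env)                    # restores the object named 'sim'
  simulationOutput$simulatedResponse <- env$sim
  simulationOutput
}
```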
Apart from general hardware constraints, I see two use cases (and their combination) for storing the simulations on disk:
- One wants to strongly increase the number of simulations to stabilize the residuals (e.g. `n = 2000`).
- One would like to work with very large data sets (millions of observations).
For (1), the solution I outlined works out of the box. But (2) will cause most of the nice functionality of DHARMa, such as plots and tests, to become extremely slow, as computation time mostly scales with the number of observations (or residuals). However, I'm not sure if DHARMa should even accommodate use case (2). There may not be much overlap between the people that work with models of that size and those that would like to use DHARMa to evaluate them.
A combination of (1) and (2) would almost certainly run into R's address limit. This is currently the case for some models I'm working with (10^7 observations with 1000 simulations), and it can be dealt with by splitting the simulation matrix column-wise into several objects that are gathered in a list and looped over. But then again, I don't intend to use DHARMa with these models, and I doubt that anybody would in this case -- so implementing this "splitting" functionality is probably not worth the effort.
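Just to illustrate what I mean by splitting (this is not something I propose adding to DHARMa), the idea is to keep several smaller ff matrices in a list so that no single object exceeds the element limit. The small dimensions and the artificial `maxElements` stand-in below are only there to keep the demo cheap; in practice the limit would be 2^31 - 1.

```r
library(ff)

nObs <- 1e4
nSim <- 50
maxElements <- 1e5                                # stand-in for 2^31 - 1
maxSimPerObject <- floor(maxElements / nObs)      # columns allowed per ff matrix

starts <- seq(1, nSim, by = maxSimPerObject)
simChunks <- lapply(starts, function(s) {
  k <- min(maxSimPerObject, nSim - s + 1)
  ff(vmode = "double", dim = c(nObs, k),
     filename = tempfile(pattern = "dharma", tmpdir = tempdir()))
})

length(simChunks)   # number of column blocks to loop over downstream
```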
I'll set up a pull request so you can have a look at the code.
Hi Daniel,
many thanks for this, it looks great at first glance. I have to apologise in advance that I probably won't be able to respond to this immediately; I am super busy this week, but will try to look at the PR as soon as possible.
Best, Florian