
Memory limitations: option to store simulations on disk

Open dschoenig opened this issue 4 years ago • 3 comments

Addresses #188

dschoenig avatar Feb 26 '21 17:02 dschoenig

For testing, this repurposes the example in the vignette to produce a simulation matrix with 1 billion elements.

testData = createData(sampleSize = 5e5, overdispersion = 1.5, family = poisson())
fittedModel <- glm(observedResponse ~ Environment1, family = "poisson", data = testData)
res <- simulateResiduals(fittedModel, 2000, method = "PIT", bigData = TRUE)
plot(res)

Computing the residuals took ~260s for me on a laptop with SSD.

dschoenig avatar Feb 26 '21 18:02 dschoenig

Hi Daniel,

I'm just checking open issues and wanted to say first of all many thanks, I appreciate this PR. I had of course seen it but didn't reply yet because I hadn't thought about it in more detail.

Making DHARMa fit for big data is certainly useful; I'm just not quite sure if doing this via the storage is the way to go. An alternative may simply be to force grouped residuals, i.e. to implement the option from recalculateResiduals directly in simulateResiduals.
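
Just to illustrate what I mean, a minimal sketch (the nGroups argument is hypothetical; the two-step block below it is the route that exists today, with a purely illustrative grouping and number of simulations):

# hypothetical one-step interface: aggregate at simulation time
# res <- simulateResiduals(fittedModel, n = 250, nGroups = 1000)   # nGroups does not exist yet
# roughly equivalent two-step route with the current API:
resFull <- simulateResiduals(fittedModel, n = 250)
predBin <- cut(rank(predict(fittedModel, type = "response")), breaks = 1000)  # illustrative grouping
resGrouped <- recalculateResiduals(resFull, group = predBin)
plot(resGrouped)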

The reason I'm hesitant about the storage route is that I'm not quite sure how helpful it will be to have 1 billion residuals if all the tests / plots crash downstream.

Any thoughts on that?

Cheers, Florian

florianhartig avatar Jun 24 '21 11:06 florianhartig

Hi Florian,

No worries, I needed the ff implementation for a project so it wasn't too much work to put some of this into DHARMa.

I completely agree with you that there's no point in having a large number of residuals if most of the functionality of the package will be compromised.

I personally think there may be two viable alternatives: aggregation (as with grouped residuals) or sub-sampling. I think both would be possible with the functionality already provided by recalculateResiduals (the group and sel arguments, respectively), is that correct?

Both would probably require setting (internally) some kind of permitted maximum number of residuals that would still allow all / most of the tests etc. that the package provides to be used.

When using aggregation, it would then probably come down to finding a sensible default grouping. I think the model predictions could be ranked, and based on their rank they could be assigned to equal-size bins, the number of bins being the maximum number of residuals permitted.
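
Roughly something like the sketch below (maxResiduals is a hypothetical internal cap, and I'm assuming recalculateResiduals also accepts the disk-backed object from the example above):

# rank the model predictions and assign them to equal-size bins, with as many
# bins as the permitted maximum number of residuals
maxResiduals <- 10000                                # hypothetical internal cap
pred <- predict(fittedModel, type = "response")
bins <- cut(rank(pred, ties.method = "first"), breaks = maxResiduals)
resAgg <- recalculateResiduals(res, group = bins)    # one aggregated residual per bin
plot(resAgg)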

For a sub-sampling approach, one could work in a similar fashion (ranking and binning), but choose considerably fewer bins (e.g. a 20th of the permitted maximum number of residuals) and then sample enough observations in each bin (20, to keep with the example) to reach the target number.
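
Again as a rough sketch, with the same hypothetical cap and the numbers from the example:

# fewer bins, then a fixed number of observations sampled within each bin
maxResiduals <- 10000                                # hypothetical internal cap
perBin <- 20
nBins <- maxResiduals / perBin                       # a 20th of the cap, i.e. 500 bins
pred <- predict(fittedModel, type = "response")
bins <- cut(rank(pred, ties.method = "first"), breaks = nBins)
idx <- unlist(lapply(split(seq_along(pred), bins),
                     function(i) i[sample.int(length(i), min(perBin, length(i)))]))
resSub <- recalculateResiduals(res, sel = idx)       # keep only the sampled observations
plot(resSub)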

I'm also slightly in favor of aggregation, but it may be worthwhile testing both approaches.

And just some additional thoughts in this context:

  • If the final number of residuals is still too large for a smooth scatter plot, one could switch to a gray-scale heatmap / 2D-histogram for the residuals vs. predicted plot (see the sketch after this list).
  • For the different tests, I assume they scale quite differently with the number of residuals, and I really wouldn't consider myself an expert when it comes to their respective properties. Still, it might be interesting to explore which tests could be bootstrapped or work with a smaller sub-sample of the residuals. I think Moran's I is a good example: calculations rapidly become prohibitively slow above 100,000 observations, but smaller sub-samples / bootstrapping seem to do the trick in many cases (again, see the sketch below).
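
A quick sketch of both points (smoothScatter is just one possible stand-in for a 2D density plot, the sub-sample size of 5000 is arbitrary, and I'm assuming the usual scaledResiduals / fittedPredictedResponse slots and the x / y coordinates that createData puts into testData):

# residuals vs. predicted as a smoothed 2D density instead of a scatter plot
smoothScatter(res$fittedPredictedResponse, res$scaledResiduals,
              xlab = "Model predictions", ylab = "Scaled residuals")

# Moran's I on a random sub-sample of the observations
idx <- sample(nrow(testData), 5000)
resSel <- recalculateResiduals(res, sel = idx)
testSpatialAutocorrelation(resSel, x = testData$x[idx], y = testData$y[idx])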

dschoenig avatar Jun 30 '21 20:06 dschoenig