
Refactor predictInterval


The time has come to refactor predictInterval. It should always have been multiple functions, since each piece of the prediction interval can be sampled independently. To make future maintenance easier and unit testing more straightforward, the function should be broken up into the following components:

- predict_setup() - does the work of getting the merMod pieces needed to make intervals
- sample_re() - creates the random component of the prediction intervals
- sample_fe() - creates the fixed component of the prediction intervals
- sample_model() - gets the model fit error component (if the user requests it) of the prediction intervals

Then predictInterval becomes a thin wrapper that calls these functions and combines the results into different configurations depending on what the user requests.
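For concreteness, here is a minimal sketch of how the wrapper might compose the pieces (the argument names and the summarize_interval() helper are assumptions for illustration, not a final API):

    predictInterval <- function(model, newdata, level = 0.95, n.sims = 1000,
                                include.resid.var = TRUE, ...) {
      # gather the merMod pieces (fixef, ranef, vcov) needed downstream
      parts <- predict_setup(model, newdata)
      # each interval component is sampled independently
      yhat <- sample_fe(parts, n.sims) + sample_re(parts, n.sims)
      if (include.resid.var) {
        yhat <- yhat + sample_model(parts, n.sims)  # model fit error draws
      }
      # summarize the draws into fit/upr/lwr at the requested level
      summarize_interval(yhat, level)  # hypothetical helper
    }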

I don't think the sub-functions should be exported; predictInterval() should always be the function the user interacts with, even when they only request part of the prediction interval. But this will make unit testing easier moving forward, and should make it easier to adapt to changes in lme4 if they come up, or to incorporate new model types as in issue #101.

What say you @carlbfrederick - does this sound like a good plan? Any feelings on names/conventions/structure? This is mostly off the top of my head.

jknowles avatar Jun 01 '19 16:06 jknowles

Hi @jknowles,

I have my own version of predictInterval that I butchered so I could use it for online predictions behind an API, so I needed it to be very fast (< 1 second per call). Would the above predict_setup() be doing all the calls to ranef() and storing those results, so the actual predictInterval never has to call them again? That's where I found most/all of the time is being spent. I have something like this in my "predict_setup()":

    # cache the expensive ranef() calls once, up front
    cache_obj$model.ranef <- ranef(model.obj)
    cache_obj$model.ranef_condvar <- ranef(model.obj, condVar = TRUE)

And I pass that cache_obj back into predictInterval, and wherever it was calling ranef() I just made it reference the cached version.
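So the call pattern in my API looks roughly like this (the cache argument is my own addition, not part of merTools):

    # built once at startup; ranef() is the expensive part
    cache_obj <- list()
    cache_obj$model.ranef <- ranef(model.obj)
    cache_obj$model.ranef_condvar <- ranef(model.obj, condVar = TRUE)

    # each request then reuses the cached pieces instead of recomputing them
    predictInterval(model.obj, newdata, cache = cache_obj)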

Could I make one additional suggestion: allow the level parameter to be a vector, so you can have a different interval for each row. This would also make it work the same as predict.lm and other predict functions in R, where the level can be given as a vector. I have done this in my own version of predictInterval; it takes about 10 lines of code.

So basically predictInterval(model, newdata, level = c(0.8, 0.9, 0.7)), or more likely used as predictInterval(model, newdata, level = newdata$level).

E.g.:

  # extend functionality to support different levels per row in newdata
  if (length(level) == 1) {
    level <- rep(level, nrow(newdata))
  } else if (length(level) != nrow(newdata)) {
    # a mismatch would silently misalign the per-row quantiles below
    stop("Length of level must be 1 or match the number of rows in newdata")
  }

<snip>

  # Output prediction intervals
  if (stat.type == "median") {
    # old: outs[, 1:3] <- t(apply(yhat, 1, quantile, prob = c(0.5, upCI, loCI),
    #                             na.rm = TRUE))
    outs[, 1:3] <- t(sapply(seq_len(nrow(yhat)), function(i)
      quantile(yhat[i, ], prob = c(0.5, upCI[i], loCI[i]), na.rm = TRUE)))
  }
  if (stat.type == "mean") {
    outs$fit <- apply(yhat, 1, mean, na.rm = TRUE)
    # old: outs[, 2:3] <- t(apply(yhat, 1, quantile, prob = c(upCI, loCI),
    #                             na.rm = TRUE))
    outs[, 2:3] <- t(sapply(seq_len(nrow(yhat)), function(i)
      quantile(yhat[i, ], prob = c(upCI[i], loCI[i]), na.rm = TRUE)))
  }

<snip>

      if (stat.type == "median") {
        # old: pi.comps[[i]] <- t(apply(pi.comps[[i]], 1, quantile,
        #                              prob = c(0.5, upCI, loCI), na.rm = TRUE))
        pi.comps[[i]] <- t(sapply(seq_len(nrow(pi.comps[[i]])), function(x)
          quantile(pi.comps[[i]][x, ], prob = c(0.5, upCI[x], loCI[x]), na.rm = TRUE)))
        pi.comps[[i]] <- as.data.frame(pi.comps[[i]])
        names(pi.comps[[i]]) <- c("fit", "upr", "lwr")
      }
      if (stat.type == "mean") {
        tmp <- pi.comps[[i]]
        pi.comps[[i]] <- data.frame("fit" = rep(NA, N), "upr" = NA,
                                    "lwr" = NA)
        pi.comps[[i]]$fit <- apply(tmp, 1, mean, na.rm = TRUE)
        # old: pi.comps[[i]][, 2:3] <- t(apply(tmp, 1, quantile,
        #                                     prob = c(upCI, loCI), na.rm = TRUE))
        pi.comps[[i]][, 2:3] <- t(sapply(seq_len(nrow(tmp)), function(x)
          quantile(tmp[x, ], prob = c(upCI[x], loCI[x]), na.rm = TRUE)))
      }

humana avatar Jun 02 '22 08:06 humana

This is great - caching the random effects is definitely what I had in mind. Depending on the size of the model, we may even be able to get more speed by caching the simulated matrix instead of generating it each time in sample_re(), for example, but I'd need to do some testing to find the tipping point there.
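Roughly, I'm picturing memoising the draws inside sample_re(), keyed on the number of simulations (a sketch only; simulate_re_draws() stands in for whatever we actually implement):

    sample_re <- function(parts, n.sims, cache = new.env(parent = emptyenv())) {
      # caller passes a persistent environment to get reuse across calls
      key <- paste0("re_sims_", n.sims)
      if (!exists(key, envir = cache, inherits = FALSE)) {
        # draw once from the conditional modes/variances, reuse thereafter
        cache[[key]] <- simulate_re_draws(parts$ranef_condvar, n.sims)
      }
      cache[[key]]
    }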

I like the idea of making the intervals a vector. The original use case for predictInterval was to generate predictions from very large models with millions of rows and hundreds of thousands of levels, so per-row intervals weren't on my mind. But with today's RAM and CPU speeds we don't need to worry about that as much, and I can see the value in many places where people want to compare or visualize different uncertainty levels.

Thanks for the suggestions!

jknowles avatar Jun 02 '22 14:06 jknowles