DHARMa icon indicating copy to clipboard operation
DHARMa copied to clipboard

DHARMa.ecdf function

Open florianhartig opened this issue 3 years ago • 4 comments

See https://github.com/florianhartig/DHARMa/issues/195

florianhartig avatar Aug 22 '20 13:08 florianhartig

Hello Florian,

i'd like to do something productive on the upcoming weekend and i figured this would be a good place to start. However before i do, it would be good, if you clarified the intended behavior of DHARMa.ecdf.

I've taken a look at the current behavior and would like to now waht values it should take for x_check when given x1 and x2

DHARMa.ecdf <- function (x)
{
  x <- sort(x)
  n <- length(x)
  if (n < 1)
    stop(paste("DHARMa.ecdf - length vector < 1", x))
  vals <- unique(x)
  rval <- approxfun(vals, cumsum(tabulate(match(x, vals)))/ (n +1),
                    method = "linear", yleft = 0, yright = 1, ties = "ordered")
  class(rval) <- c("ecdf", "stepfun", class(rval))
  assign("nobs", n, envir = environment(rval))
  attr(rval, "call") <- sys.call()
  rval
}
#almost zero
e <- 10^-14
# interesting values
x_check <- c(-0.5, -e, 0, e, 0.5, 1-e, 1, 1+e, 1.5)

x1 <- rep(0, 10)
#error
DHARMa.ecdf(x1)

# functional example
x2 <- rep(0:1, 5) 
loc_fun <- DHARMa.ecdf(x2)
loc_fun(x_check)
plot(as.factor(x_check), loc_fun(x_check))
abline(h = 1-1/(length(x2)+1), lty = 2)
curve(loc_fun(x), xlim = c(-0.1, 1.5))
abline(h = 1-1/(length(x2)+1), lty = 2)

Istalan avatar Sep 29 '20 19:09 Istalan

Hi Lukas,

thanks for that, good idea. I honestly can't recall what the reason was to implement a special ecdf, but I suspect my reasoning was that the default R ecdf is a step function, as you can see in

> ecdf(c(0,1))(0.5)
[1] 0.5
> ecdf(c(0,1))(0.2)
[1] 0.5

and I recall that I would have liked a more continuous ecdf, so that the residuals in the extreme case of few simulations are not all being set to the same value. As you can see, the current ecdf for (0,1) spreads the values 1/3 out 1/3 (linear extrapolated), and 1/3 out again.

> DHARMa:::DHARMa.ecdf(c(0,1))(0.2)
[1] 0.4
> DHARMa:::DHARMa.ecdf(c(0,1))(0.5)
[1] 0.5
> DHARMa:::DHARMa.ecdf(c(0,1))(0.1)
[1] 0.3666667

I'm not sure if that's really the best solution, but it seems better to me than the default.

florianhartig avatar Oct 02 '20 08:10 florianhartig

In general, feel free to modify this function, I guess a base question though (before coding) is how a sensible ecdf should be defined in this context.

p.s.: two other items that could be interesting to tackle is a better / more stable solution for the outlier test, and the performance of the dispersion test (see recent issue) - I think that the current dispersion test is not very sensitive if REs are re-simulated, and sensitivity increases if one switches this off. However, it's also not clear to me if the test statistics that I'm using is ideal (I never made a systematic comparison). In general, the question how to best specify a dispersion test for this simulated quantile residuals (and if the current test could be improved) would be something worth looking at, and it is on my list, just no time so far.

Best, Florian

florianhartig avatar Oct 02 '20 08:10 florianhartig

Hi Florian,

I had kind of been hoping for an easy win, but i see that this question requires a bit more thought than i had hoped. I've tried some stuff with density-kernels, but i'm increasingly certain that they are unsuited for this task.

Here are some things i'm quite sure of:

  1. this function only gets used for the old way of calculating quantile residuals. So technically it only deals with continuous data, as integer values get jittered before being used.(Also it is very low priority.)
  2. It needs to be 0, 1 outside of the limits of x to keep with the definition of outlier.
  3. For the same reason the values at the limits can't be 0, 1.
  4. Continuity is probably a good idea.

Now the uncertainty comes from 2 other points:

  1. What does symmetric mean for an ecdf? It was propably just a throwaway line from you in the details of the current function.
  2. Should the function incorporate that, on average, 1/(nSim +1) probability-mass is outside of x's limits on either margin. You did by dividing by nSim +1 (well n +1, n = length(x)). Not sure if that's the correct way of doing it.

Having written it all out, i'm starting to think that it is best to have a ecdf from 1/(nSim +1) to 1 -1/(nSim +1) with all the steps replaced by line segments. It would certainly have simplicity on its side. I'll sleep on it.

That's it for today, but i also have a nice if not fully tested idea for testDispersion, which i'll write in #201 soonish.

Best, Lukas

P.S. Don't hold your breath on me having an idea for the outliertest. I got absolutely nothing.

Istalan avatar Oct 05 '20 22:10 Istalan