
extend gpdfit to return marginal posterior of k #186


Addresses #186: @jgabry @avehtari

Extended gpdfit() to return the marginal posterior density of k in addition to k-hat and sigma-hat. We could also create a separate function, but this seemed like the better approach to me.

  1. Updated the profile log-likelihood calculation to use an outer product (%o%) instead of vapply(), which made it more efficient (see the sketch after this list).
# switched to
k <- matrixStats::rowMeans2(log1p(-theta %o% x))
N * (log(-theta / k) - k - 1)

# instead of
a <- -theta
k <- vapply(a, FUN = function(a_i) mean(log1p(a_i * x)), FUN.VALUE = numeric(1))
N * (log(a / k) - k - 1)
  2. Dropped the lx() function and moved its updated contents into gpdfit() to avoid repeated calculations, since the code example in the issue thread computes the k values twice:
# first (within the `lx()` function)
l_theta <- N * lx(theta, x) # profile log-lik

# second 
ks <- numeric(length=M)
for (i in 1:M) {
  ks[i] <- mean.default(log1p(-theta[i] * x))
}
  3. Following the notation of Zhang, J., and Stephens, M. A. (2009), suggested new names for the gpdfit() output components:
# switched to
nlist(k_hat, sigma_hat, k, w_theta)

# instead of
nlist(k, sigma)
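
For reference, a small self-contained sketch with toy data (not part of the PR) showing that the two formulations agree and that k is computed once and reused, as mentioned in the list above:

set.seed(1)
x <- sort(exp(rnorm(50)))                           # toy positive sample, ascending
N <- length(x)
theta <- seq(0.01, 0.9, length.out = 20) / max(x)   # keeps theta * max(x) < 1

# outer-product version: k computed once and reused
k <- matrixStats::rowMeans2(log1p(-theta %o% x))
l_theta <- N * (log(-theta / k) - k - 1)            # profile log-likelihood

# vapply version, for comparison
a <- -theta
k_v <- vapply(a, FUN = function(a_i) mean(log1p(a_i * x)), FUN.VALUE = numeric(1))
stopifnot(all.equal(k, k_v))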

This name change shouldn't cause backward-compatibility issues, but it's not critical either way; I'll use whichever names you think are more appropriate.

ozanstats avatar Oct 10 '21 09:10 ozanstats

@avehtari - I can update the function with normalization term estimation using the trapezoidal rule once you decide what to return. Below are some options:

# normalization term estimation with the trapezoidal rule
c <- 2 / sum((w_theta[-M] + w_theta[-1]) * (k[-M] - k[-1]))

# normalized densities
d_theta <- c * w_theta

# some return options:
nlist(k_hat, sigma_hat, k, w_theta) # normalized quadrature weights
nlist(k_hat, sigma_hat, k, d_theta) # normalized densities
nlist(k_hat, sigma_hat, k, w_theta, c) # normalized quadrature weights + density normalization term
nlist(k_hat, sigma_hat, k, w_theta, d_theta) # normalized quadrature weights + normalized densities
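
As a sanity check of the arithmetic, with a made-up grid standing in for gpdfit() output, the normalized values integrate to 1 under the same trapezoidal rule:

k <- seq(1, -0.25, length.out = 50)        # decreasing, so k[-M] - k[-1] > 0 as above
w_theta <- dnorm(k, mean = 0.4, sd = 0.2)  # stand-in for the quadrature weights
w_theta <- w_theta / sum(w_theta)
M <- length(k)

c <- 2 / sum((w_theta[-M] + w_theta[-1]) * (k[-M] - k[-1]))
d_theta <- c * w_theta
sum((d_theta[-M] + d_theta[-1]) * (k[-M] - k[-1])) / 2
#> [1] 1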

ozanstats avatar Oct 10 '21 21:10 ozanstats

I noticed a small side effect of this recent commit: when the estimation fails, k-hat and sigma-hat are no longer necessarily Inf and NaN, respectively:

set.seed(147)
x <- exp(rnorm(10))

# CRAN version (without the commit)
gpdfit(x, sort_x = FALSE)
$k
[1] Inf

$sigma
[1] NaN

# GitHub version on master (with the commit)
gpdfit(x, sort_x = FALSE)
$k
[1] NA

$sigma
[1] NA

Should we avoid NAs and always return Inf and NaN here? I can also suppress the "NaNs produced" warnings that occur in these cases.
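
For the suppression, one standard option is withCallingHandlers(), muffling only the "NaNs produced" warning while letting other warnings through. A sketch, with log(-1) as a stand-in for the failing computation (the message check assumes an English locale):

res <- withCallingHandlers(
  log(-1),
  warning = function(w) {
    if (grepl("NaNs produced", conditionMessage(w))) {
      invokeRestart("muffleWarning")
    }
  }
)
res
#> [1] NaN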

ozanstats avatar Oct 10 '21 21:10 ozanstats

Should we avoid NAs and always return Inf and NaN here?

I would prefer khat=Inf when the failure is due to an extremely long tail. At least in the cases where I've seen Inf appear, the tail was very long.

avehtari avatar Oct 11 '21 18:10 avehtari

Thanks for the PR!

Should we avoid NAs and always return Inf and NaN here?

Yeah, I think Inf for k and NaN for sigma would be better, thanks!

Based on the discussion in the issue thread, we still need to decide whether we want to return normalized quadrature weights or normalized densities.

I haven't had a chance to think much about this. I'm ok with whatever @avehtari decides.

jgabry avatar Oct 11 '21 18:10 jgabry

Thinking about it for a moment, I believe normalized densities would make more sense; if someone would like normalized quadrature weights, it's easier to obtain them from normalized densities than vice versa.

avehtari avatar Oct 11 '21 19:10 avehtari

Quick question -- is the gpdfit() posterior actually a good one? The estimator is good, but because the prior is derived from the data, the posterior will be overconfident. This doesn't matter much if you just need k-hat, but I think you'd have to test how much the empirical prior contributes to the overconfidence.

ParadaCarleton avatar Dec 20 '21 00:12 ParadaCarleton

I have tested. With sample sizes typical for gpdfit() use and tail thicknesses in the interesting region (< 0.7), the gpdfit() approximation is very good, with no practical difference compared to the marginals of the full posterior computed with Stan. I guess I should add this comment to the PSIS paper revision.

Autumn has been very busy, but I hope to be able to go through some PRs and issues in a few weeks.

avehtari avatar Dec 20 '21 07:12 avehtari

Thanks to a question by @sethaxen, I realized that what are called densities above are not yet densities. The algorithm computes quadrature weights, and for those the most natural scale is normalized weights. To compute densities, it is not enough to compute the normalization term, because each quadrature point represents a different volume of the parameter space. I'll prepare a note with an illustration and code. We can then discuss what we should return and how to document clearly what we are returning.
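
To make the distinction concrete before that note, a tiny made-up numeric illustration: equal normalized weights at non-equidistant points do not imply a flat density, because each point represents a different volume.

k   <- c(0, 0.1, 0.3, 0.7)  # non-equidistant quadrature points
w   <- rep(1 / 4, 4)        # equal normalized quadrature weights
# volume represented by each point: midpoint spacing, half-intervals at the ends
vol <- c(0.05, 0.15, 0.30, 0.20)
w / vol                     # implied density heights differ across points
#> [1] 5.0000000 1.6666667 0.8333333 1.2500000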

avehtari avatar Jan 06 '22 16:01 avehtari

Here is an illustration: [image]

The yellow contours show the likelihood, the red dashed contours show the posterior, the blue line is the profile line, and the blue dots mark the quadrature points. w_theta are the normalized likelihood values at the quadrature points, and the effect of the prior comes from the non-equidistant spacing of those points. To plot the marginal posterior of k, we need to evaluate the posterior density at the quadrature points (or at any other evaluation points).

The following code change returns the quadrature weights k_w, which are specific to the quadrature points k, and the posterior densities k_d evaluated at those points:

  # quadrature weights for k are same as for theta
  k_w <- w_theta
  # quadrature weights are just the normalized likelihoods
  # we get the unnormalized posterior by multiplying these by the prior
  k_d <- k_w * dgpd(-theta, mu = -1 / x[N], sigma = 1 / prior / xstar, xi = 0.5)
  # normalize using the trapezoidal rule
  Z <- sum((k_d[-M] + k_d[-1]) * (k[-M] - k[-1])) / 2
  k_d <- k_d / Z

  nlist(k_hat, sigma_hat, k, k_w, k_d)

Here is the exact marginal posterior (red dashed) and the profile posterior (blue) corresponding to the figure above: [image]

The following plot shows comparisons, one data realization each, for M = 20, 100, 1000 and k = 0.3, 0.5, 0.7, 0.9: [image]

The exact marginal posterior is wider than the profile posterior, as expected, but the profile posterior is quite close and thus useful for presenting the uncertainty. In my experiments I used extraDistr::dgpd, but to avoid the dependency we should implement it in loo. Computing the prior added maybe 10% or less computation time, so the marginal approximation could also be moved to a separate function.
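
For reference, a minimal dgpd() sketch in the mu/sigma/xi parameterization used above (a sketch only, not the final loo implementation; extraDistr::dgpd is the reference to test against):

dgpd <- function(x, mu = 0, sigma = 1, xi = 0) {
  z <- (x - mu) / sigma
  if (abs(xi) < 1e-12) {
    # exponential limit as xi -> 0
    ifelse(z >= 0, exp(-z) / sigma, 0)
  } else {
    # support: z >= 0, and additionally 1 + xi * z > 0 when xi < 0
    ifelse(z >= 0 & (1 + xi * z) > 0,
           (1 + xi * z)^(-1 / xi - 1) / sigma,
           0)
  }
}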

avehtari avatar Jan 14 '22 16:01 avehtari

I have tested. With sample sizes typical for gpdfit() use and tail thicknesses in the interesting region (< 0.7), the gpdfit() approximation is very good, with no practical difference compared to the marginals of the full posterior computed with Stan. I guess I should add this comment to the PSIS paper revision.

Autumn has been very busy, but I hope to be able to go through some PRs and issues in a few weeks.

I see; does that hold even though the samples aren't IID? That would also tend to make the posterior too narrow. Perhaps "correcting" the log-likelihood by multiplying it by r_eff would fix that problem well enough in practice, though I'm not sure such an approach would be fully justified.

I also wonder how Zhang's method compares to a Laplace approximation of the posterior in terms of performance.

ParadaCarleton avatar Jan 14 '22 20:01 ParadaCarleton

I see; does that hold even though the samples aren't IID? That would also tend to make the posterior too narrow.

True. I don't think this is an issue, as the plan at the moment is just to provide some indication of the uncertainty. Of course, if you plan to use it for something that needs more accuracy, then follow the usual extreme value analysis procedure and thin the chains to approximately independent draws. For the Pareto-k diagnostic and PSIS, a too-narrow posterior does no harm as long as there is no big change in the posterior mean, which seems to hold based on experiments with dependencies typical for MCMC from Stan. If you are worried, then just thin.

I do also wonder how Zhang's method compares to Laplace approximation for finding the posterior in terms of performance.

Not a huge difference, but it doesn't matter, as the posterior is much more skewed than normal and the posterior mode is then clearly different from the posterior mean.

avehtari avatar Jan 17 '22 19:01 avehtari

  • tried to update gpdfit() to reflect the improvements @avehtari mentioned above
  • implemented the full suite of dgpd/pgpd/qgpd/rgpd() functions
  • cleaned up and condensed some internal helpers

@jgabry: the new functions pass my local checks and the existing tests, but it'd be great if you could also give them a try, perhaps with some fresh test cases.
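
For example, the kind of fresh checks I have in mind (hypothetical values, assuming the mu/sigma/xi signature used earlier in the thread and vectorized implementations):

set.seed(42)
p <- runif(5)

# quantile/CDF round trip should recover p
all.equal(pgpd(qgpd(p, mu = 0, sigma = 2, xi = 0.3), mu = 0, sigma = 2, xi = 0.3), p)

# for xi < 1 the sample mean should approach mu + sigma / (1 - xi)
x <- rgpd(1e5, mu = 0, sigma = 2, xi = 0.3)
mean(x)  # ~ 2 / 0.7

# the density should integrate to ~1
integrate(dgpd, lower = 0, upper = Inf, mu = 0, sigma = 2, xi = 0.3)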

ozanstats avatar Jan 27 '22 09:01 ozanstats

Sorry this has been left hanging for such a long time. After some consideration, I think we shouldn't change the existing output variable names (i.e., k -> k_hat, sigma -> sigma_hat), as the function may already be in use (at least I'm using it in various simulations), and the renaming would break that code.

I'm also thinking that maybe we should have a separate function for the marginal posterior. It is almost just a by-product, but it still adds a bit of computation and memory usage, so it may be better to keep the streamlined version for PSIS-LOO and have a separate function for when someone wants to investigate further. That would also make it possible to later add a computationally more intensive, but more accurate, marginal posterior computation.

avehtari avatar Apr 25 '22 07:04 avehtari

Adding a comment that this will very likely go into the posterior package, and there is no need to merge this into loo.

avehtari avatar Mar 24 '23 11:03 avehtari