
Uninformative Error Message when Sample Size is 1

nsgrantham opened this issue on Feb 16, 2022 · 5 comments

I am encountering an issue where passing a Turing model of type DynamicPPL.Model to psis_loo() returns values of -Inf or NaN.

However, when I pass it a log-likelihood function and the data as an array of tuples, it behaves "correctly" (in the sense that it returns finite values, but I have not independently verified that they are accurate).

The following is a reproducible example.

using Distributions
using DataFrames
using LinearAlgebra  # for I in MvNormal(μ, σ^2 * I)
using Turing
using ParetoSmooth

function generate_persons(; n_years=1000, max_age=65, n_births=20, age_of_marriage=18)
    # https://github.com/rmcelreath/rethinking/blob/master/R/sim_happiness.R

    invlogit(x) = 1 / (1 + exp(-x))
    age = fill(1, n_births)
    married = fill(false, n_births)
    happiness = collect(range(-2, 2, length=n_births))

    for _ in 1:n_years
        age .+= 1
        append!(age, fill(1, n_births))
        append!(married, fill(false, n_births))
        append!(happiness, collect(range(-2, 2, length=n_births)))

        for i in eachindex(age, married, happiness)
            if age[i] >= age_of_marriage && !married[i]
                married[i] = rand(Bernoulli(invlogit(happiness[i] - 4)))
            end
        end

        deaths = findall(age .> max_age)
        deleteat!(age, deaths)
        deleteat!(married, deaths)
        deleteat!(happiness, deaths)
    end

    DataFrame(
        age = age,
        married = convert.(Int, married),
        happiness = happiness
    )
end

persons = generate_persons()
adults = persons[persons.age .> 17, :]
min_age, max_age = extrema(adults.age)
adults.age = (adults.age .- min_age) ./ (max_age - min_age)

@model function adult_happiness(h, a)
    α ~ Normal(0, 1)
    β ~ Normal(0, 2)
    σ ~ Exponential(1)
    μ = α .+ β .* a
    h ~ MvNormal(μ, σ^2 * I)  # scalar-σ MvNormal is deprecated in recent Distributions
end

model = adult_happiness(adults.happiness, adults.age)
post = sample(model, NUTS(), 10_000)

Now, reading through the code in src/TuringHelpers.jl in this repo, it seems as though there's a method that supports Turing models directly. However, if I run

psis_loo(model, post)

then I get

┌ Warning: Some Pareto k values are extremely high (>1). PSIS will not produce consistent estimates.
└ @ ParetoSmooth ~/.julia/packages/ParetoSmooth/agZDP/src/InternalHelpers.jl:43
Results of PSIS-LOO-CV with 10000 Monte Carlo samples and 1 data points. Total Monte Carlo SE of NaN.
┌───────────┬───────┬──────────┬───────┬─────────┐
│           │ total │ se_total │  mean │ se_mean │
├───────────┼───────┼──────────┼───────┼─────────┤
│   cv_elpd │  -Inf │      NaN │  -Inf │     NaN │
│ naive_lpd │  -Inf │      NaN │  -Inf │     NaN │
│     p_eff │   NaN │      NaN │   NaN │     NaN │
└───────────┴───────┴──────────┴───────┴─────────┘
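
For reference, here is a quick way to check where the single data point comes from. This is a sketch on my part, assuming DynamicPPL's pointwise_loglikelihoods (re-exported by Turing) accepts the chain directly:

# Pointwise log-likelihoods are keyed by the name of each observed variable.
# Because `h ~ MvNormal(μ, σ^2 * I)` declares `h` as one multivariate
# observation, there should be a single key — hence "1 data points".
pll = pointwise_loglikelihoods(model, post)
collect(keys(pll))  # expected: ["h"]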

Instead, I can define the appropriate log-likelihood function and provide the data in an array of tuples:

# Log-likelihood of a single (happiness, age) observation
function ll_fun(α, β, σ, data)
    h, a = data
    logpdf(Normal(α + β * a, σ), h)
end

psis_loo(ll_fun, post, collect(zip(adults.happiness, adults.age)))

This yields something that seems correct:

[ Info: No source provided for samples; variables are assumed to be from a Markov Chain. If the samples are independent, specify this with keyword argument `source=:other`.
Results of PSIS-LOO-CV with 10000 Monte Carlo samples and 960 data points. Total Monte Carlo SE of 0.019.
┌───────────┬──────────┬──────────┬───────┬─────────┐
│           │    total │ se_total │  mean │ se_mean │
├───────────┼──────────┼──────────┼───────┼─────────┤
│   cv_elpd │ -1551.04 │    13.81 │ -1.62 │    0.01 │
│ naive_lpd │ -1548.62 │    13.75 │ -1.61 │    0.01 │
│     p_eff │     2.42 │     0.08 │  0.00 │    0.00 │
└───────────┴──────────┴──────────┴───────┴─────────┘

I would expect these two approaches to yield the same result.

Am I simply using psis_loo() incorrectly when passing it a Turing model of type DynamicPPL.Model? The message "Results of PSIS-LOO-CV with 10000 Monte Carlo samples and 1 data points" suggests that it does not recognize that the model contains multiple data points.
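
If that is the cause, a possible workaround (a sketch I have not verified against TuringHelpers.jl; names like adult_happiness_pointwise, model_pw, and post_pw are mine) is to write the likelihood observation-by-observation, so that each row is its own data point:

@model function adult_happiness_pointwise(h, a)
    α ~ Normal(0, 1)
    β ~ Normal(0, 2)
    σ ~ Exponential(1)
    # Declaring each h[i] separately gives one log-likelihood entry per
    # observation, so psis_loo should see 960 data points instead of 1.
    for i in eachindex(h, a)
        h[i] ~ Normal(α + β * a[i], σ)
    end
end

model_pw = adult_happiness_pointwise(adults.happiness, adults.age)
post_pw = sample(model_pw, NUTS(), 10_000)
psis_loo(model_pw, post_pw)  # should now report 960 data points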

For now I will continue to use the ll_fun approach, but I wanted to call this out here in case something in TuringHelpers.jl needs a closer look.

nsgrantham · Feb 16, 2022