
Improper initial values when supports of the prior distribution are themselves random

Open bdeonovic opened this issue 5 years ago • 27 comments

While trying out my first example in Turing I ran into an issue (see discussion here: https://discourse.julialang.org/t/help-with-first-non-trivial-turing-example/38964/14)

Apparently Turing does not provide proper initial values when supports of the prior distribution are themselves random. Below is an example in which the issue arises. Note that the example may have other issues as well:

# Import Turing and Distributions.
using Turing, Distributions

n = 1000
m = 10
k = 4
theta = randn(n)
b = zeros(k,m)
for i in 1:m
    b[1,i] = randn()
    for j in 2:k
        dd = truncated(Normal(), b[j-1,i], Inf)
        b[j,i] = rand(dd)
    end
end

logit = x -> log(x / (1 - x))
invlogit = x -> exp(x)/(1 + exp(x))
y = zeros(m,n)
probs = zeros(k,m,n)
for p in 1:n
    for i in 1:m
        probs[1,i,p] = 1.0
        for j in 1:(k-1)
            Q = invlogit(theta[p] - b[j,i])
            probs[j,i,p] -= Q
            probs[j+1,i,p] = Q
        end
        y[i,p] = rand(Categorical(probs[:,i,p]))
    end
end



# Graded Response Model
@model function grm(y, n, m, k, ::Type{TC}=Array{Float64,3}, ::Type{TM}=Array{Float64,2}, ::Type{TV}=Vector{Float64}) where {TC, TM, TV}
    b = TM(undef, k, m)
    for i in 1:m
        b[1,i] ~ Normal(0,1)
        for j in 2:k
            b[j,i] ~ truncated(Normal(0,1), b[j-1,i], Inf) 
        end
    end
    probs = TC(undef, k, m, n)
    theta = TV(undef, n)
    for p in 1:n
        theta[p] ~ Normal(0,1)
        for i in 1:m
            probs[1,i,p] = 1.0
            for j in 1:(k-1)
                Q = invlogit(theta[p] - b[j,i])
                probs[j,i,p] -= Q
                probs[j+1,i,p] = Q
            end
            y[i,p] ~ Categorical(probs[:,i,p])
        end
    end
    return theta, b
end;

result = sample(grm(y, n, m, k), MH(), 1)

With error

Sampling: 100%|████████████████████████████████████████| Time: 0:00:01
ERROR: ArgumentError: Categorical: the condition isprobvec(p) is not satisfied.

Here probs[:,i,p] should be guaranteed to be a probability vector as long as the b[:,i] are in ascending order. If we put in some debugging print statements, we find that at times the b[:,i] are not in ascending order.
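
For example, a small helper along these lines (just a sketch, not in the original code) makes the ordering requirement explicit:

# Hypothetical debugging helper: warn whenever a column of thresholds is not
# ascending, which is exactly what makes Categorical(probs[:,i,p]) fail.
function check_thresholds(b::AbstractMatrix)
    for i in 1:size(b, 2)
        issorted(b[:, i]) || println("b[:, ", i, "] is not ascending: ", b[:, i])
    end
end

check_thresholds(b)   # always fine for the simulated data; inside the model it can fail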

bdeonovic avatar May 07 '20 20:05 bdeonovic

Thanks @bdeonovic for opening this. Fixing this issue requires a non-trivial redesign of parts of DynamicPPL. I am on it though.

mohdibntarek avatar May 09 '20 11:05 mohdibntarek

Btw this particular case can be worked around using a custom distribution and a custom bijector:

using Distributions, Bijectors

struct MyDistribution <: ContinuousMultivariateDistribution
    n::Int
end

function Base.rand(d::MyDistribution)
    b = zeros(d.n)
    b[1] = rand(Normal(0,1))
    for i in 2:d.n
        b[i] = rand(truncated(Normal(0,1), b[i-1], Inf))
    end
    return b
end
function Distributions.logpdf(d::MyDistribution, b::AbstractVector)
    l = logpdf(Normal(0,1), b[1])
    for i in 2:d.n
        l += logpdf(truncated(Normal(0,1), b[i-1], Inf), b[i])
    end
    return l
end

struct MyBijector <: Bijectors.Bijector{1} end
function (b::MyBijector)(x::AbstractVector)
    y = similar(x)
    y[1] = x[1]
    for i in 2:length(x)
        y[i] = bijector(truncated(Normal(0,1), x[i-1], Inf))(x[i])
    end
    return y
end
function (b::Inverse{<:MyBijector})(y::AbstractVector)
    x = similar(y)
    x[1] = y[1]
    for i in 2:length(y)
        x[i] = inv(bijector(truncated(Normal(0,1), x[i-1], Inf)))(y[i])
    end
    return x
end
function Bijectors.logabsdetjac(b::MyBijector, x::AbstractVector)
    l = float(zero(eltype(x)))
    for i in 2:length(x)
        l += logabsdetjac(bijector(truncated(Normal(0,1), x[i-1], Inf)), x[i])
    end
    return l
end
Bijectors.bijector(d::MyDistribution) = MyBijector()
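
A quick way to sanity-check these definitions (just a sketch, assuming the code above has been run) is to confirm that the bijector round-trips a draw from the prior:

# Round-trip check for the custom distribution and bijector (sketch).
d = MyDistribution(4)
x = rand(d)                  # an ascending draw from the prior
y = MyBijector()(x)          # map to unconstrained space
x2 = inv(MyBijector())(y)    # map back to the ordered space
x ≈ x2                       # should be true
logpdf(d, x)                 # should be finite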

mohdibntarek avatar May 09 '20 15:05 mohdibntarek

Thanks @mohamed82008, could you point me towards some material on how to use the defined custom distribution and bijector to sample from the posterior distribution?

bdeonovic avatar May 11 '20 19:05 bdeonovic

Once you have the above definitions you can use MyDistribution as a prior in the model like any other distribution. So your model becomes:

# Graded Response Model
@model function grm(y, n, m, k, ::Type{TC}=Array{Float64,3}, ::Type{TM}=Array{Float64,2}, ::Type{TV}=Vector{Float64}) where {TC, TM, TV}
    b = TM(undef, k, m)
    for i in 1:m
        b[:,i] ~ MyDistribution(k)
    end
    probs = TC(undef, k, m, n)
    theta = TV(undef, n)
    for p in 1:n
        theta[p] ~ Normal(0,1)
        for i in 1:m
            probs[1,i,p] = 1.0
            for j in 1:(k-1)
                Q = invlogit(theta[p] - b[j,i])
                probs[j,i,p] -= Q
                probs[j+1,i,p] = Q
            end
            y[i,p] ~ Categorical(probs[:,i,p])
        end
    end
    return theta, b
end;

mohdibntarek avatar May 12 '20 09:05 mohdibntarek

Currently getting this error:

julia> result = sample(grm(y, n, m, k), MH(), 10)
ERROR: MethodError: no method matching reconstruct(::Array{MyDistribution,1}, ::Array{Float64,1})

If I try another sampler (like HMC) it doesn't budge from the initial values:

julia> result = sample(grm(y, n, m, k), HMC(0.2, 3), 10)
┌ Warning: The current proposal will be rejected due to numerical error(s).
│   isfinite.((θ, r, ℓπ, ℓκ)) = (true, false, false, false)
└ @ AdvancedHMC C:\Users\Benjamin\.julia\packages\AdvancedHMC\WJCQA\src\hamiltonian.jl:47

bdeonovic avatar May 12 '20 14:05 bdeonovic

This works for me:


using Turing
using Distributions, Bijectors

struct MyDistribution <: ContinuousMultivariateDistribution
    n::Int
end

function Base.rand(d::MyDistribution)
    b = zeros(d.n)
    b[1] = rand(Normal(0,1))
    for i in 2:d.n
        b[i] = rand(truncated(Normal(0,1), b[i-1], Inf))
    end
    return b
end
function Distributions.logpdf(d::MyDistribution, b::AbstractVector)
    l = logpdf(Normal(0,1), b[1])
    for i in 2:d.n
        l += logpdf(truncated(Normal(0,1), b[i-1], Inf), b[i])
    end
    return l
end

struct MyBijector <: Bijectors.Bijector{1} end
function (b::MyBijector)(x::AbstractVector)
    y = similar(x)
    y[1] = x[1]
    for i in 2:length(x)
        y[i] = bijector(truncated(Normal(0,1), x[i-1], Inf))(x[i])
    end
    return y
end
function (b::Inverse{<:MyBijector})(y::AbstractVector)
    x = similar(y)
    x[1] = y[1]
    for i in 2:length(y)
        x[i] = inv(bijector(truncated(Normal(0,1), x[i-1], Inf)))(y[i])
    end
    return x
end
function Bijectors.logabsdetjac(b::MyBijector, x::AbstractVector)
    l = float(zero(eltype(x)))
    for i in 2:length(x)
        l += logabsdetjac(bijector(truncated(Normal(0,1), x[i-1], Inf)), x[i])
    end
    return l
end
Bijectors.bijector(d::MyDistribution) = MyBijector()

n = 10
m = 10
k = 4
theta = randn(n)
b = zeros(k,m)
for i in 1:m
    b[1,i] = randn()
    for j in 2:k
        dd = truncated(Normal(), b[j-1,i], Inf)
        b[j,i] = rand(dd)
    end
end

logit = x -> log(x / (1 - x))
invlogit = x -> exp(x)/(1 + exp(x))
y = zeros(m,n)
probs = zeros(k,m,n)
for p in 1:n
    for i in 1:m
        probs[1,i,p] = 1.0
        for j in 1:(k-1)
            Q = invlogit(theta[p] - b[j,i])
            probs[j,i,p] -= Q
            probs[j+1,i,p] = Q
        end
        y[i,p] = rand(Categorical(probs[:,i,p]))
    end
end

# Graded Response Model
@model function grm(y, n, m, k, ::Type{TC}=Array{Float64,3}, ::Type{TM}=Array{Float64,2}, ::Type{TV}=Vector{Float64}) where {TC, TM, TV}
    b = TM(undef, k, m)
    for i in 1:m
        b[:,i] ~ MyDistribution(k)
    end
    probs = TC(undef, k, m, n)
    theta = TV(undef, n)
    for p in 1:n
        theta[p] ~ Normal(0,1)
        for i in 1:m
            probs[1,i,p] = 1.0
            for j in 1:(k-1)
                Q = invlogit(theta[p] - b[j,i])
                probs[j,i,p] -= Q
                probs[j+1,i,p] = Q
            end
            y[i,p] ~ Categorical(probs[:,i,p])
        end
    end
    return theta, b
end;

result = sample(grm(y, n, m, k), HMC(0.01, 4), 100)
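
Once it finishes you can inspect the chain as usual (exact printout depends on the MCMCChains version you have):

# Posterior summary statistics and quantiles for theta and b.
describe(result)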

mohdibntarek avatar May 12 '20 15:05 mohdibntarek

Thanks a lot @mohamed82008! I've simplified the code a bit (and allowed the categories to be variable for each item). The following code is a bit shorter. Only problem is that it is a bit slow to sample. Any suggestions?

# Import Turing and Distributions.
using Turing, Distributions, Bijectors

struct MyDistribution <: ContinuousMultivariateDistribution
    n::Int
end

function Distributions.length(d::MyDistribution)
    d.n
end

function Base.rand(d::MyDistribution)
    b = zeros(d.n)
    b[1] = rand(Normal(0,1))
    for i in 2:d.n
        b[i] = rand(truncated(Normal(0,1), b[i-1], Inf))
    end
    return b
end
function Distributions._logpdf(d::MyDistribution, b::AbstractVector)
    l = logpdf(Normal(0,1), b[1])
    for i in 2:d.n
        l += logpdf(truncated(Normal(0,1), b[i-1], Inf), b[i])
    end
    return l
end

struct MyBijector <: Bijectors.Bijector{1} end
function (b::MyBijector)(x::AbstractVector)
	y = similar(x)
	y[1] = x[1]
	for i in 2:length(x)
		y[i] = bijector(truncated(Normal(0,1), x[i-1], Inf))(x[i])
	end
	return y
end
function (b::Inverse{<:MyBijector})(y::AbstractVector)
	x = similar(y)
	x[1] = y[1]
	for i in 2:length(y)
		x[i] = inv(bijector(truncated(Normal(0,1), x[i-1], Inf)))(y[i])
	end
	return x
end
function Bijectors.logabsdetjac(b::MyBijector, x::AbstractVector)
	l = float(zero(eltype(x)))
	for i in 2:length(x)
		l += logabsdetjac(bijector(truncated(Normal(0,1), x[i-1], Inf)), x[i])
	end
	return l
end
Bijectors.bijector(d::MyDistribution) = MyBijector()

n = 1000
m = 3
ks = [3,4,5]
theta = randn(n)
b = Array{Float64,1}.(undef, ks .- 1)
for i in 1:m
    b[i] .= rand(MyDistribution(ks[i]-1))
end

y = zeros(m,n)
for p in 1:n
    for i in 1:m
       y[i,p] = rand(OrderedLogistic(theta[p], b[i]))
    end
end

# Graded Response Model
@model function grm(y, n, m, ks, ::Type{TV}=Vector{Float64}) where {TV}
    b = TV.(undef, ks .- 1)
    for i in 1:m
      b[i] ~ MyDistribution(ks[i]-1)
    end
    theta = TV(undef, n)
    for p in 1:n
        theta[p] ~ Normal(0,1)
        for i in 1:m
            y[i,p] ~ OrderedLogistic(theta[p], b[i])
        end
    end
    return theta, b
end;

result = sample(grm(y, n, m, ks), HMC(0.01,4), 10);

bdeonovic avatar May 12 '20 17:05 bdeonovic

Try:

using ReverseDiff, Memoization
Turing.setadbackend(:reversediff)
Turing.setrdcache(true)

before the sample line.

mohdibntarek avatar May 12 '20 17:05 mohdibntarek

~~Hmm actually the caching may not work well with OrderedLogistic because it has branches inside. So ignore the Turing.setrdcache(true) and if you set it true already, set it back to false.~~ See the correction below.

mohdibntarek avatar May 12 '20 17:05 mohdibntarek

Alternatively, we can define a ReverseDiff custom adjoint for the logpdf of OrderedLogistic and then caching should work.

mohdibntarek avatar May 12 '20 17:05 mohdibntarek

Hmm on third thought, caching should work fine because the branch taken by the logpdf function for each loop iteration is the same each time the function is called because y is fixed. So ignore my previous comment :)

mohdibntarek avatar May 12 '20 17:05 mohdibntarek

You may need to call Turing.emptyrdcache() when changing y though.
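
For instance (a sketch, assuming the model and data-generating code above have already been run):

# Regenerate the observations, clear the ReverseDiff tape cache, then resample.
y2 = zeros(m, n)
for p in 1:n, i in 1:m
    y2[i, p] = rand(OrderedLogistic(theta[p], b[i]))
end
Turing.emptyrdcache()
result2 = sample(grm(y2, n, m, ks), HMC(0.01, 4), 10)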

mohdibntarek avatar May 12 '20 17:05 mohdibntarek

I just tried it... I tabbed over to read about something else and when I looked back the sampling was already done! I'm rerunning it out of disbelief.

bdeonovic avatar May 12 '20 17:05 bdeonovic

Hahaha that's ReverseDiff for you! Check the results against other backends for a small data case before committing fully.

mohdibntarek avatar May 12 '20 17:05 mohdibntarek

Always good to run some sanity tests :)

mohdibntarek avatar May 12 '20 17:05 mohdibntarek

One more thing: for the bijector, if I wanted MyDistribution to take a "base" distribution as another parameter

struct MyDistribution <: ContinuousMultivariateDistribution
  d::ContinuousUnivariateDistribution  
  n::Int
end

(rather than have a Normal(0,1) hardcoded everywhere) how would I modify the inverse bijector properly? I think I figured out the other parts.

bdeonovic avatar May 12 '20 18:05 bdeonovic

You want to store that base distribution in your custom bijector. Then instead of bijector(truncated(Normal(0, 1), x, Inf)) do bijector(truncated(b.dist, x, Inf)). Make sure you use type parameters for performance though.

mohdibntarek avatar May 12 '20 18:05 mohdibntarek

The bijector definition makes sense to me, but I don't understand how to make the inverse work

struct OrderedNormalBijector <: Bijectors.Bijector{1} 
    d::ContinuousUnivariateDistribution
end

function (b::OrderedNormalBijector)(x::AbstractVector)
	y = similar(x)
	y[1] = x[1]
	for i in 2:length(x)
		y[i] = bijector(truncated(b.d, x[i-1], Inf))(x[i])
	end
	return y
end
function (b::Inverse{<:OrderedNormalBijector})(y::AbstractVector)
	x = similar(y)
	x[1] = y[1]
	for i in 2:length(y)
		x[i] = inv(bijector(truncated(b.d, x[i-1], Inf)))(y[i]) ## !!!! ???
	end
	return x
end

bdeonovic avatar May 12 '20 18:05 bdeonovic

Yes that's correct. But use a type parameter for the d field.

mohdibntarek avatar May 12 '20 18:05 mohdibntarek

Sorry, I'm not sure I'm following. I think I understand what you mean by parameterizing the bijector (i.e. something like)

struct OrderedNormalBijector{T <: ContinuousUnivariateDistribution} <: Bijectors.Bijector{1} 
    d::T
end

but in the inverse function how do I reference this distribution? In the inverse function, b doesn't have a d field.

bdeonovic avatar May 12 '20 18:05 bdeonovic

Oh sorry it should be b.orig.d.
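
So the inverse from your earlier snippet becomes roughly this (a sketch, untested, assuming the parameterized OrderedNormalBijector struct you posted):

function (ib::Inverse{<:OrderedNormalBijector})(y::AbstractVector)
    d = ib.orig.d   # the base distribution stored in the original (forward) bijector
    x = similar(y)
    x[1] = y[1]
    for i in 2:length(y)
        x[i] = inv(bijector(truncated(d, x[i-1], Inf)))(y[i])
    end
    return x
end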

mohdibntarek avatar May 12 '20 18:05 mohdibntarek

@bdeonovic it would be nice if you could write a blog post or something about this. This is an important side of Turing, custom distributions and bijectors, that I think deserves more attention :) If you have no time, that's ok, just thought it was worth asking :)

mohdibntarek avatar May 12 '20 18:05 mohdibntarek

I would love to as soon as I get everything sorted out and looking nice :)

I've never seen this orig notation; where can I read more about it?

BTW thanks for all of your help! I'm really impressed with Turing. When I was doing my PhD I helped develop one of the earlier MCMC packages in Julia (Mamba.jl), but Turing is so much more flexible! Really impressive.

bdeonovic avatar May 12 '20 18:05 bdeonovic

Well, inv(b::Inverse) is equivalent to b.orig, so maybe use the former if you don't want to access internal field names. The bijector type Inverse just stores the original bijector in a field called orig.
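
In other words (a small sketch):

b  = MyBijector()
ib = inv(b)        # wraps b in an Inverse
ib.orig === b      # should be true: the original bijector is kept in the orig field
inv(ib) === b      # should be true: inverting an Inverse gives back the original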

mohdibntarek avatar May 12 '20 18:05 mohdibntarek

Thanks for your kind words :)

mohdibntarek avatar May 12 '20 18:05 mohdibntarek

Btw I am almost done with a more "proper" fix for this stochastic support issue but it does add a fair bit of complexity to the internals. So we still need to decide whether we should have this fix or just error when we detect a change in the support and refer the user to the custom distribution and bijector trick.

mohdibntarek avatar May 12 '20 18:05 mohdibntarek

For some reason this issue only came to my attention now, and I just wanted to add that defining such a Bijector should be trivial in the future when we have CouplingLayer from https://github.com/TuringLang/Bijectors.jl/pull/48. With that PR the implementation is just

function Bijectors.bijector(d::MyDistribution)
    n = length(d)

    bs = map(2:n) do i
        # 1. use `x[i - 1]` to parameterize transformation of `x[i]`
        m = PartitionMask(n, [i], [i - 1])
        # 2. Create the coupling transformation
        CouplingLayer(x -> TruncatedBijector(first(x), Inf), m)
    end

    # 3. Since we want to parameterize the transformation of `x[i]` using `x[i - 1]`,
    # in contrast to using `y[i - 1]`, we apply from highest-to-lowest, e.g.
    #     1. Y₁ := [x₁, …, xₙ₋₁, yₙ] = bs[n](x)
    #     2. Y₂ := [x₁, …, yₙ₋₁, yₙ] = bs[n - 1](Y₁)
    #     ...
    #     n. Yₙ := [y₁, …, yₙ₋₁, yₙ] = bs[1](Yₙ₋₁)
    # And `Composed` will apply from left-to-right, so we have to `reverse` `bs` to
    # get the above behavior.
    return Composed(reverse(bs))
end

then

julia> d = MyDistribution(100);

julia> b = bijector(d);

julia> x = rand(d);

julia> y = b(x);

julia> inv(b)(y) ≈ x
true

julia> x[1:5]
5-element Array{Float64,1}:
 -1.0337803916080368
 -0.9118248542646143
 -0.5002510997246266
  1.4872200369231863
  1.5833536203613594

julia> y[1:5] # notice `y[1]` unchanged as desired
5-element Array{Float64,1}:
 -1.0337803916080368
 -2.1040987486809284
 -0.8877670415353908
  0.6868630449995462
 -2.3420165605974854

I'm not suggesting using this ongoing work instead of the fix suggested earlier in the issue; best to use the currently working approach :)

torfjelde avatar May 18 '20 00:05 torfjelde