ApproximateGPs.jl Benchmarking `expected

So with 1.7 (maybe 1.6 too?), there is an issue with Zygote gradients and the expec_loglik function because of it hits a BLAS function. I tried to rewrite it to make fewer allocations and here are the results:

using BenchmarkTools, Distributions, ApproximateGPs, IrrationalConstants, FastGaussQuadrature
# The new proposed version
function expected_loglik(
    gh::GaussHermite, y::AbstractVector, q_f::AbstractVector{<:Normal}, lik
)
    xs, ws = gausshermite(gh.n_points)
    return mapreduce(+, q_f, y) do q, y
        μ = mean(q)
        σ = std(q)
        mapreduce(+, xs, ws) do x, w
            f = sqrt2 * σ * x + μ
            loglikelihood(lik(f), y) * w
        end
    end / sqrtπ
end
# The previous version
function expected_loglik_old(
    gh::GaussHermite, y::AbstractVector, q_f::AbstractVector{<:Normal}, lik
)
    xs, ws = gausshermite(gh.n_points)
    fs = sqrt2 * std.(q_f) .* xs' .+ mean.(q_f)
    lls = loglikelihood.(lik.(fs), y)
    return sum(lls * ws) / √π
end
function evaluate_speed(N)
  gh = GaussHermite(100)
  lik = BernoulliLikelihood()
  y = rand(0:1, N)
  q_f = Normal.(randn(N), rand(N))
  @btime expected_loglik($gh, $y, $q_f, $lik)
  @btime expected_loglik_old($gh, $y, $q_f, $lik)
end
for N in [10, 100, 500, 1000]
  @info N
  evaluate_speed(N)
end
[ Info: 10
  164.407 μs (192 allocations: 44.45 KiB)
  139.752 μs (87 allocations: 48.83 KiB)
[ Info: 100
  396.769 μs (1002 allocations: 146.42 KiB)
  312.959 μs (89 allocations: 191.52 KiB)
[ Info: 500
  1.433 ms (4602 allocations: 599.61 KiB)
  1.027 ms (89 allocations: 826.08 KiB)
[ Info: 1000
  2.736 ms (9102 allocations: 1.14 MiB)
  1.972 ms (89 allocations: 1.58 MiB)

So the old approach is faster but make bigger allocations, I actually don't know where all this allocations come from for the first approach, any clue?

Dec 01 '21 19:12 theogf

Bigger allocations, but less of them (and only a constant number). Few large allocations is more efficient than lots of small ones, I suppose!

Dec 01 '21 20:12 st--

It might be helpful to see how it scales with number of Gauss-Hermite points, 10 vs 100 vs 1000...

Dec 01 '21 20:12 st--

mapreduce with multiple arrays just computes the resulting array up-front: https://github.com/JuliaLang/julia/blob/d16f4806e9389dbc92c463efc5b96f67a7aebf22/base/reducedim.jl#L324-L325 (added in https://github.com/JuliaLang/julia/pull/31532)

I think a better approach (that seems to break AD with basically all backends though according to my experiments in Distributions) is

sum(Broadcast.instantiate(Broadcast.broadcasted(op, args...)))

which also uses pairwise summation and is fast in recent Julia versions (https://github.com/JuliaLang/julia/pull/31020).

Dec 01 '21 20:12 devmotion

So with 1.7 (maybe 1.6 too?), there is an issue with Zygote gradients and the expec_loglik function

I'm not familiar with this package but this sounds like it could (or maybe should) be fixed in Zygote or with a ChainRule definition?

Dec 01 '21 20:12 devmotion

The really weird thing is that it seems to happen on a Matrix/Vector product...

Dec 01 '21 20:12 theogf

(Completely unrelated: You could use invsqrtπ instead of / √π or / sqrtπ)

Dec 01 '21 20:12 devmotion

SciML noticed the same Zygote issue it seems: https://github.com/SciML/DiffEqSensitivity.jl/pull/511#issuecomment-984434332

Dec 02 '21 11:12 devmotion

Follow up: I tried @devmotion proposition with broadcasted:

function expected_loglik_david(
    gh::GaussHermite, y::AbstractVector, q_f::AbstractVector{<:Normal}, lik
)
    xs, ws = gausshermite(gh.n_points)
    return sum(Broadcast.instantiate(
            Broadcast.broadcasted(q_f, y) do q, y
                μ = mean(q)
                σ = std(q)
                sum(Broadcast.instantiate(
                    Broadcast.broadcasted(xs, ws) do x, w
                        f = sqrt2 * σ * x + μ
                        loglikelihood(lik(f), y) * w
                    end
                ))
            end
        )) * invsqrtπ
end

And it definitely improves a lot! Yet the linear algebra is always the fastest

# David solution is the middle one
[ Info: 10
  165.688 μs (188 allocations: 44.08 KiB)
  158.399 μs (88 allocations: 32.47 KiB)
  154.454 μs (87 allocations: 48.55 KiB)
[ Info: 100
  411.624 μs (998 allocations: 146.06 KiB)
  351.009 μs (88 allocations: 32.47 KiB)
  307.118 μs (89 allocations: 191.22 KiB)
[ Info: 500
  1.505 ms (4598 allocations: 599.25 KiB)
  1.206 ms (88 allocations: 32.47 KiB)
  1.234 ms (89 allocations: 825.78 KiB)
[ Info: 1000
  2.882 ms (9098 allocations: 1.14 MiB)
  2.277 ms (88 allocations: 32.47 KiB)
  2.132 ms (89 allocations: 1.58 MiB)

Dec 02 '21 13:12 theogf

Now regarding AD, with Julia 1.7, Zygote 0.6.32, everything seems to pass

## Checking differentiation
using Zygote
N = 100
gh = GaussHermite(100)
lik = BernoulliLikelihood()
y = rand(0:1, N)
μ = randn(N)
σ = rand(N)
for f in [expected_loglik, expected_loglik_david, expected_loglik_old]
    g = only(Zygote.gradient(x->f(gh, y, Normal.(x, σ), lik), μ))
end

Dec 02 '21 13:12 theogf

Yet the linear algebra is always the fastest

It seems broadcasted is the fastest for N = 500 and best allocation-wise in all examples?

Dec 02 '21 20:12 devmotion

I'll make a PR to use broadcasted instead.

Jan 11 '22 16:01 theogf

Benchmarking `expected_loglik`