Performance regression for BernoulliLogit
I was just playing around a bit with https://github.com/torfjelde/TuringBenchmarking.jl and noticed a sudden change in the runtimes described in the README: the example model is now 16x slower for gradient evaluation with ReverseDiff in compiled mode.
I eventually narrowed it down to #1892 being the cause, i.e. the performance of the following model:
```julia
@model function irt(y, i, p; I = maximum(i), P = maximum(p))
    theta ~ filldist(Normal(), P)
    beta ~ filldist(Normal(), I)
    Turing.@addlogprob! sum(logpdf.(BernoulliLogit.(theta[p] - beta[i]), y))
    return (; theta, beta)
end
```
absolutely tanks for ReverseDiff when we use the implementation of BernoulliLogit from Distributions.jl :confused:
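For context, a minimal setup to reproduce the benchmarks below might look like the following; the data is synthetic and the sizes are made up by me, not taken from the README:

```julia
using Turing, TuringBenchmarking

# Hypothetical IRT data: N responses from P persons to I items.
P, I, N = 1_000, 20, 10_000
p = rand(1:P, N)      # person index per response
i = rand(1:I, N)      # item index per response
y = rand(Bool, N)     # binary responses

model = irt(y, i, p)
suite = TuringBenchmarking.make_turing_suite(model)
run(suite)
```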
```
┌ Info: Turing.jl
│   run(suite) =
│    2-element BenchmarkTools.BenchmarkGroup:
│      tags: []
│      "linked" => 3-element BenchmarkTools.BenchmarkGroup:
│          tags: []
│          "evaluation" => Trial(1.333 ms)
│          "Turing.Essential.ReverseDiffAD{true}()" => Trial(1.752 ms)
│          "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(174.759 ms)
│      "not_linked" => 3-element BenchmarkTools.BenchmarkGroup:
│          tags: []
│          "evaluation" => Trial(1.339 ms)
│          "Turing.Essential.ReverseDiffAD{true}()" => Trial(1.796 ms)
└          "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(169.376 ms)
```
while on [email protected]:

```
┌ Info: Turing.jl
│   run(suite) =
│    2-element BenchmarkTools.BenchmarkGroup:
│      tags: []
│      "linked" => 3-element BenchmarkTools.BenchmarkGroup:
│          tags: []
│          "evaluation" => Trial(554.568 μs)
│          "Turing.Essential.ReverseDiffAD{true}()" => Trial(16.418 ms)
│          "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(140.508 ms)
│      "not_linked" => 3-element BenchmarkTools.BenchmarkGroup:
│          tags: []
│          "evaluation" => Trial(554.415 μs)
│          "Turing.Essential.ReverseDiffAD{true}()" => Trial(16.445 ms)
└          "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(139.849 ms)
```
Given that both plain evaluation and ForwardDiff are faster in the latter case, it's clearly an "issue" with ReverseDiff, but at the same time this is such a significant perf hit that it makes me a bit uncomfortable to just "leave it in" there :confused:
Thoughts? @devmotion
I don't have an immediate answer but generally the implementation in Distributions is much more specialized and hence more efficient than the previous BernoulliLogit in Turing (which fell back to BinomialLogit). So it seems there's a more ReverseDiff-specific issue here... Is there some type instability somewhere? Some type inference issue?
Do you know of a good way to check this for ReverseDiff?
It seems strange to me since AFAIK ForwardDiff is also used for broadcasting in ReverseDiff, no? So it's weird that ForwardDiff perf improves but ReverseDiff doesn't.
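As a first sanity check, one can at least verify that the scalar kernel itself infers; a minimal sketch:

```julia
using Distributions, Test

# `@inferred` throws if the return type of the call cannot be inferred,
# for the argument types actually used in the model.
@inferred logpdf(BernoulliLogit(0.3), true)
@inferred logpdf(BernoulliLogit(0.3), 1)
```

(This says nothing about inference through the tape execution itself, of course.)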
Maybe the branches in the new logpdf code kill performance with ReverseDiff?
That's what I was thinking too, so I tried the following impl, to no avail:
```julia
function Distributions.logpdf(d::BernoulliLogit, x::Real)
    return (1 - x) * Distributions.logfailprob(d) + x * Distributions.logsuccprob(d)
end
```
perf is still bad.
No, I meant without two calls of log1pexp. I.e., something like
```julia
logpdf(d::BernoulliLogit, x::Bool) = -log1pexp(x ? -d.logitp : d.logitp)
function logpdf(d::BernoulliLogit, x::Real)
    logitp = d.logitp
    z = -log1pexp(x == 0 ? logitp : -logitp)
    return insupport(d, x) ? z : oftype(z, -Inf)
end
```
Unfortunately doesn't help :confused:
😥 What happens if you implement the gradient of the logpdf function for ReverseDiff?
Or simpler: If you do not go through Distributions but define the logpdf directly as a separate function and broadcast that one?
I guess, for debugging it could also be useful to inspect the tape that ReverseDiff creates with the different implementations.
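For concreteness, a rough sketch of what implementing the gradient for ReverseDiff could look like, assuming the usual `ReverseDiff.@grad` pattern (the kernel name here is hypothetical):

```julia
using ReverseDiff
using ReverseDiff: @grad, TrackedReal, track, value
using LogExpFunctions: log1pexp, logistic

# Hypothetical scalar kernel with a hand-written adjoint.
bernoullilogit_logpdf(logitp::Real, x::Real) = -log1pexp(x == 0 ? logitp : -logitp)
bernoullilogit_logpdf(logitp::TrackedReal, x::Real) = track(bernoullilogit_logpdf, logitp, x)
@grad function bernoullilogit_logpdf(logitp, x)
    lp = value(logitp)
    # d/dlogitp of the Bernoulli logpdf is x - logistic(logitp).
    return bernoullilogit_logpdf(lp, x), Δ -> (Δ * (x - logistic(lp)), nothing)
end
```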
Just commenting to let you know I've seen the comments and I'm planning on having a go at it at some point, but right now I have some more pressing TODOs so need to put this on the backlog for a bit :confused:
> I guess, for debugging it could also be useful to inspect the tape that ReverseDiff creates with the different implementations.
I wrote a small package to check this (https://github.com/torfjelde/ReverseDiffDebugUtils.jl) and AFAICT, they're the same :confused:
[email protected] (which runs in ~1.4 ms):

(tape screenshot)

[email protected] (which runs in ~18 ms):

(tape screenshot)
So seems like it has to be something in the reverse pass?
EDIT: Well, if they're the same or not is of course dependent on whether the broadcast instructions are actually broadcasting the same functions, which they of course aren't :facepalm:
EDIT 2: Added hacky capability of inferring the broadcasted functions, and they're indeed the same still.
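For reference, the raw tape can also be eyeballed without the package, though this pokes at ReverseDiff internals that may change between versions:

```julia
using Distributions, ReverseDiff

f(x) = sum(logpdf.(BernoulliLogit.(x), true))
tape = ReverseDiff.GradientTape(f, randn(10))

# `tape.tape` is the internal InstructionTape (a Vector of instructions);
# printing the instruction types gives a rough picture of what was recorded.
foreach(instr -> println(typeof(instr)), tape.tape)
```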
Think I found a clue: with [email protected], `ReverseDiffAD{false}` is just as fast as `ReverseDiffAD{true}`, i.e. compilation doesn't help for some reason!
While on [email protected] there's a bit of a slow-down, from ~1.4 ms to ~1.9 ms.
Probably is a type-instability somewhere then?
Profiling it, it becomes clear that `ReverseDiff.special_forward_exec!(inst)` is the issue, where `inst` is the `∇broadcast` instruction that differs between the two implementations.
The reverse pass (`ReverseDiff.ReverseExecutor`) takes up almost nothing of the 17 ms runtime.
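(The profile was gathered roughly along these lines; `∂ℓ` and `θ` are the AD-wrapped log-density and a parameter vector, as in the benchmarks below:)

```julia
using Profile

# Sketch of the profiling setup; assumes `∂ℓ` and `θ` are in scope.
Profile.clear()
@profile for _ in 1:1_000
    LogDensityProblems.logdensity_and_gradient(∂ℓ, θ)
end
Profile.print(mincount = 50)  # only show hot frames
```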
Changing the model to:
```julia
@model function irt(y, i, p; I = maximum(i), P = maximum(p))
    theta ~ filldist(Normal(), P)
    beta ~ filldist(Normal(), I)
    tmp = BernoulliLogit.(theta[p] - beta[i])
    Turing.@addlogprob! sum(logpdf.(tmp, y))
    return (; theta, beta)
end
```
to avoid nested broadcasting, we get
```julia
julia> @benchmark $(LogDensityProblems.logdensity_and_gradient)($∂ℓ, $θ)
BenchmarkTools.Trial: 1323 samples with 1 evaluation.
 Range (min … max):  3.588 ms …   5.117 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.741 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.773 ms ± 166.545 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
```
when compiling, which is muuuuch better than the 17ms from before.
Now, without compilation:
```julia
julia> @benchmark $(LogDensityProblems.logdensity_and_gradient)($∂ℓ, $θ)
BenchmarkTools.Trial: 453 samples with 1 evaluation.
 Range (min … max):   8.556 ms … 39.326 ms  ┊ GC (min … max):  0.00% … 71.77%
 Time  (median):      9.103 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   11.044 ms ±  6.140 ms  ┊ GC (mean ± σ):  13.86% ± 16.86%
```
which is also better.
Finally, using
```julia
logpdf_bernoulli_logit(logitp, x) = x == 0 ? StatsFuns.logistic(-logitp) : StatsFuns.logistic(logitp)
logpdf_bernoulli_logit(logitp, x::Bool) = x ? StatsFuns.logistic(logitp) : StatsFuns.logistic(-logitp)

@model function irt(y, i, p; I = maximum(i), P = maximum(p))
    theta ~ filldist(Normal(), P)
    beta ~ filldist(Normal(), I)
    Turing.@addlogprob! sum(logpdf_bernoulli_logit.(theta[p] - beta[i], y))
    return (; theta, beta)
end
```
we get
```julia
julia> suite = TuringBenchmarking.make_turing_suite(
           model,
           adbackends = [TuringBenchmarking.ForwardDiffAD{40}(), TuringBenchmarking.ReverseDiffAD{true}()]
       );

julia> run(suite)
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "linked" => 3-element BenchmarkTools.BenchmarkGroup:
      tags: []
      "evaluation" => Trial(256.576 μs)
      "Turing.Essential.ReverseDiffAD{true}()" => Trial(457.752 μs)
      "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(126.160 ms)
  "not_linked" => 3-element BenchmarkTools.BenchmarkGroup:
      tags: []
      "evaluation" => Trial(256.936 μs)
      "Turing.Essential.ReverseDiffAD{true}()" => Trial(457.365 μs)
      "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(126.680 ms)
```
which is significantly better (it's also 3x the speed of Stan).
It is really annoying that logpdf broadcasting is costing this much though :confused:
The logpdf in your last comment is wrong though, isn't it? At least it doesn't match the one discussed above.
Uhm yes :facepalm: I mixed up the logpdf and pdf impls. With the correct one we're only at:
```julia
julia> run(suite)
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "linked" => 3-element BenchmarkTools.BenchmarkGroup:
      tags: []
      "evaluation" => Trial(549.488 μs)
      "Turing.Essential.ReverseDiffAD{true}()" => Trial(3.246 ms)
      "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(133.501 ms)
  "not_linked" => 3-element BenchmarkTools.BenchmarkGroup:
      tags: []
      "evaluation" => Trial(550.383 μs)
      "Turing.Essential.ReverseDiffAD{true}()" => Trial(3.834 ms)
      "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(134.578 ms)
```
Replacing LogExpFunctions.log1pexp with a version without all the conditional statements, i.e. just log1p(exp(x)), results in
```julia
julia> run(suite)
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "linked" => 3-element BenchmarkTools.BenchmarkGroup:
      tags: []
      "evaluation" => Trial(492.951 μs)
      "Turing.Essential.ReverseDiffAD{true}()" => Trial(2.993 ms)
      "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(136.426 ms)
  "not_linked" => 3-element BenchmarkTools.BenchmarkGroup:
      tags: []
      "evaluation" => Trial(496.187 μs)
      "Turing.Essential.ReverseDiffAD{true}()" => Trial(4.443 ms)
      "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(135.160 ms)
```
so a slight improvement, but not sufficient.
EDIT: Seems to have been a fluke; doesn't seem to actually matter.
https://github.com/TuringLang/Turing.jl/issues/1934#issuecomment-1377574540 doesn't matter either?
Nah :confused:
```julia
julia> run(suite)
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "linked" => 4-element BenchmarkTools.BenchmarkGroup:
      tags: []
      "evaluation" => Trial(375.343 μs)
      "Turing.Essential.ReverseDiffAD{true}()" => Trial(4.449 ms)
      "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(125.817 ms)
      "Turing.Essential.ZygoteAD()" => Trial(33.744 ms)
  "not_linked" => 4-element BenchmarkTools.BenchmarkGroup:
      tags: []
      "evaluation" => Trial(378.344 μs)
      "Turing.Essential.ReverseDiffAD{true}()" => Trial(3.554 ms)
      "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(120.604 ms)
      "Turing.Essential.ZygoteAD()" => Trial(34.406 ms)
```
Okay, I think I've found the issue, or at least the explanation.
In the one with [email protected] the broadcast results in a SIMD loop (one way to check for this is sketched after the list), which makes it so that:
- Forward pass is much faster.
- Reverse pass is dirt cheap.
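One way to check for the SIMD loop is to scan the emitted LLVM IR for `vector.body` blocks; a sketch, with a stand-in for the broadcast kernel:

```julia
using Distributions, InteractiveUtils

# Stand-in kernel; if the broadcast vectorizes, the IR contains a
# `vector.body` block.
f(theta) = sum(logpdf.(BernoulliLogit.(theta), true))
@code_llvm debuginfo = :none f(randn(100))
```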
So I decided the only logical thing to do is to add more broadcasting in the hopes that this, for some reason, would trigger similar things after the BernoulliLogit change, i.e. I did this:
```julia
julia> logpdf_bernoulli_logit(logitp, x) = -log1pexp(x == 0 ? logitp : -logitp)
logpdf_bernoulli_logit (generic function with 2 methods)

julia> logpdf_bernoulli_logit(logitp, x::Bool) = -log1pexp(x ? -logitp : logitp)
logpdf_bernoulli_logit (generic function with 2 methods)

julia> # performant model
       @model function irt(y, i, p; I = maximum(i), P = maximum(p))
           theta ~ filldist(Normal(), P)
           beta ~ filldist(Normal(), I)
           Turing.@addlogprob! sum(logpdf_bernoulli_logit.(theta[p] .- beta[i], y))
           return (; theta, beta)
       end
irt (generic function with 2 methods)

julia> # Instantiate
       model = irt(y, i, p);

julia> suite = TuringBenchmarking.make_turing_suite(
           model,
           adbackends = [TuringBenchmarking.ForwardDiffAD{40}(), TuringBenchmarking.ReverseDiffAD{true}(), TuringBenchmarking.ZygoteAD()]
       );

julia> run(suite)
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "linked" => 4-element BenchmarkTools.BenchmarkGroup:
      tags: []
      "evaluation" => Trial(380.079 μs)
      "Turing.Essential.ReverseDiffAD{true}()" => Trial(840.097 μs)
      "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(100.826 ms)
      "Turing.Essential.ZygoteAD()" => Trial(33.692 ms)
  "not_linked" => 4-element BenchmarkTools.BenchmarkGroup:
      tags: []
      "evaluation" => Trial(379.018 μs)
      "Turing.Essential.ReverseDiffAD{true}()" => Trial(833.076 μs)
      "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(100.951 ms)
      "Turing.Essential.ZygoteAD()" => Trial(33.791 ms)

julia> # less performant model
       @model function irt(y, i, p; I = maximum(i), P = maximum(p))
           theta ~ filldist(Normal(), P)
           beta ~ filldist(Normal(), I)
           Turing.@addlogprob! sum(logpdf_bernoulli_logit.(theta[p] - beta[i], y)) # don't broadcast `-`
           return (; theta, beta)
       end
irt (generic function with 2 methods)

julia> # Instantiate
       model = irt(y, i, p);

julia> suite = TuringBenchmarking.make_turing_suite(
           model,
           adbackends = [TuringBenchmarking.ForwardDiffAD{40}(), TuringBenchmarking.ReverseDiffAD{true}(), TuringBenchmarking.ZygoteAD()]
       );

julia> run(suite)
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "linked" => 4-element BenchmarkTools.BenchmarkGroup:
      tags: []
      "evaluation" => Trial(380.220 μs)
      "Turing.Essential.ReverseDiffAD{true}()" => Trial(3.054 ms)
      "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(124.792 ms)
      "Turing.Essential.ZygoteAD()" => Trial(33.673 ms)
  "not_linked" => 4-element BenchmarkTools.BenchmarkGroup:
      tags: []
      "evaluation" => Trial(380.720 μs)
      "Turing.Essential.ReverseDiffAD{true}()" => Trial(3.575 ms)
      "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(124.956 ms)
      "Turing.Essential.ZygoteAD()" => Trial(33.465 ms)
```
i.e. not broadcasting over the Distribution but replacing it with a simpler method and broadcasting over `-` results in an implementation even faster than the original one on [email protected].
Which is even more annoying! So just the right amount of broadcasting leads to great performance, but if you do too little or too much, you're screwed.
Yeah, broadcasting performance issues and gotchas are about the worst... Countless hours that went into these things in the SciML ecosystem as well.
But are there any "guidelines" or just general advice on how to:
- identify these, and
- fix them?

:confused:
The usual workflow I experienced was that someone notices performance issues, starts debugging, and finally notices that it's broadcasting-related (e.g., there was (is?) also a limit on the number of broadcasting operations beyond which performance completely degrades; I'll try to dig up the relevant issues).
> replacing it with a simpler method
BTW that's what DistributionsAD does for many univariate distributions with flatten: https://github.com/TuringLang/DistributionsAD.jl/blob/master/src/flatten.jl. It's used mainly (maybe only) in filldist, but it might be useful more generally, even though in principle it ideally would not be needed.
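To illustrate the idea (hypothetical names, not the actual DistributionsAD code): `flatten` essentially maps a distribution type to a function of the raw parameters plus the observation, so the broadcast runs over plain numbers instead of over `Distribution` objects:

```julia
using Distributions

# Sketch of the flattened-logpdf idea: broadcasting over plain numbers;
# the Distribution is only constructed inside the scalar kernel.
flat_bernoullilogit_logpdf = (logitp, x) -> logpdf(BernoulliLogit(logitp), x)

logitps = randn(100)
y = rand(Bool, 100)
sum(flat_bernoullilogit_logpdf.(logitps, y))
```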
> BTW that's what DistributionsAD does for many univariate distributions with flatten
Woah, I was completely unaware of this! And yeah this might be very helpful.
That indeed does wonders:
```julia
julia> # Using `DistributionsAD.flatten` to address performance.
       # (`StructArrays` is needed below for `StructArray`/`StructVector`.)
       using Distributions, DistributionsAD, StructArrays

julia> """
           get_logpdf_expr(Tdist)

       Return a flattened method for computing the logpdf of `Tdist`.
       """
       function get_logpdf_expr(Tdist)
           x = gensym()
           fnames = fieldnames(Tdist)
           func = Expr(:->,
               Expr(:tuple, fnames..., x),
               Expr(:block,
                   Expr(:call, :logpdf,
                       Expr(:call, :($Tdist), fnames...),
                       x,
                   )
               )
           )
           return :(flatten(::Type{<:$Tdist}) = $func)
       end
get_logpdf_expr

julia> # 1. Use `flatten` to extract a, well, flattened `logpdf`.
       eval(get_logpdf_expr(BernoulliLogit))
flatten (generic function with 1 method)

julia> # 2. [OPTIONAL] Use `StructArrays.jl` to avoid the initial call to the constructor entirely.
       # 3. Define a "fast" logpdf method.
       @generated function fast_logpdf(
           dist::Product{V,D,<:StructVector{<:Any,<:NamedTuple{names}}},
           x::AbstractArray
       ) where {V,D<:UnivariateDistribution,names}
           # Get the flattened logpdf for the element distribution.
           f = flatten(D)
           args = [:(dist.v.$n) for n in names]
           return :(sum($f.($(args...), x)))
       end
fast_logpdf (generic function with 2 methods)

julia> # 4. Convenience methods for constructing a `StructArray`-backed `arraydist` without building the `StructArray` manually.
       function DistributionsAD.arraydist(::Type{D}, args...) where {D<:Distribution}
           return DistributionsAD.arraydist(D, NamedTuple{fieldnames(D)}(args))
       end

julia> DistributionsAD.arraydist(::Type{D}; args...) where {D<:Distribution} = DistributionsAD.arraydist(D, NamedTuple(args))

julia> function DistributionsAD.arraydist(::Type{D}, args::NamedTuple) where {D<:Distribution}
           # TODO: Use `purename`?
           return DistributionsAD.arraydist(StructArray{D}(args))
       end

julia> # 5. Type-piracy so we can make use of `~`.
       function Distributions.logpdf(dist::Product{<:Any,<:UnivariateDistribution,<:StructVector}, x::AbstractVector{<:Real})
           return fast_logpdf(dist, x)
       end
```
```julia
julia> @model function irt_vroom(y, i, p; I = maximum(i), P = maximum(p))
           theta ~ filldist(Normal(), P)
           beta ~ filldist(Normal(), I)
           y ~ arraydist(BernoulliLogit, theta[p] - beta[i])
           return (; theta, beta)
       end
irt_vroom (generic function with 2 methods)

julia> model = irt_vroom(y, i, p);

julia> suite = TuringBenchmarking.make_turing_suite(
           model,
           adbackends = [TuringBenchmarking.ForwardDiffAD{40}(), TuringBenchmarking.ReverseDiffAD{true}()]
       );

julia> run(suite)
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "linked" => 3-element BenchmarkTools.BenchmarkGroup:
      tags: []
      "evaluation" => Trial(389.573 μs)
      "Turing.Essential.ReverseDiffAD{true}()" => Trial(747.912 μs)
      "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(127.035 ms)
  "not_linked" => 3-element BenchmarkTools.BenchmarkGroup:
      tags: []
      "evaluation" => Trial(391.116 μs)
      "Turing.Essential.ReverseDiffAD{true}()" => Trial(745.925 μs)
      "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(126.951 ms)
```
Note that using StructArray to avoid going through the initial constructor is necessary (unless there's another way of avoiding the constructor); otherwise we degrade back to ~3 ms as before.
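For completeness, a hypothetical end-to-end use of the fast model, using the global AD-backend API of the Turing version from this thread:

```julia
using Turing, ReverseDiff

Turing.setadbackend(:reversediff)
Turing.setrdcache(true)  # compile and cache the tape between gradient calls

chain = sample(irt_vroom(y, i, p), NUTS(0.65), 1_000)
```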