High-order derivatives in Enzyme are slower than finite differences
Thank you for adding the capability of computing high order derivatives with Enzyme in https://github.com/EnzymeAD/Enzyme.jl/pull/2161!
I benchmarked Enzyme against a finite difference method. For orders greater than 2, finite differences are faster: about 1.7 times faster for order 3 and about 2.1 times faster for order 4 when comparing mean times (2.4 times when comparing minimum times). I am sharing my benchmarking code for orders 3 and 4 in case it leads to any ideas for improvement. The mean time of the third-order finite difference method is about 8 ns, while that of Enzyme is about 14 ns. For the fourth order, finite difference is at about 13 ns and Enzyme is at about 27 ns.
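The slowdown factors can be recomputed from the BenchmarkTools output further down in this post. The snippet below is only a small sketch of that arithmetic; the timings are copied by hand from the results for the Apple M3 Pro, and the helper name `slowdown` is made up for illustration.

```julia
# Timings in nanoseconds, copied from the BenchmarkTools output in this post.
mean_ns = (fd3 = 8.140, ad3 = 14.037, fd4 = 12.644, ad4 = 26.774)
min_ns  = (fd3 = 7.924, ad3 = 13.639, fd4 = 10.927, ad4 = 26.230)

# Hypothetical helper: ratio of Enzyme time to finite difference time.
slowdown(t, order) = order == 3 ? t.ad3 / t.fd3 : t.ad4 / t.fd4

for order in (3, 4)
    println("order ", order,
            ": mean ratio = ", round(slowdown(mean_ns, order); digits = 2),
            ", min ratio = ", round(slowdown(min_ns, order); digits = 2))
end
```

Comparing both the mean and the minimum makes it visible when the two statistics give different ratios, as they do for order 4.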
Benchmarking code for third order
using BenchmarkTools: @benchmark
using StaticArrays
using Enzyme
# Flux function to be differentiated
@inline function flux(u)
    rho, rho_v1, rho_v2, rho_e = u
    gamma = 1.4
    v1 = rho_v1 / rho
    v2 = rho_v2 / rho
    p = (gamma - 1) * (rho_e - 0.5f0 * (rho_v1 * v1 + rho_v2 * v2))
    f1 = rho_v1
    f2 = rho_v1 * v1 + p
    f3 = rho_v1 * v2
    f4 = (rho_e + p) * v1
    return SVector(f1, f2, f3, f4)
end
# Third order finite difference derivative
function third_derivative_fd(u, du, ddu, dddu)
    factor = 0.5
    df = factor * (flux(u + 2.0 * du + 2.0 * ddu + 4.0/3.0 * dddu)
                   - 2.0 * flux(u + du + 0.5 * ddu + (1.0/6.0) * dddu)
                   + 2.0 * flux(u - du + 0.5 * ddu - (1.0/6.0) * dddu)
                   - flux(u - 2.0 * du + 2.0 * ddu - 4.0/3.0 * dddu))
    return df
end
# AD to compute derivatives
dg_ad(x, dx) = autodiff(Forward, flux, DuplicatedNoNeed(x, dx))[1]
ddg_ad(x, dx, ddx) = autodiff(Forward, dg_ad, DuplicatedNoNeed(x, dx),
                              DuplicatedNoNeed(dx, ddx))[1]
dddg_ad(x, dx, ddx, dddx) = autodiff(Forward, ddg_ad, DuplicatedNoNeed(x, dx),
                                     DuplicatedNoNeed(dx, ddx),
                                     DuplicatedNoNeed(ddx, dddx))[1]
# Random inputs
u = SVector(1.0, -0.1, 0.2, 2.0)
du, ddu, dddu, ddddu = (1e-3*SVector(rand(4)...) for _ in 1:4)
@info "Third derivative"
@info "FD"
display(@benchmark third_derivative_fd($u, $du, $ddu, $dddu))
@info "Enzyme"
display(@benchmark dddg_ad($u, $du, $ddu, $dddu))
Benchmarking code for fourth order
using BenchmarkTools: @benchmark
using StaticArrays
using Enzyme
# Flux function to be differentiated
@inline function flux(u)
    rho, rho_v1, rho_v2, rho_e = u
    gamma = 1.4
    v1 = rho_v1 / rho
    v2 = rho_v2 / rho
    p = (gamma - 1) * (rho_e - 0.5f0 * (rho_v1 * v1 + rho_v2 * v2))
    f1 = rho_v1
    f2 = rho_v1 * v1 + p
    f3 = rho_v1 * v2
    f4 = (rho_e + p) * v1
    return SVector(f1, f2, f3, f4)
end
# Fourth order finite difference derivative
function fourth_derivative_fd(u, du, ddu, dddu, ddddu)
    df = (
        flux(u + 2.0 * du + 2.0 * ddu + 4.0/3.0 * dddu + 2.0/3.0 * ddddu)
        - 4.0 * flux(u + du + 0.5 * ddu + 1.0/6.0 * dddu + 1.0/24.0 * ddddu)
        + 6.0 * flux(u)
        - 4.0 * flux(u - du + 0.5 * ddu - 1.0/6.0 * dddu + 1.0/24.0 * ddddu)
        + flux(u - 2.0 * du + 2.0 * ddu - 4.0/3.0 * dddu + 2.0/3.0 * ddddu)
    )
    return df
end
# AD to compute derivatives
dg_ad(x, dx) = autodiff(Forward, flux, DuplicatedNoNeed(x, dx))[1]
ddg_ad(x, dx, ddx) = autodiff(Forward, dg_ad, DuplicatedNoNeed(x, dx),
                              DuplicatedNoNeed(dx, ddx))[1]
dddg_ad(x, dx, ddx, dddx) = autodiff(Forward, ddg_ad, DuplicatedNoNeed(x, dx),
                                     DuplicatedNoNeed(dx, ddx),
                                     DuplicatedNoNeed(ddx, dddx))[1]
ddddg_ad(x, dx, ddx, dddx, ddddx) = autodiff(Forward, dddg_ad, DuplicatedNoNeed(x, dx),
                                             DuplicatedNoNeed(dx, ddx),
                                             DuplicatedNoNeed(ddx, dddx),
                                             DuplicatedNoNeed(dddx, ddddx))[1]
# Random inputs
u = SVector(1.0, -0.1, 0.2, 2.0)
du, ddu, dddu, ddddu = (1e-3*SVector(rand(4)...) for _ in 1:4)
@info "Fourth derivative"
@info "FD"
display(@benchmark fourth_derivative_fd($u, $du, $ddu, $dddu, $ddddu))
@info "Enzyme"
display(@benchmark ddddg_ad($u, $du, $ddu, $dddu, $ddddu))
Benchmarking results for third order
[ Info: Third derivative
[ Info: FD
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
Range (min … max): 7.924 ns … 20.604 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.008 ns ┊ GC (median): 0.00%
Time (mean ± σ): 8.140 ns ± 0.710 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
██▅▃▃▂ ▃ ▁▁ ▂
██████▇▆▆▅▆█▆▆▆▅▃█▇▄▅▅▄▅▄▅▄▃▅▅▁▅▄▄▆▄▃▅▄▃▃▁▄▅▅▃▄▃▄▄▄▅▆▅▆▇██ █
7.92 ns Histogram: log(frequency) by time 11.2 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
[ Info: Enzyme
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
Range (min … max): 13.639 ns … 26.652 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 13.722 ns ┊ GC (median): 0.00%
Time (mean ± σ): 14.037 ns ± 0.980 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▅▃ ▂ ▁▅▂▁ ▁
███▇█▇▆▆▃▆████▇▅▄▅▇▅▅▆▆▅▄▅▄▁▄▄▅▅▅▄▆▄▆▃▅▃▄▅▄▅▅▆▅▆▄▅▅▄▅▄▅▅▄▆█ █
13.6 ns Histogram: log(frequency) by time 19.1 ns <
Benchmarking results for fourth order
[ Info: Fourth derivative
[ Info: FD
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
Range (min … max): 10.927 ns … 83.625 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 12.304 ns ┊ GC (median): 0.00%
Time (mean ± σ): 12.644 ns ± 2.999 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▅▃█▂ █▆▃▃▂▁▁▁ ▁▁▁▂▁▁▁▂▁▂▄▂▁▁ ▂
███████████████████████████████▅▅▄▄▅▄▅▅▄▅▄▄▄▄▄▄▅▃▄▁▄▁▄▆▅▅▆▅ █
10.9 ns Histogram: log(frequency) by time 20.5 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
[ Info: Enzyme
BenchmarkTools.Trial: 10000 samples with 996 evaluations.
Range (min … max): 26.230 ns … 43.257 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 26.397 ns ┊ GC (median): 0.00%
Time (mean ± σ): 26.774 ns ± 1.646 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▆ ▁ ▁
███▆█▇▆▅▅▆▆▆▆▆▅▅▄▅▆▇▇▆▅▆▅▅▅▅▆▅▅▄▄▄▄▅▅▅▄▅▅▄▆▅▅▆▅▄▄▅▅▅▄▅▄▅▃▄▅ █
26.2 ns Histogram: log(frequency) by time 36.4 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
These results were generated with Enzyme v0.13.24 on an Apple M3 Pro with Julia 1.11.2.
versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin24.0.0)
CPU: 12 × Apple M3 Pro
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 1 default, 0 interactive, 1 GC (on 6 virtual cores)
Here is a gist that checks that the above third and fourth order derivative computations are correct by using a polynomial test case. The gist contains the computation and benchmarking of all derivatives up to fourth order.
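To illustrate the kind of polynomial check meant here, below is a minimal sketch that does not reproduce the gist. It applies the same four-point stencil as `third_derivative_fd` to a scalar cubic and compares it with the closed-form third derivative of t -> p(u + t*du + t^2/2*ddu + t^3/6*dddu) at t = 0. The names `p`, `third_derivative_fd_scalar` and `third_derivative_exact` are hypothetical, and the perturbation sizes are an assumption of the sketch.

```julia
# Hypothetical scalar test function (not taken from the gist).
p(x) = x^3

# Same four-point stencil as third_derivative_fd, applied to a scalar function f.
function third_derivative_fd_scalar(f, u, du, ddu, dddu)
    return 0.5 * (f(u + 2.0 * du + 2.0 * ddu + 4.0 / 3.0 * dddu) -
                  2.0 * f(u + du + 0.5 * ddu + dddu / 6.0) +
                  2.0 * f(u - du + 0.5 * ddu - dddu / 6.0) -
                  f(u - 2.0 * du + 2.0 * ddu - 4.0 / 3.0 * dddu))
end

# Exact value by the chain rule:
# p'''(u)*du^3 + 3*p''(u)*du*ddu + p'(u)*dddu, with p' = 3x^2, p'' = 6x, p''' = 6.
third_derivative_exact(u, du, ddu, dddu) =
    6.0 * du^3 + 18.0 * u * du * ddu + 3.0 * u^2 * dddu

# Perturbations scaled like h, h^2, h^3 with h = 1e-3 (an assumption of this sketch).
u, du, ddu, dddu = 1.0, 1e-3, 1e-6, 1e-9
fd = third_derivative_fd_scalar(p, u, du, ddu, dddu)
exact = third_derivative_exact(u, du, ddu, dddu)
println("FD = ", fd, ", exact = ", exact)
```

The stencil is not exact for this input, because the composed function is a polynomial of degree 9 in t, so the comparison should use a tolerance rather than equality. The same closed form is what the nested forward-mode calls are expected to return for a scalar cubic.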
You are on an Apple M3/M4?
On my system there is an overhead, but it is much smaller.
[ Info: FD
BenchmarkTools.Trial: 10000 samples with 996 evaluations.
Range (min … max): 25.681 ns … 155.876 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 25.882 ns ┊ GC (median): 0.00%
Time (mean ± σ): 26.268 ns ± 4.361 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃█▃ ▁
███▇▅▃▄▅▅▅▄▄▁▃▃▄▃▄▄▄▁▁▃▃▄▃▄▄▃▁▃▄▄▅▄▅▃▄▄▄▅▅▄▃▅▅▄▅▁▃▄▃▃▄▅▅▆▆▇▇ █
25.7 ns Histogram: log(frequency) by time 34.5 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
[ Info: Enzyme
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
Range (min … max): 29.251 ns … 213.054 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 29.805 ns ┊ GC (median): 0.00%
Time (mean ± σ): 32.373 ns ± 6.794 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▄▃▃▄▃▃▃▃▁ ▁▃▂▂▂▂▂▂▂▂▂▁ ▁▁ ▁ ▁
███████████████████████▇█▆▇▇▆▅████▆▇▇▇▄▇███▇▄█▄▇▅▂▂▃▅▄▃▄▂▄▄▄ █
29.3 ns Histogram: log(frequency) by time 49.1 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
For the fourth derivative I start to see the overhead grow:
[ Info: Fourth derivative
[ Info: FD
BenchmarkTools.Trial: 10000 samples with 991 evaluations.
Range (min … max): 28.631 ns … 120.559 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 29.551 ns ┊ GC (median): 0.00%
Time (mean ± σ): 30.569 ns ± 2.216 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ ▃▃█▅ ▃▃▁▃▃▁▃▁▃▃ ▃▂▁▃▁▂▂ ▂▂▂ ▃▂▁▁ ▂
██████████████████████████████▇████▆▅▅▅▅▅▅▃▄▅▅▅▅▅▄▄▁▅▃▅▃▅▄▅█ █
28.6 ns Histogram: log(frequency) by time 38.1 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
[ Info: Enzyme
BenchmarkTools.Trial: 10000 samples with 982 evaluations.
Range (min … max): 53.267 ns … 92.200 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 54.267 ns ┊ GC (median): 0.00%
Time (mean ± σ): 54.501 ns ± 1.301 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▄▂▃
▂▂▂▂▃▆████▅▄▃▂▂▃▄▂▂▃▁▁▁▂▁▂▂▂▂▂▂▂▁▁▁▁▁▁▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
53.3 ns Histogram: frequency by time 60.7 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
You are on an Apple M3/M4?
On my system there is an overhead, but it is much smaller.
I am using an Apple M3 Pro. I have updated my first post with that information, along with the code and results for the fourth order derivative.