
Performance of high-order derivatives in Enzyme is slower than finite differences

Open Arpit-Babbar opened this issue 11 months ago • 3 comments

Thank you for adding the capability of computing high order derivatives with Enzyme in https://github.com/EnzymeAD/Enzyme.jl/pull/2161!

I benchmarked the performance of Enzyme against a finite difference method. For orders greater than 2, finite differences are faster: 1.7 times faster for order 3 and 2.4 times faster for order 4. I am sharing my benchmarking code for orders 3 and 4 in case it leads to any ideas for improvement. The mean time of the third-order finite difference method is 8 ns versus 14 ns for Enzyme; for fourth order, finite differences take 11 ns versus 27 ns for Enzyme.

Benchmarking code for third order
using BenchmarkTools: @benchmark
using StaticArrays
using Enzyme

# Flux function to be differentiated

@inline function flux(u)
    rho, rho_v1, rho_v2, rho_e = u
    gamma = 1.4
    v1 = rho_v1 / rho
    v2 = rho_v2 / rho
    p = (gamma - 1) * (rho_e - 0.5f0 * (rho_v1 * v1 + rho_v2 * v2))
    f1 = rho_v1
    f2 = rho_v1 * v1 + p
    f3 = rho_v1 * v2
    f4 = (rho_e + p) * v1
    return SVector(f1, f2, f3, f4)
end

# Third order finite difference derivative

function third_derivative_fd(u, du, ddu, dddu)
    factor = 0.5
    df = factor * (flux(u + 2.0 * du + 2.0 * ddu + 4.0/3.0 * dddu)
                   - 2.0 * flux(u + du + 0.5 * ddu + (1.0/6.0) * dddu)
                   + 2.0 * flux(u - du + 0.5 * ddu - (1.0/6.0) * dddu)
                   - flux(u - 2.0 * du + 2.0 * ddu - 4.0/3.0 * dddu))
    return df
end

# AD to compute derivatives

dg_ad(x, dx) = autodiff(Forward, flux, DuplicatedNoNeed(x, dx))[1]
ddg_ad(x, dx, ddx) = autodiff(Forward, dg_ad, DuplicatedNoNeed(x, dx),
                              DuplicatedNoNeed(dx, ddx))[1]
dddg_ad(x, dx, ddx, dddx) = autodiff(Forward, ddg_ad, DuplicatedNoNeed(x, dx),
                                    DuplicatedNoNeed(dx, ddx), DuplicatedNoNeed(ddx, dddx))[1]

# Random inputs

u = SVector(1.0, -0.1, 0.2, 2.0)
du, ddu, dddu, ddddu = (1e-3*SVector(rand(4)...) for _ in 1:4)

@info "Third derivative"
@info "FD"
display(@benchmark third_derivative_fd($u, $du, $ddu, $dddu))

@info "Enzyme"

display(@benchmark dddg_ad($u, $du, $ddu, $dddu))
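For reference, here is a hypothetical scalar sanity check (the function `f` and the helpers `d1f`/`d2f`/`d3f` below are illustrative, not part of the benchmark) of what the nested `autodiff` calls compute: the third total derivative of `f(x(t))` along the curve with `x'(0) = dx`, `x''(0) = ddx`, `x'''(0) = dddx`, matching the Faà di Bruno formula.

```julia
using Enzyme

f(x) = x^4  # hypothetical scalar test function

# Same nested forward-mode construction as above, specialized to a scalar.
d1f(x, dx) = autodiff(Forward, f, DuplicatedNoNeed(x, dx))[1]
d2f(x, dx, ddx) = autodiff(Forward, d1f, DuplicatedNoNeed(x, dx),
                           DuplicatedNoNeed(dx, ddx))[1]
d3f(x, dx, ddx, dddx) = autodiff(Forward, d2f, DuplicatedNoNeed(x, dx),
                                 DuplicatedNoNeed(dx, ddx),
                                 DuplicatedNoNeed(ddx, dddx))[1]

x, dx, ddx, dddx = 2.0, 1.0, 0.5, 0.25
# Faà di Bruno for the third derivative along the curve:
# f'''(x)*dx^3 + 3*f''(x)*dx*ddx + f'(x)*dddx, with f' = 4x^3, f'' = 12x^2, f''' = 24x.
expected = 24x * dx^3 + 3 * (12x^2) * dx * ddx + (4x^3) * dddx
@assert d3f(x, dx, ddx, dddx) ≈ expected
```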
Benchmarking code for fourth order
using BenchmarkTools: @benchmark
using StaticArrays
using Enzyme

# Flux function to be differentiated

@inline function flux(u)
    rho, rho_v1, rho_v2, rho_e = u
    gamma = 1.4
    v1 = rho_v1 / rho
    v2 = rho_v2 / rho
    p = (gamma - 1) * (rho_e - 0.5f0 * (rho_v1 * v1 + rho_v2 * v2))
    f1 = rho_v1
    f2 = rho_v1 * v1 + p
    f3 = rho_v1 * v2
    f4 = (rho_e + p) * v1
    return SVector(f1, f2, f3, f4)
end

# Fourth order finite difference derivative

function fourth_derivative_fd(u, du, ddu, dddu, ddddu)
    df = (
          flux(u + 2.0 * du + 2.0 * ddu + 4.0/3.0 * dddu + 2.0/3.0 * ddddu)
         - 4.0 * flux(u + du + 0.5 * ddu + 1.0/6.0 * dddu + 1.0/24.0 * ddddu)
         + 6.0 * flux(u)
         - 4.0 * flux(u - du + 0.5 * ddu - 1.0/6.0 * dddu + 1.0/24.0 * ddddu)
         + flux(u - 2.0 * du + 2.0 * ddu - 4.0/3.0 * dddu + 2.0/3.0 * ddddu)
         )
    return df
end


# AD to compute derivatives

dg_ad(x, dx) = autodiff(Forward, flux, DuplicatedNoNeed(x, dx))[1]
ddg_ad(x, dx, ddx) = autodiff(Forward, dg_ad, DuplicatedNoNeed(x, dx),
                              DuplicatedNoNeed(dx, ddx))[1]
dddg_ad(x, dx, ddx, dddx) = autodiff(Forward, ddg_ad, DuplicatedNoNeed(x, dx),
                                    DuplicatedNoNeed(dx, ddx), DuplicatedNoNeed(ddx, dddx))[1]
ddddg_ad(x, dx, ddx, dddx, ddddx) = autodiff(Forward, dddg_ad, DuplicatedNoNeed(x, dx),
                                             DuplicatedNoNeed(dx, ddx),
                                             DuplicatedNoNeed(ddx, dddx),
                                             DuplicatedNoNeed(dddx, ddddx))[1]

# Random inputs

u = SVector(1.0, -0.1, 0.2, 2.0)
du, ddu, dddu, ddddu = (1e-3*SVector(rand(4)...) for _ in 1:4)

@info "Fourth derivative"
@info "FD"
display(@benchmark fourth_derivative_fd($u, $du, $ddu, $dddu, $ddddu))

@info "Enzyme"

display(@benchmark ddddg_ad($u, $du, $ddu, $dddu, $ddddu))
Benchmarking results for third order
[ Info: Third derivative
[ Info: FD
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.924 ns … 20.604 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.008 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.140 ns ±  0.710 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██▅▃▃▂     ▃                                            ▁▁ ▂
  ██████▇▆▆▅▆█▆▆▆▅▃█▇▄▅▅▄▅▄▅▄▃▅▅▁▅▄▄▆▄▃▅▄▃▃▁▄▅▅▃▄▃▄▄▄▅▆▅▆▇██ █
  7.92 ns      Histogram: log(frequency) by time     11.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
[ Info: Enzyme
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  13.639 ns … 26.652 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.722 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.037 ns ±  0.980 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅▃ ▂     ▁▅▂▁                                              ▁
  ███▇█▇▆▆▃▆████▇▅▄▅▇▅▅▆▆▅▄▅▄▁▄▄▅▅▅▄▆▄▆▃▅▃▄▅▄▅▅▆▅▆▄▅▅▄▅▄▅▅▄▆█ █
  13.6 ns      Histogram: log(frequency) by time      19.1 ns <

Benchmarking results for fourth order
[ Info: Fourth derivative
[ Info: FD
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  10.927 ns … 83.625 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.304 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.644 ns ±  2.999 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅▃█▂    █▆▃▃▂▁▁▁ ▁▁▁▂▁▁▁▂▁▂▄▂▁▁                             ▂
  ███████████████████████████████▅▅▄▄▅▄▅▅▄▅▄▄▄▄▄▄▅▃▄▁▄▁▄▆▅▅▆▅ █
  10.9 ns      Histogram: log(frequency) by time      20.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
[ Info: Enzyme
BenchmarkTools.Trial: 10000 samples with 996 evaluations.
 Range (min … max):  26.230 ns … 43.257 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     26.397 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.774 ns ±  1.646 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆  ▁                                                       ▁
  ███▆█▇▆▅▅▆▆▆▆▆▅▅▄▅▆▇▇▆▅▆▅▅▅▅▆▅▅▄▄▄▄▅▅▅▄▅▅▄▆▅▅▆▅▄▄▅▅▅▄▅▄▅▃▄▅ █
  26.2 ns      Histogram: log(frequency) by time      36.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were generated with Enzyme v0.13.24 on an Apple M3 Pro running Julia 1.11.2.

versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin24.0.0)
  CPU: 12 × Apple M3 Pro
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 1 default, 0 interactive, 1 GC (on 6 virtual cores)

Here is a gist that checks that the above third- and fourth-order derivative computations are correct using a polynomial test case. The gist contains computation and benchmarking of all derivatives up to order four.
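A minimal self-contained sketch of such a check (illustrative, not the gist itself; `g`, `d3_fd`, and `d1`/`d2`/`d3` are made-up names): with a componentwise quadratic map and `dddu = 0`, the curve `t ↦ g(u + t*du + t^2/2*ddu)` is a polynomial of degree at most 4 in `t`, for which the five-point stencil in `third_derivative_fd` is exact, so the FD and nested forward-AD results should agree to rounding.

```julia
using StaticArrays, Enzyme

# Made-up componentwise quadratic test map (stands in for `flux`).
g(u) = u .* u

# Same five-point stencil as `third_derivative_fd`, applied to g.
function d3_fd(u, du, ddu, dddu)
    0.5 * (g(u + 2.0 * du + 2.0 * ddu + 4.0/3.0 * dddu)
           - 2.0 * g(u + du + 0.5 * ddu + (1.0/6.0) * dddu)
           + 2.0 * g(u - du + 0.5 * ddu - (1.0/6.0) * dddu)
           - g(u - 2.0 * du + 2.0 * ddu - 4.0/3.0 * dddu))
end

# Same nested forward-mode construction as in the issue.
d1(x, dx) = autodiff(Forward, g, DuplicatedNoNeed(x, dx))[1]
d2(x, dx, ddx) = autodiff(Forward, d1, DuplicatedNoNeed(x, dx),
                          DuplicatedNoNeed(dx, ddx))[1]
d3(x, dx, ddx, dddx) = autodiff(Forward, d2, DuplicatedNoNeed(x, dx),
                                DuplicatedNoNeed(dx, ddx),
                                DuplicatedNoNeed(ddx, dddx))[1]

u = SVector(1.0, -0.1, 0.2, 2.0)
du = SVector(0.3, 0.1, -0.2, 0.4)
ddu = SVector(0.2, -0.1, 0.3, 0.1)
dddu = zero(u)

# For g(u) = u.^2 the analytic third derivative along the curve is 6 .* du .* ddu.
@assert d3_fd(u, du, ddu, dddu) ≈ 6 .* du .* ddu
@assert d3(u, du, ddu, dddu) ≈ 6 .* du .* ddu
```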

Arpit-Babbar avatar Jan 06 '25 08:01 Arpit-Babbar

You are on an Apple M3/M4?

On my system there is an overhead, but it is much smaller.

[ Info: FD

BenchmarkTools.Trial: 10000 samples with 996 evaluations.
 Range (min … max):  25.681 ns … 155.876 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     25.882 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.268 ns ±   4.361 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█▃                                                          ▁
  ███▇▅▃▄▅▅▅▄▄▁▃▃▄▃▄▄▄▁▁▃▃▄▃▄▄▃▁▃▄▄▅▄▅▃▄▄▄▅▅▄▃▅▅▄▅▁▃▄▃▃▄▅▅▆▆▇▇ █
  25.7 ns       Histogram: log(frequency) by time      34.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

[ Info: Enzyme

BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  29.251 ns … 213.054 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     29.805 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   32.373 ns ±   6.794 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄▃▃▄▃▃▃▃▁ ▁▃▂▂▂▂▂▂▂▂▂▁        ▁▁        ▁                   ▁
  ███████████████████████▇█▆▇▇▆▅████▆▇▇▇▄▇███▇▄█▄▇▅▂▂▃▅▄▃▄▂▄▄▄ █
  29.3 ns       Histogram: log(frequency) by time      49.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

vchuravy avatar Jan 06 '25 09:01 vchuravy

For the fourth derivative I start to see the overhead grow:

[ Info: Fourth derivative

[ Info: FD

BenchmarkTools.Trial: 10000 samples with 991 evaluations.
 Range (min … max):  28.631 ns … 120.559 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     29.551 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   30.569 ns ±   2.216 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂  ▃▃█▅ ▃▃▁▃▃▁▃▁▃▃ ▃▂▁▃▁▂▂ ▂▂▂ ▃▂▁▁                          ▂
  ██████████████████████████████▇████▆▅▅▅▅▅▅▃▄▅▅▅▅▅▄▄▁▅▃▅▃▅▄▅█ █
  28.6 ns       Histogram: log(frequency) by time      38.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

[ Info: Enzyme

BenchmarkTools.Trial: 10000 samples with 982 evaluations.
 Range (min … max):  53.267 ns … 92.200 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     54.267 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   54.501 ns ±  1.301 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        █▄▂▃                                                   
  ▂▂▂▂▃▆████▅▄▃▂▂▃▄▂▂▃▁▁▁▂▁▂▂▂▂▂▂▂▁▁▁▁▁▁▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  53.3 ns         Histogram: frequency by time        60.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

vchuravy avatar Jan 06 '25 09:01 vchuravy

You are on an Apple M3/M4?

On my system there is an overhead, but it is much smaller.

I am using Apple M3 Pro. I have updated my first post with that information, along with code results for the fourth order derivative.

Arpit-Babbar avatar Jan 06 '25 10:01 Arpit-Babbar