ad Performance of `Numeric.AD.Mode.Reverse.Double`

I am using AD for gradient-based optimization and need better performance than I am currently getting. I noticed that some work has gone into improving the .Double specializations recently, so I did some experiments with the latest master (85aee3c). My setup is as follows:

{-# LANGUAGE BangPatterns #-}

import Criterion.Main
import Criterion.Types

import Numeric.AD.Mode.Forward
--import Numeric.AD.Mode.Forward.Double
--import Numeric.AD.Mode.Reverse
--import Numeric.AD.Mode.Reverse.Double

{-# INLINE poly #-}
poly :: Num a => a -> a
poly x = go (100000 :: Int) 0 where
    go 0 !a = a
    go n !a = go (n - 1) (a + x ^ n)

main :: IO ()
main = defaultMainWith config [beval, bdiff] where
    config = defaultConfig { regressions = [(["iters"], "allocated")] }
    p = 1.2 :: Double
    beval = bench "eval" $ whnf poly p
    bdiff = bench "diff" $ whnf (diff poly) p

I am using ghc 8.10.5 and llvm 12.0.1 and compiled with -O2 -fllvm. I also set the +ffi switch for the ad package.

I get the following results (full details):

Mode                              Evaluate (alloc)    Differentiate (alloc)
---------------------------------------------------------------------------
Numeric.AD.Mode.Forward           4.107 ms (16 B)     521.4 ms (766.1 MB)
Numeric.AD.Mode.Forward.Double    4.147 ms (16 B)     5.119 ms (88 B)
Numeric.AD.Mode.Reverse           4.638 ms (16 B)     1.168 s  (1.480 GB)
Numeric.AD.Mode.Reverse.Double    4.658 ms (16 B)     220.2 ms (770.2 MB)

Using NCG instead of LLVM, the results are similar, with slightly longer execution times. I am not sure why regular evaluation times also change with different modes.

I am very happy with Numeric.AD.Mode.Forward.Double, as it causes barely any overhead over regular evaluation.

While Numeric.AD.Mode.Reverse.Double is significantly faster than its generic counterpart, its 50x slowdown is still a long shot from the promise of "automatic differentiation typically only decreases performance by a small multiplier". In particular, it allocates a lot of intermediate memory. Since the reverse mode tape is implemented in C via FFI (which I presume is not counted by haskell's GC), I suspect that the 770MB that are allocated indicate that there is still some boxing going on.

Since I am doing gradient-based optimization, I would like to use reverse mode. Am I doing something wrong here? Is there something that can be done to bring its performance more in line with how Numeric.AD.Mode.Forward.Double behaves? Or is this simply a consequence of the additional complexity and bookkeeping of reverse mode ad that just cannot be avoided and is only justified by its better performance for gradients of high dimensionality?

Aug 27 '21 13:08 julmb

Good question! I think the punchline you’ll see is, as you say, the comparisons of forward vs reverse mode will switch when you consider R^n-> R for large enough n. Your example is R-> R, which is where forward mode will shine, as it will also on R-> R^n too algorithmically

Aug 27 '21 17:08 cartazio

I looked into this performance problem a bit. Here's slightly modified benchmark from top post. I used instruction counting. It's easy to measure and almost deterministic and I believe would make decent proxy in this case

{-# LANGUAGE BangPatterns        #-}
{-# LANGUAGE ImportQualifiedPost #-}
import Test.Tasty.PAPI


import Numeric.AD.Mode.Forward        qualified as Fwd
import Numeric.AD.Mode.Forward.Double qualified as FwdD
import Numeric.AD.Mode.Reverse        qualified as Rev
import Numeric.AD.Mode.Reverse.Double qualified as RevD

{-# INLINE poly #-}
poly :: Floating a => a -> a
poly x = go (1000 :: Int) 0 where
    go 0 !a = a
    go n !a = go (n - 1) (a + x ^ n)

main :: IO ()
main = defaultMain
  [ bench "eval" $ whnf poly p
  , bench "Fwd"  $ whnf (Fwd.diff  poly) p
  , bench "FwdD" $ whnf (FwdD.diff poly) p
  , bench "Rev"  $ whnf (Rev.diff  poly) p
  , bench "RevD" $ whnf (RevD.diff poly) p
  ]
  where
    p = 1.02 :: Double

Here are benchmark results for different implementations:

|                    |   ALLOC |  TOT_INS | FP_INS |
|--------------------+---------+----------+--------|
| Plain evaluation   |      48 |   180131 |  12924 |
| Forward mode       | 4230952 | 17584117 |  49698 |
| Forward mode (D)   |      48 |   246745 |  48699 |
| Reverse mode       | 8416680 | 41848179 |  64631 |
|--------------------+---------+----------+--------|
| Reveverse mode (D) |         |          |        |
|--------------------+---------+----------+--------|
| Native tape        | 7970816 | 18637018 |  64625 |
| FFI tape           | 4351648 | 10520859 |  64624 |

What could we understand from this:

GHC is able to unbox everything for Forward.Double. It would be hard to get close.
C tape is ~33% faster than native
Difference between reverse mode with FFI tape and forward mode is 42× or ~800 INS/tape operation (12925 in total). Question is where do they come from?

Apr 10 '23 14:04 Shimuuar

ad ad copied to clipboard

Performance of `Numeric.AD.Mode.Reverse.Double`

ad
ad copied to clipboard