ad
ad copied to clipboard
Performance of `Numeric.AD.Mode.Reverse.Double`
I am using AD for gradient-based optimization and need better performance than I am currently getting. I noticed that some work has gone into improving the .Double
specializations recently, so I did some experiments with the latest master (85aee3c). My setup is as follows:
{-# LANGUAGE BangPatterns #-}
import Criterion.Main
import Criterion.Types
import Numeric.AD.Mode.Forward
--import Numeric.AD.Mode.Forward.Double
--import Numeric.AD.Mode.Reverse
--import Numeric.AD.Mode.Reverse.Double
{-# INLINE poly #-}
poly :: Num a => a -> a
poly x = go (100000 :: Int) 0 where
go 0 !a = a
go n !a = go (n - 1) (a + x ^ n)
main :: IO ()
main = defaultMainWith config [beval, bdiff] where
config = defaultConfig { regressions = [(["iters"], "allocated")] }
p = 1.2 :: Double
beval = bench "eval" $ whnf poly p
bdiff = bench "diff" $ whnf (diff poly) p
I am using ghc 8.10.5 and llvm 12.0.1 and compiled with -O2 -fllvm
. I also set the +ffi
switch for the ad package.
I get the following results (full details):
Mode Evaluate (alloc) Differentiate (alloc)
---------------------------------------------------------------------------
Numeric.AD.Mode.Forward 4.107 ms (16 B) 521.4 ms (766.1 MB)
Numeric.AD.Mode.Forward.Double 4.147 ms (16 B) 5.119 ms (88 B)
Numeric.AD.Mode.Reverse 4.638 ms (16 B) 1.168 s (1.480 GB)
Numeric.AD.Mode.Reverse.Double 4.658 ms (16 B) 220.2 ms (770.2 MB)
Using NCG instead of LLVM, the results are similar, with slightly longer execution times. I am not sure why regular evaluation times also change with different modes.
I am very happy with Numeric.AD.Mode.Forward.Double
, as it causes barely any overhead over regular evaluation.
While Numeric.AD.Mode.Reverse.Double
is significantly faster than its generic counterpart, its 50x slowdown is still a long shot from the promise of "automatic differentiation typically only decreases performance by a small multiplier". In particular, it allocates a lot of intermediate memory. Since the reverse mode tape is implemented in C via FFI (which I presume is not counted by haskell's GC), I suspect that the 770MB that are allocated indicate that there is still some boxing going on.
Since I am doing gradient-based optimization, I would like to use reverse mode. Am I doing something wrong here? Is there something that can be done to bring its performance more in line with how Numeric.AD.Mode.Forward.Double
behaves? Or is this simply a consequence of the additional complexity and bookkeeping of reverse mode ad that just cannot be avoided and is only justified by its better performance for gradients of high dimensionality?
Good question! I think the punchline you’ll see is, as you say, the comparisons of forward vs reverse mode will switch when you consider R^n-> R for large enough n. Your example is R-> R, which is where forward mode will shine, as it will also on R-> R^n too algorithmically
I looked into this performance problem a bit. Here's slightly modified benchmark from top post. I used instruction counting. It's easy to measure and almost deterministic and I believe would make decent proxy in this case
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE ImportQualifiedPost #-}
import Test.Tasty.PAPI
import Numeric.AD.Mode.Forward qualified as Fwd
import Numeric.AD.Mode.Forward.Double qualified as FwdD
import Numeric.AD.Mode.Reverse qualified as Rev
import Numeric.AD.Mode.Reverse.Double qualified as RevD
{-# INLINE poly #-}
poly :: Floating a => a -> a
poly x = go (1000 :: Int) 0 where
go 0 !a = a
go n !a = go (n - 1) (a + x ^ n)
main :: IO ()
main = defaultMain
[ bench "eval" $ whnf poly p
, bench "Fwd" $ whnf (Fwd.diff poly) p
, bench "FwdD" $ whnf (FwdD.diff poly) p
, bench "Rev" $ whnf (Rev.diff poly) p
, bench "RevD" $ whnf (RevD.diff poly) p
]
where
p = 1.02 :: Double
Here are benchmark results for different implementations:
| | ALLOC | TOT_INS | FP_INS |
|--------------------+---------+----------+--------|
| Plain evaluation | 48 | 180131 | 12924 |
| Forward mode | 4230952 | 17584117 | 49698 |
| Forward mode (D) | 48 | 246745 | 48699 |
| Reverse mode | 8416680 | 41848179 | 64631 |
|--------------------+---------+----------+--------|
| Reveverse mode (D) | | | |
|--------------------+---------+----------+--------|
| Native tape | 7970816 | 18637018 | 64625 |
| FFI tape | 4351648 | 10520859 | 64624 |
What could we understand from this:
- GHC is able to unbox everything for Forward.Double. It would be hard to get close.
- C tape is ~33% faster than native
- Difference between reverse mode with FFI tape and forward mode is 42× or ~800 INS/tape operation (12925 in total). Question is where do they come from?