Speed up test suite

Open tmcdonell opened this issue 6 years ago • 10 comments

The standard accelerate test suite, used by all the backends, can be quite slow. Several of the tests are significantly slower than the others, for example segmented folds and scans, which I believe is because the reference implementations are very inefficient. Writing some more efficient reference implementations (e.g. using Data.Vector.Unboxed) should help speed things up.
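
For example, a reference segmented fold over unboxed vectors could look something like this sketch (assuming the segment descriptor is a vector of segment lengths; the name foldSegRef is hypothetical):

import qualified Data.Vector.Unboxed as U

-- Sketch of a faster reference segmented fold: pair each segment
-- length with its start offset, then fold each slice of the input.
foldSegRef :: U.Unbox a => (a -> a -> a) -> a -> U.Vector a -> U.Vector Int -> U.Vector a
foldSegRef f z xs segs =
  U.map (\(offset, len) -> U.foldl' f z (U.slice offset len xs))
        (U.zip (U.prescanl' (+) 0 segs) segs)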

tmcdonell avatar Jan 15 '18 12:01 tmcdonell

The doctests take longer than the nofib tests on my machine. I think that is because it has to recompile all modules before it can run the doctests.

noughtmare avatar Nov 07 '19 18:11 noughtmare

Blackscholes is the test that takes the longest on my machine (about 2 seconds). It doesn't seem like it would benefit from using Data.Vector.Unboxed because it is basically only a single map over the array. Maybe the size of the randomly generated arrays can be reduced. Is it really necessary to generate arrays of up to 16384 elements? Reducing it to 1024 makes it run in about 0.1 seconds.
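
If a smaller maximum size is acceptable, the change could be as simple as tightening the size range of the generator. A hypothetical hedgehog snippet (the actual generator in the test suite may look different):

import Hedgehog (Gen)
import qualified Hedgehog.Gen as Gen
import qualified Hedgehog.Range as Range

-- Hypothetical: cap the generated array size at 1024 instead of 16384.
arraySize :: Gen Int
arraySize = Gen.int (Range.linear 0 1024)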

Otherwise, maybe the generation of the array can be sped up. Currently, it first generates a list and then converts that to an array (though perhaps some of this is fused away?).

noughtmare avatar Nov 07 '19 21:11 noughtmare

Generating large test arrays is important for the parallel backends, because they can use different implementations for larger arrays (for example the GPU backends need to use multiple thread blocks after a certain (hardware dependent) size).

It definitely could be a problem with the random number/array generation, we should look into that as well.

tmcdonell avatar Nov 08 '19 16:11 tmcdonell

See the big edit at the bottom.

So, I think we could generate random arrays much more quickly using generate, for example by taking the murmur hash of the linear index (toIndex applied to the shape passed to the generate function) combined with some random seed. Here's an implementation:

import qualified Prelude
import Data.Array.Accelerate
import Data.Array.Accelerate.Data.Bits
import Data.Function ((&))

-- Rotate a 64-bit word left by r bits.
rotl64 :: Exp Word64 -> Exp Int -> Exp Word64
rotl64 x r = x `unsafeShiftL` r .|. x `unsafeShiftR` (64 - r)

-- The 64-bit finalisation mix of MurmurHash3, which avalanches all
-- bits of the key.
fmix64 :: Exp Word64 -> Exp Word64
fmix64 key = key
  & (\x -> x `xor` (x `unsafeShiftR` 33))
  & (* 0xff51afd7ed558ccd)
  & (\x -> x `xor` (x `unsafeShiftR` 33))
  & (* 0xc4ceb9fe1a85ec53)
  & (\x -> x `xor` (x `unsafeShiftR` 33))

-- One MurmurHash3 body round on a single 64-bit key, followed by the
-- finaliser.
murmur :: Exp Word64 -> Exp Word64 -> Exp Word64
murmur seed key = key
  & (* 0x87c37b91114253d5)
  & (\x -> rotl64 x 31)
  & (* 0x4cf5ad432745937f)
  & xor seed
  & fmix64

-- Each element is the hash of its own linear index, salted with the
-- seed, so the whole array can be generated in parallel.
random :: Shape sh => Exp Word64 -> Exp sh -> Acc (Array sh Word64)
random seed sh = generate sh (murmur seed . fromIntegral . toIndex sh)

This only produces uniformly distributed pseudo-random Word64 values, but I think it should not be too difficult to convert those to other types and distributions.
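
For example, reusing the imports above, a uniform Double in [0,1) can be derived from the top 53 bits of the hash (a sketch; the name toUnitInterval is made up):

-- Map a uniform Word64 to a uniform Double in [0,1) by keeping the
-- top 53 bits, which is the precision of a Double mantissa.
toUnitInterval :: Exp Word64 -> Exp Double
toUnitInterval w = fromIntegral (w `unsafeShiftR` 11) / 9007199254740992  -- 2^53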

This is the source code of murmur that I adapted in the code above: https://github.com/aappleby/smhasher/blob/61a0530f28277f2e850bfc39600ce61d02b518de/src/MurmurHash3.cpp#L289-L331

Here the randomness of murmur is discussed: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed/145633#145633 (This analysis might not be completely accurate since I changed the murmur code slightly)

Another problem is that murmur is not intended for such short inputs, so it might be better to look for some other hash function, but I could not find any that were specifically intended for this purpose. Edit: This post might help, but I haven't read it completely yet: https://mostlymangling.blogspot.com/2018/07/on-mixing-functions-in-fast-splittable.html

Edit: I just realized that, of course, random array generation already exists for accelerate. I looked at mwc-random-accelerate (specifically: https://github.com/tmcdonell/mwc-random-accelerate/blob/e840871e2edbc583bc90230b1bb9d9452e89d3d6/Data/Array/Accelerate/System/Random/MWC.hs#L133). It seems that it uses a sequential generation method, which is probably much slower than the method I describe here, so there may still be some use for this approach.

Big edit: So, I have read the blog post about strong mixers. In his latest post, Pelle Evensen introduces a new mixing function called NASAM, which passes the extremely strict RNG test suite PractRand; mwc does not. Here is an implementation of the strongest, but slowest, variant:

import qualified Prelude
import Data.Array.Accelerate
import Data.Array.Accelerate.Data.Bits
import Data.Function ((&))

-- xNASAMx: the NASAM mixer with an extra xor of the constant c before
-- and after the mix.
xnasamx :: Exp Word64 -> Exp Word64 -> Exp Word64
xnasamx x c = x
  & xor c
  & (\x -> x `xor` rotateR x 25 `xor` rotateR x 47)
  & (* 0x9E6C63D0676A9A99)
  & (\x -> x `xor` (x `unsafeShiftR` 23) `xor` (x `unsafeShiftR` 51))
  & (* 0x9E6D62D06F6A9A9B)
  & (\x -> x `xor` (x `unsafeShiftR` 23) `xor` (x `unsafeShiftR` 51))
  & xor c

-- As before: mix each element's linear index with the seed.
random :: Shape sh => Exp Word64 -> Exp sh -> Acc (Array sh Word64)
random seed sh = generate sh ((\x -> xnasamx x seed) . fromIntegral . toIndex sh)

This should beat the mwc-random-accelerate package both in speed (for larger arrays) and in the quality of the produced random numbers. The blog post doesn't mention any specific criteria for the constant c. To ensure high-quality output, the seed could be generated with a cryptographically secure random number generator, but I think that might be overkill.

Edit: The website https://www.pcg-random.org contains a lot of very useful information about random number generators. PCG itself is not directly suitable for parallel random number generation, because its step function does more than simply increment the state. But it does apply a permutation to the state to produce the final output, similar to the mixing functions described here, so their website and paper contain more information about this topic.

noughtmare avatar Jun 19 '20 08:06 noughtmare

I have some benchmark results:

benchmarking xnasamx Native/1000
time                 49.86 μs   (47.14 μs .. 52.45 μs)
                     0.982 R²   (0.980 R² .. 0.986 R²)
mean                 49.14 μs   (47.46 μs .. 50.94 μs)
std dev              5.282 μs   (4.747 μs .. 5.673 μs)
variance introduced by outliers: 85% (severely inflated)

benchmarking xnasamx Native/10000
time                 98.84 μs   (97.27 μs .. 99.92 μs)
                     0.997 R²   (0.995 R² .. 0.998 R²)
mean                 93.58 μs   (90.89 μs .. 95.33 μs)
std dev              6.891 μs   (4.980 μs .. 8.659 μs)
variance introduced by outliers: 71% (severely inflated)

benchmarking xnasamx Native/100000
time                 217.7 μs   (215.3 μs .. 221.2 μs)
                     0.997 R²   (0.996 R² .. 0.998 R²)
mean                 236.6 μs   (232.4 μs .. 242.4 μs)
std dev              17.72 μs   (15.17 μs .. 20.12 μs)
variance introduced by outliers: 67% (severely inflated)

-----------------------------------------------------------

benchmarking xnasamx PTX/1000
time                 157.8 μs   (155.6 μs .. 159.5 μs)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 152.7 μs   (150.2 μs .. 154.4 μs)
std dev              6.266 μs   (5.181 μs .. 8.138 μs)
variance introduced by outliers: 40% (moderately inflated)

benchmarking xnasamx PTX/10000
time                 204.7 μs   (198.3 μs .. 212.4 μs)
                     0.992 R²   (0.987 R² .. 0.997 R²)
mean                 192.1 μs   (187.0 μs .. 197.3 μs)
std dev              16.46 μs   (13.68 μs .. 21.56 μs)
variance introduced by outliers: 74% (severely inflated)

benchmarking xnasamx PTX/100000
time                 426.1 μs   (407.5 μs .. 446.5 μs)
                     0.975 R²   (0.964 R² .. 0.983 R²)
mean                 454.9 μs   (431.9 μs .. 485.4 μs)
std dev              95.93 μs   (69.95 μs .. 128.1 μs)
variance introduced by outliers: 94% (severely inflated)

-----------------------------------------------------------

benchmarking mwc/1000
time                 892.0 μs   (879.9 μs .. 912.1 μs)
                     0.997 R²   (0.996 R² .. 0.998 R²)
mean                 915.9 μs   (905.6 μs .. 924.5 μs)
std dev              33.66 μs   (29.50 μs .. 38.03 μs)
variance introduced by outliers: 27% (moderately inflated)

benchmarking mwc/10000
time                 8.391 ms   (8.345 ms .. 8.441 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 8.402 ms   (8.364 ms .. 8.445 ms)
std dev              108.1 μs   (80.22 μs .. 137.1 μs)

benchmarking mwc/100000
time                 84.82 ms   (83.63 ms .. 86.17 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 83.55 ms   (82.74 ms .. 84.38 ms)
std dev              1.389 ms   (1.028 ms .. 1.941 ms)

I had to modify mwc-random-accelerate for it to work for accelerate version 1.4: https://github.com/tmcdonell/mwc-random-accelerate/pull/3

Here's the code:

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE TypeApplications #-}
module Main where

import Internal

import System.Random (randomIO)
import qualified Data.Array.Accelerate.LLVM.PTX as PTX
import qualified Data.Array.Accelerate.LLVM.Native as Native
import qualified Data.Array.Accelerate as A
import qualified Data.Array.Accelerate.System.Random.MWC as MWC
import Data.Word
import Criterion.Main

-- runQ embeds the compiled Accelerate program at Haskell compile
-- time, so the benchmark measures execution rather than compilation.
randomNative :: Word64 -> Int -> A.Array A.DIM1 Word64
randomNative seed n =
  $(let f :: A.Acc (A.Scalar Word64) -> A.Acc (A.Scalar A.DIM1) -> A.Acc (A.Array A.DIM1 Word64)
        f seed sh = A.compute (random (A.the seed) (A.the sh))
    in Native.runQ f)
    (A.fromList A.Z [seed])
    (A.fromList A.Z [(A.Z A.:. n)] :: A.Scalar A.DIM1)

randomPTX :: Word64 -> Int -> A.Array A.DIM1 Word64
randomPTX seed n =
  $(let f :: A.Acc (A.Scalar Word64) -> A.Acc (A.Scalar A.DIM1) -> A.Acc (A.Array A.DIM1 Word64)
        f seed sh = A.compute (random (A.the seed) (A.the sh))
    in PTX.runQ f)
    (A.fromList A.Z [seed])
    (A.fromList A.Z [(A.Z A.:. n)] :: A.Scalar A.DIM1)

-- Run the given benchmark at three input sizes.
bench10 :: (Int -> Benchmarkable) -> [Benchmark]
bench10 f =
    [ bench "1000"    $ f 1000
    , bench "10000"   $ f 10000
    , bench "100000"  $ f 100000
    ]

main :: IO ()
main = do
  seed <- randomIO
  defaultMain
    [ bgroup "xnasamx Native"
      $ bench10 (whnf ((\arr -> A.linearIndexArray arr 0) . randomNative seed))
    , bgroup "xnasamx PTX"
      $ bench10 (whnf ((\arr -> A.linearIndexArray arr 0) . randomPTX seed))
    , bgroup "mwc" 
      $ bench10 (whnfIO . fmap (\arr -> A.linearIndexArray arr 0) . MWC.randomArray (MWC.uniform @_ @Word64) . (A.Z A.:.))
    ]

Where Internal contains the above nasam mixer code.

By the way, how can I properly force array computation? I have now used linearIndexArray, but that might not be the best way. I would like to avoid another pass over the entire array.

Edit: The benchmark above mostly measures the time it takes to allocate memory. Running xnasamx in a tight strict loop in a single thread (see the sketch after these results) yields the following:

benchmarking xnasamxgo/1000
time                 307.6 ns   (307.0 ns .. 308.3 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 308.1 ns   (307.3 ns .. 308.8 ns)
std dev              2.314 ns   (1.939 ns .. 2.798 ns)

benchmarking xnasamxgo/10000
time                 2.975 μs   (2.965 μs .. 2.993 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.973 μs   (2.968 μs .. 2.984 μs)
std dev              26.83 ns   (16.03 ns .. 47.94 ns)

benchmarking xnasamxgo/100000
time                 29.60 μs   (29.54 μs .. 29.69 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 29.59 μs   (29.53 μs .. 29.66 μs)
std dev              225.6 ns   (184.8 ns .. 287.0 ns)
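
For reference, the tight loop was of roughly this shape (a reconstruction; the plain-Haskell port xnasamx' and the driver go are my sketch, not the exact benchmark code):

{-# LANGUAGE BangPatterns #-}

import Data.Bits (rotateR, unsafeShiftR, xor)
import Data.Word (Word64)

-- Plain-Haskell port of the xNASAMx mixer, for benchmarking outside
-- of Accelerate.
xnasamx' :: Word64 -> Word64 -> Word64
xnasamx' x0 c = finish (mix (x0 `xor` c)) `xor` c
  where
    mix x =
      let y = (x `xor` rotateR x 25 `xor` rotateR x 47) * 0x9E6C63D0676A9A99
      in (y `xor` (y `unsafeShiftR` 23) `xor` (y `unsafeShiftR` 51)) * 0x9E6D62D06F6A9A9B
    finish x = x `xor` (x `unsafeShiftR` 23) `xor` (x `unsafeShiftR` 51)

-- Strict single-threaded loop: mix every index with the seed and
-- combine the results so nothing is optimised away.
go :: Word64 -> Int -> Word64
go seed n = loop 0 0
  where
    loop !acc !i
      | i >= n    = acc
      | otherwise = loop (acc `xor` xnasamx' (fromIntegral i) seed) (i + 1)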

noughtmare avatar Jun 22 '20 10:06 noughtmare

Hi @noughtmare, sorry for the slow response.

This looks great! I wonder whether this will work together with hedgehog's notion of random number generation, that is, shrinking on failure. At any rate, it looks like this could be used as the basis for a fast random number generation package (as opposed to mwc-random-accelerate, where everything happens on the CPU).

tmcdonell avatar Jul 02 '20 12:07 tmcdonell

We might be able to use the recently announced doctest-extract package to improve the speed of the doctests.

About the hedgehog tests, yeah I've looked into it a little bit and it doesn't look very easy to integrate this way of generating random numbers into hedgehog. It also might be a little bit like a bootstrapping situation where accelerate is tested on arrays that are generated by accelerate code.

noughtmare avatar Aug 03 '20 17:08 noughtmare

I have pulled one of the PractRand generators out into the sfc-random-accelerate package.

tmcdonell avatar Sep 03 '20 08:09 tmcdonell

That looks great, probably even better than that nasam function. I didn't know PractRand included those generators.

noughtmare avatar Sep 10 '20 17:09 noughtmare

I just tried doctest-parallel and it seems to speed up the doctests from 45 seconds to 9 seconds on my machine (after cabal build has been run at least once). But I did encounter an issue with imports.

noughtmare avatar Dec 24 '21 16:12 noughtmare