
Bad performance of `Array a` and `MutArray a` when used as `Ptr`

Open TheKK opened this issue 1 year ago • 5 comments

While investigating a write performance problem in my program, I found, somewhat by accident, that it is caused by `Array Word8`.

Here is a benchmark that reproduces it: https://gist.github.com/TheKK/251fc1fe24600165ce1b2db922f1ac2d

As a reference, I implemented `asPtrUnsafe` without the `MonadIO m` constraint. Below are the results from my laptop:

± cabal run -O2 bench:cdc -- +RTS -s -T -RTS --regress cycles:iters --regress mutatorCpuSeconds:iters --regress gcCpuSeconds:iters --regress allocated:iters
Up to date
benchmarking write/Array
time                 260.8 μs   (255.9 μs .. 266.8 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 257.1 μs   (255.1 μs .. 259.5 μs)
std dev              7.116 μs   (5.600 μs .. 10.51 μs)
cycles:              0.998 R²   (0.997 R² .. 0.999 R²)
  iters              572419.159 (562076.184 .. 585264.385)
  y                  -1883493.983 (-3370725.280 .. -642113.082)
mutatorCpuSeconds:   0.998 R²   (0.997 R² .. 0.999 R²)
  iters              2.571e-4   (2.521e-4 .. 2.635e-4)
  y                  -7.855e-4  (-1.474e-3 .. -1.975e-4)
gcCpuSeconds:        0.991 R²   (0.984 R² .. 0.995 R²)
  iters              9.898e-6   (9.618e-6 .. 1.014e-5)
  y                  -3.252e-5  (-7.490e-5 .. 1.191e-5)
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              693884.116 (693882.854 .. 693885.515)
  y                  2282.946   (1758.990 .. 2816.548)
variance introduced by outliers: 21% (moderately inflated)

benchmarking write/ByteString
time                 146.7 μs   (142.4 μs .. 152.9 μs)
                     0.992 R²   (0.988 R² .. 0.998 R²)
mean                 146.1 μs   (143.6 μs .. 149.5 μs)
std dev              9.761 μs   (7.391 μs .. 12.28 μs)
cycles:              0.992 R²   (0.988 R² .. 0.998 R²)
  iters              321951.335 (312722.537 .. 334991.437)
  y                  -1406715.340 (-3578381.007 .. 136008.990)
mutatorCpuSeconds:   0.993 R²   (0.989 R² .. 0.998 R²)
  iters              1.457e-4   (1.417e-4 .. 1.517e-4)
  y                  -5.871e-4  (-1.524e-3 .. 7.918e-5)
gcCpuSeconds:        0.910 R²   (0.880 R² .. 0.945 R²)
  iters              4.400e-6   (3.879e-6 .. 5.019e-6)
  y                  -6.004e-5  (-1.586e-4 .. 2.950e-5)
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              373790.274 (373789.508 .. 373791.086)
  y                  2574.756   (2152.690 .. 2998.360)
variance introduced by outliers: 65% (severely inflated)

benchmarking asPtr/Array
time                 118.8 ns   (116.9 ns .. 120.8 ns)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 118.0 ns   (116.6 ns .. 119.2 ns)
std dev              4.218 ns   (3.329 ns .. 5.294 ns)
cycles:              0.999 R²   (0.998 R² .. 0.999 R²)
  iters              260.828    (256.445 .. 265.217)
  y                  -225886.094 (-556091.679 .. 75566.247)
mutatorCpuSeconds:   0.999 R²   (0.997 R² .. 0.999 R²)
  iters              1.184e-7   (1.164e-7 .. 1.203e-7)
  y                  -8.632e-5  (-2.252e-4 .. 5.290e-5)
gcCpuSeconds:        0.991 R²   (0.986 R² .. 0.997 R²)
  iters              6.501e-10  (6.253e-10 .. 6.798e-10)
  y                  -1.155e-6  (-3.425e-6 .. 8.469e-7)
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              312.000    (312.000 .. 312.001)
  y                  2983.914   (2796.086 .. 3182.479)
variance introduced by outliers: 55% (severely inflated)

benchmarking asPtr/Array'
time                 10.91 ns   (10.83 ns .. 10.98 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 10.85 ns   (10.72 ns .. 10.94 ns)
std dev              346.9 ps   (245.6 ps .. 444.4 ps)
cycles:              1.000 R²   (0.999 R² .. 1.000 R²)
  iters              23.951     (23.792 .. 24.105)
  y                  -94395.670 (-226486.084 .. 24880.470)
mutatorCpuSeconds:   1.000 R²   (0.999 R² .. 1.000 R²)
  iters              1.090e-8   (1.082e-8 .. 1.097e-8)
  y                  -3.215e-5  (-8.781e-5 .. 2.575e-5)
gcCpuSeconds:        0.993 R²   (0.989 R² .. 0.997 R²)
  iters              3.418e-11  (3.331e-11 .. 3.534e-11)
  y                  -5.555e-7  (-1.316e-6 .. 1.362e-7)
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              16.000     (16.000 .. 16.000)
  y                  2989.826   (2810.229 .. 3178.715)
variance introduced by outliers: 53% (severely inflated)

benchmarking asPtr/ByteString
time                 10.16 ns   (10.07 ns .. 10.26 ns)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 10.10 ns   (10.01 ns .. 10.21 ns)
std dev              330.1 ps   (251.7 ps .. 417.1 ps)
cycles:              0.999 R²   (0.998 R² .. 0.999 R²)
  iters              22.290     (22.099 .. 22.558)
  y                  65680.770  (-130834.465 .. 281810.287)
mutatorCpuSeconds:   0.999 R²   (0.998 R² .. 0.999 R²)
  iters              1.016e-8   (1.007e-8 .. 1.028e-8)
  y                  3.956e-5   (-5.135e-5 .. 1.400e-4)
gcCpuSeconds:        NaN R²     (NaN R² .. NaN R²)
  iters              0.000      (0.000 .. 0.000)
  y                  0.000      (0.000 .. 0.000)
allocated:           0.000 R²   (0.000 R² .. 0.017 R²)
  iters              3.631e-7   (-4.365e-5 .. 4.944e-5)
  y                  2991.437   (2807.870 .. 3177.496)
variance introduced by outliers: 55% (severely inflated)
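
For readers without the gist at hand, the difference between a `MonadIO`-polymorphic pointer accessor and an `IO`-only reference variant (presumably what the `asPtr/Array'` case measures) can be pictured roughly as below. This is only a sketch under assumptions: `Buf`, `newBuf`, `withPtrPoly`, `withPtrIO` and `roundTrip` are hypothetical names built on `base`, not streamly's types or the code from the gist.

```haskell
import Control.Monad.IO.Class (MonadIO, liftIO)
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtrBytes, touchForeignPtr)
import Foreign.ForeignPtr.Unsafe (unsafeForeignPtrToPtr)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekByteOff, pokeByteOff)

-- Hypothetical stand-in for a pinned byte array; NOT streamly's Array type.
data Buf = Buf !(ForeignPtr Word8) !Int

newBuf :: Int -> IO Buf
newBuf n = do
  fp <- mallocForeignPtrBytes n
  return (Buf fp n)

-- Polymorphic variant: unless GHC inlines or specializes it, every call
-- receives the MonadIO dictionary as an extra argument.
withPtrPoly :: MonadIO m => Buf -> (Ptr Word8 -> m b) -> m b
withPtrPoly (Buf fp _) f = do
  r <- f (unsafeForeignPtrToPtr fp)
  liftIO (touchForeignPtr fp) -- keep the buffer alive until f has finished
  return r

-- Monomorphic reference variant: fixed to IO, so no dictionary is involved.
withPtrIO :: Buf -> (Ptr Word8 -> IO b) -> IO b
withPtrIO (Buf fp _) f = do
  r <- f (unsafeForeignPtrToPtr fp)
  touchForeignPtr fp
  return r
{-# INLINE withPtrIO #-}

-- Usage: write a byte through one variant, read it back through the other.
roundTrip :: Word8 -> IO Word8
roundTrip w = do
  buf <- newBuf 16
  withPtrPoly buf (\p -> pokeByteOff p 0 w)
  withPtrIO buf (\p -> peekByteOff p 0)
```

Both variants do the same work; the only intended difference is whether the monad is fixed at the definition site.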

Marking `arr_asPtrUnsafe` and `ma_asPtrUnsafe` as NOINLINE slows the "asPtr/Array'" benchmark down to 15 ns, which is still much faster than "asPtr/Array" at 118 ns.

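Looking at the Core, it is clear that `Array.asPtrUnsafe` is not inlined: it is called through its worker-wrapper worker, `$wasPtrUnsafe`, with the `MonadIO IO` dictionary passed as an argument (see the lines marked `<== THIS`), which means one level of indirection.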

Main.$wf1
  = \ (ww_sa1F
         :: ghc-prim:GHC.Prim.MutableByteArray# ghc-prim:GHC.Prim.RealWorld)
      (ww1_sa1G :: ghc-prim:GHC.Prim.Int#) ->
      Streamly.Internal.Data.Array.Mut.Type.$wasPtrUnsafe     <== THIS
        @IO
        @Word8
        @()
        Control.Monad.IO.Class.$fMonadIOIO                       <== THIS
        ww_sa1F
        ww1_sa1G
        (Main.main25
         `cast` (<Ptr Word8>_R
                 %<'Many>_N ->_R Sym (ghc-prim:GHC.Types.N:IO[0] <()>_R)
                 :: (Ptr Word8
                     -> ghc-prim:GHC.Prim.State# ghc-prim:GHC.Prim.RealWorld
                     -> (# ghc-prim:GHC.Prim.State# ghc-prim:GHC.Prim.RealWorld, () #))
                    ~R# (Ptr Word8 -> IO ())))

So this looks like a real issue that affects every `Array` operation that goes through its `Ptr` interface.
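
One way to test whether the dictionary passing is really the cause would be to ask GHC for an `IO`-specialized copy of the polymorphic function and compare the Core again. The sketch below shows the pragmas on a hypothetical helper, `pokeThenRun`, which is not part of streamly. Whether the same pragmas would bring `asPtrUnsafe` down to the reference timings is an assumption that the benchmark above does not confirm.

```haskell
import Control.Monad.IO.Class (MonadIO, liftIO)
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peek, poke)

-- Hypothetical MonadIO-polymorphic helper, standing in for a library function
-- that hands a raw pointer to a callback.
pokeThenRun :: MonadIO m => Ptr Word8 -> Word8 -> (Ptr Word8 -> m b) -> m b
pokeThenRun p w f = liftIO (poke p w) >> f p
-- INLINABLE keeps the unfolding available so call sites in other modules
-- can be specialized.
{-# INLINABLE pokeThenRun #-}
-- SPECIALIZE asks GHC to generate a copy at m ~ IO that takes no dictionary.
{-# SPECIALIZE pokeThenRun :: Ptr Word8 -> Word8 -> (Ptr Word8 -> IO b) -> IO b #-}

-- Usage at IO: write one byte, then read it back in the callback.
demo :: IO Word8
demo = allocaBytes 1 (\p -> pokeThenRun p 7 peek)
```

Compiling with `-O2 -ddump-simpl` and searching the output for `$fMonadIOIO` at the call site shows whether the dictionary argument is still being passed.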

TheKK, Sep 19 '23 16:09