streamly
streamly copied to clipboard
Bad performance on `Array a` and `MutArray a` while using as `Ptr`
Since I observed write performance issue in my program, I found that it was caused by Array Word8
accidentally.
Here's the benchmark that could reproduce it. https://gist.github.com/TheKK/251fc1fe24600165ce1b2db922f1ac2d
I tried implementing asUnsafePtr
without MonadIO m
constraint as a reference. Below are the results from my laptop:
± cabal run -O2 bench:cdc -- +RTS -s -T -RTS --regress cycles:iters --regress mutatorCpuSeconds:iters --regress gcCpuSeconds:iters --regress allocated:iters
Up to date
benchmarking write/Array
time 260.8 μs (255.9 μs .. 266.8 μs)
0.998 R² (0.997 R² .. 0.999 R²)
mean 257.1 μs (255.1 μs .. 259.5 μs)
std dev 7.116 μs (5.600 μs .. 10.51 μs)
cycles: 0.998 R² (0.997 R² .. 0.999 R²)
iters 572419.159 (562076.184 .. 585264.385)
y -1883493.983 (-3370725.280 .. -642113.082)
mutatorCpuSeconds: 0.998 R² (0.997 R² .. 0.999 R²)
iters 2.571e-4 (2.521e-4 .. 2.635e-4)
y -7.855e-4 (-1.474e-3 .. -1.975e-4)
gcCpuSeconds: 0.991 R² (0.984 R² .. 0.995 R²)
iters 9.898e-6 (9.618e-6 .. 1.014e-5)
y -3.252e-5 (-7.490e-5 .. 1.191e-5)
allocated: 1.000 R² (1.000 R² .. 1.000 R²)
iters 693884.116 (693882.854 .. 693885.515)
y 2282.946 (1758.990 .. 2816.548)
variance introduced by outliers: 21% (moderately inflated)
benchmarking write/ByteString
time 146.7 μs (142.4 μs .. 152.9 μs)
0.992 R² (0.988 R² .. 0.998 R²)
mean 146.1 μs (143.6 μs .. 149.5 μs)
std dev 9.761 μs (7.391 μs .. 12.28 μs)
cycles: 0.992 R² (0.988 R² .. 0.998 R²)
iters 321951.335 (312722.537 .. 334991.437)
y -1406715.340 (-3578381.007 .. 136008.990)
mutatorCpuSeconds: 0.993 R² (0.989 R² .. 0.998 R²)
iters 1.457e-4 (1.417e-4 .. 1.517e-4)
y -5.871e-4 (-1.524e-3 .. 7.918e-5)
gcCpuSeconds: 0.910 R² (0.880 R² .. 0.945 R²)
iters 4.400e-6 (3.879e-6 .. 5.019e-6)
y -6.004e-5 (-1.586e-4 .. 2.950e-5)
allocated: 1.000 R² (1.000 R² .. 1.000 R²)
iters 373790.274 (373789.508 .. 373791.086)
y 2574.756 (2152.690 .. 2998.360)
variance introduced by outliers: 65% (severely inflated)
benchmarking asPtr/Array
time 118.8 ns (116.9 ns .. 120.8 ns)
0.999 R² (0.998 R² .. 0.999 R²)
mean 118.0 ns (116.6 ns .. 119.2 ns)
std dev 4.218 ns (3.329 ns .. 5.294 ns)
cycles: 0.999 R² (0.998 R² .. 0.999 R²)
iters 260.828 (256.445 .. 265.217)
y -225886.094 (-556091.679 .. 75566.247)
mutatorCpuSeconds: 0.999 R² (0.997 R² .. 0.999 R²)
iters 1.184e-7 (1.164e-7 .. 1.203e-7)
y -8.632e-5 (-2.252e-4 .. 5.290e-5)
gcCpuSeconds: 0.991 R² (0.986 R² .. 0.997 R²)
iters 6.501e-10 (6.253e-10 .. 6.798e-10)
y -1.155e-6 (-3.425e-6 .. 8.469e-7)
allocated: 1.000 R² (1.000 R² .. 1.000 R²)
iters 312.000 (312.000 .. 312.001)
y 2983.914 (2796.086 .. 3182.479)
variance introduced by outliers: 55% (severely inflated)
benchmarking asPtr/Array'
time 10.91 ns (10.83 ns .. 10.98 ns)
1.000 R² (0.999 R² .. 1.000 R²)
mean 10.85 ns (10.72 ns .. 10.94 ns)
std dev 346.9 ps (245.6 ps .. 444.4 ps)
cycles: 1.000 R² (0.999 R² .. 1.000 R²)
iters 23.951 (23.792 .. 24.105)
y -94395.670 (-226486.084 .. 24880.470)
mutatorCpuSeconds: 1.000 R² (0.999 R² .. 1.000 R²)
iters 1.090e-8 (1.082e-8 .. 1.097e-8)
y -3.215e-5 (-8.781e-5 .. 2.575e-5)
gcCpuSeconds: 0.993 R² (0.989 R² .. 0.997 R²)
iters 3.418e-11 (3.331e-11 .. 3.534e-11)
y -5.555e-7 (-1.316e-6 .. 1.362e-7)
allocated: 1.000 R² (1.000 R² .. 1.000 R²)
iters 16.000 (16.000 .. 16.000)
y 2989.826 (2810.229 .. 3178.715)
variance introduced by outliers: 53% (severely inflated)
benchmarking asPtr/ByteString
time 10.16 ns (10.07 ns .. 10.26 ns)
0.999 R² (0.998 R² .. 0.999 R²)
mean 10.10 ns (10.01 ns .. 10.21 ns)
std dev 330.1 ps (251.7 ps .. 417.1 ps)
cycles: 0.999 R² (0.998 R² .. 0.999 R²)
iters 22.290 (22.099 .. 22.558)
y 65680.770 (-130834.465 .. 281810.287)
mutatorCpuSeconds: 0.999 R² (0.998 R² .. 0.999 R²)
iters 1.016e-8 (1.007e-8 .. 1.028e-8)
y 3.956e-5 (-5.135e-5 .. 1.400e-4)
gcCpuSeconds: NaN R² (NaN R² .. NaN R²)
iters 0.000 (0.000 .. 0.000)
y 0.000 (0.000 .. 0.000)
allocated: 0.000 R² (0.000 R² .. 0.017 R²)
iters 3.631e-7 (-4.365e-5 .. 4.944e-5)
y 2991.437 (2807.870 .. 3177.496)
variance introduced by outliers: 55% (severely inflated)
Marking arr_asPtrUnsafe
and ma_asPtrUnsafe
as NOINLINE makes benchmark, "asPtr/Array", drop to 15 ns. Still pretty fast comparing to "asPtr/Array" which is 118 ns.
By looking at the CORE, it's obvious that Array.asPtrUnsafe
was called by worker wrapper which means one level of indirection.
Main.$wf1
= \ (ww_sa1F
:: ghc-prim:GHC.Prim.MutableByteArray# ghc-prim:GHC.Prim.RealWorld)
(ww1_sa1G :: ghc-prim:GHC.Prim.Int#) ->
Streamly.Internal.Data.Array.Mut.Type.$wasPtrUnsafe <== THIS
@IO
@Word8
@()
Control.Monad.IO.Class.$fMonadIOIO <== THIS
ww_sa1F
ww1_sa1G
(Main.main25
`cast` (<Ptr Word8>_R
%<'Many>_N ->_R Sym (ghc-prim:GHC.Types.N:IO[0] <()>_R)
:: (Ptr Word8
-> ghc-prim:GHC.Prim.State# ghc-prim:GHC.Prim.RealWorld
-> (# ghc-prim:GHC.Prim.State# ghc-prim:GHC.Prim.RealWorld, () #))
~R# (Ptr Word8 -> IO ())))
So this should be a real issue that affect all operations of Array
relates to its Ptr
interface.